research-article
Open access

Enabling Collaborative Data Science Development with the Ballet Framework

Published: 18 October 2021

Abstract

While the open-source software development model has led to successful large-scale collaborations in building software systems, data science projects are frequently developed by individuals or small teams. We describe challenges to scaling data science collaborations and present a conceptual framework and ML programming model to address them. We instantiate these ideas in Ballet, the first lightweight framework for collaborative, open-source data science through a focus on feature engineering, and an accompanying cloud-based development environment. Using our framework, collaborators incrementally propose feature definitions to a repository which are each subjected to software and ML performance validation and can be automatically merged into an executable feature engineering pipeline. We leverage Ballet to conduct a case study analysis of an income prediction problem with 27 collaborators, and discuss implications for future designers of collaborative projects.
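The workflow described in the abstract (propose a feature definition, validate it, merge it into a pipeline) can be illustrated with a toy sketch. This is not Ballet's API: `Feature`, `passes_software_checks`, `propose`, `run_pipeline`, the `min_mi` threshold, and the sample data are all hypothetical names and values chosen for illustration. The ML check uses a simple plug-in mutual information estimate for discrete values, and as a simplification it scores each feature on its own and ignores redundancy with features that were already accepted.

```python
from collections import Counter
from dataclasses import dataclass
from math import log
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]


@dataclass(frozen=True)
class Feature:
    """Hypothetical feature definition: a named function from one row to one value."""
    name: str
    transform: Callable[[Row], object]


def passes_software_checks(feature: Feature, rows: Sequence[Row]) -> bool:
    """Stand-in for software validation: runs on every row and yields no missing values."""
    try:
        values = [feature.transform(row) for row in rows]
    except Exception:
        return False
    return all(value is not None for value in values)


def mutual_information(xs: Sequence[object], ys: Sequence[object]) -> float:
    """Plug-in estimate of I(X; Y) in nats for discrete values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def propose(feature: Feature, accepted: List[Feature], rows: Sequence[Row],
            target: Sequence[object], min_mi: float = 0.05) -> bool:
    """Accept the feature only if it passes both checks (min_mi is an assumed threshold)."""
    if not passes_software_checks(feature, rows):
        return False
    values = [feature.transform(row) for row in rows]
    if mutual_information(values, target) < min_mi:
        return False
    accepted.append(feature)
    return True


def run_pipeline(accepted: Sequence[Feature], rows: Sequence[Row]) -> List[List[object]]:
    """Apply every accepted feature to every row, producing a feature matrix."""
    return [[feature.transform(row) for feature in accepted] for row in rows]


# Made-up data: four people, with a binary "high income" target.
ROWS = [
    {"age": 25, "hours": 20},
    {"age": 30, "hours": 45},
    {"age": 45, "hours": 50},
    {"age": 50, "hours": 10},
]
TARGET = [0, 1, 1, 0]

accepted: List[Feature] = []
results = {
    "full_time": propose(Feature("full_time", lambda r: r["hours"] >= 40), accepted, ROWS, TARGET),
    "constant": propose(Feature("constant", lambda r: 1), accepted, ROWS, TARGET),
    "broken": propose(Feature("broken", lambda r: r["missing"]), accepted, ROWS, TARGET),
}
matrix = run_pipeline(accepted, ROWS)
```

In this sketch only `full_time` is accepted: `constant` carries no information about the target, and `broken` raises an error on the first row. A real project would run such checks in continuous integration on each proposed contribution rather than in a single script.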

Supplementary Material

ZIP File (v5cscw431aux.zip)
These are supplementary materials for the paper, "Enabling Collaborative Data Science Development with the Ballet Framework," by Micah J. Smith, Jürgen Cito, Kelvin Lu, and Kalyan Veeramachaneni.

Information

Published In

Proceedings of the ACM on Human-Computer Interaction, Volume 5, Issue CSCW2
October 2021
5376 pages
EISSN: 2573-0142
DOI: 10.1145/3493286
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. collaborative framework
  2. data science
  3. feature definition
  4. feature engineering
  5. feature validation
  6. machine learning
  7. mutual information
  8. streaming feature selection

Article Metrics

  • Downloads (last 12 months): 217
  • Downloads (last 6 weeks): 20

Reflects downloads up to 25 Jan 2025.

Cited By

  • (2025) A Process Model for AI-Enabled Software Development. Journal of Software: Evolution and Process 37:1. DOI: 10.1002/smr.2743. Online publication date: 22-Jan-2025.
  • (2024) Towards Feature Engineering with Human and AI's Knowledge: Understanding Data Science Practitioners' Perceptions in Human&AI-Assisted Feature Engineering Design. Proceedings of the 2024 ACM Designing Interactive Systems Conference (1789-1804). DOI: 10.1145/3643834.3661517. Online publication date: 1-Jul-2024.
  • (2023) A Systematic Analysis of Problems in Open Collaborative Data Engineering. ACM Transactions on Social Computing 6:3-4 (1-30). DOI: 10.1145/3629040. Online publication date: 9-Dec-2023.
  • (2023) Computer-Mediated Sharing Circles for Intersectional Peer Support with Home Care Workers. Proceedings of the ACM on Human-Computer Interaction 7:CSCW1 (1-35). DOI: 10.1145/3579472. Online publication date: 16-Apr-2023.
  • (2023) Social Contextualization of Datasets for Mental Health AI: a Review of Gender-linked Sociotechnical Misalignments. 2023 IEEE 11th International Conference on Healthcare Informatics (ICHI) (439-445). DOI: 10.1109/ICHI57859.2023.00063. Online publication date: 26-Jun-2023.
  • (2023) Examining a social-based system with personalized recommendations to promote mental health for college students. Smart Health 28 (100385). DOI: 10.1016/j.smhl.2023.100385. Online publication date: Jun-2023.
