Enabling Collaborative Data Science Development with the Ballet Framework

Published: 18 October 2021


While the open-source software development model has led to successful large-scale collaborations in building software systems, data science projects are frequently developed by individuals or small teams. We describe challenges to scaling data science collaborations and present a conceptual framework and ML programming model to address them. We instantiate these ideas in Ballet, the first lightweight framework for collaborative, open-source data science through a focus on feature engineering, and an accompanying cloud-based development environment. Using our framework, collaborators incrementally propose feature definitions to a repository; each definition is subjected to software and ML performance validation and can be automatically merged into an executable feature engineering pipeline. We leverage Ballet to conduct a case study analysis of an income prediction problem with 27 collaborators, and discuss implications for future designers of collaborative projects.
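
The propose, validate, and merge loop described in the abstract can be illustrated with a minimal sketch. This is not Ballet's actual API: the `FeatureDefinition` class, the `validate_and_merge` function, the `min_bits` threshold, and the toy income data are all hypothetical names and values chosen for illustration. The ML check here is a simple plug-in mutual information estimate against the target, which is only a stand-in for whatever validation the framework really performs.

```python
from collections import Counter
from dataclasses import dataclass
from math import log2
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]


@dataclass
class FeatureDefinition:
    """Hypothetical stand-in for a proposed feature: a name plus a row-wise transform."""
    name: str
    transform: Callable[[Row], int]


def mutual_information(xs: Sequence[int], ys: Sequence[int]) -> float:
    """Plug-in estimate of I(X; Y) in bits for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def validate_and_merge(
    pipeline: List[FeatureDefinition],
    candidate: FeatureDefinition,
    rows: Sequence[Row],
    target: Sequence[int],
    min_bits: float = 0.05,  # assumed acceptance threshold, not from the paper
) -> bool:
    """Accept the candidate into the pipeline only if it passes both checks."""
    # Software validation: the transform must run on every row.
    try:
        values = [candidate.transform(row) for row in rows]
    except Exception:
        return False
    if len(values) != len(target) or any(v is None for v in values):
        return False
    # ML performance validation: the feature must carry information about the target.
    if mutual_information(values, target) < min_bits:
        return False
    pipeline.append(candidate)  # "automatic merge" into the shared pipeline
    return True


# Toy income-prediction data (invented for this example).
rows = [
    {"age": 25, "hours": 20}, {"age": 45, "hours": 50}, {"age": 38, "hours": 45},
    {"age": 22, "hours": 30}, {"age": 51, "hours": 60}, {"age": 29, "hours": 25},
]
target = [0, 1, 1, 0, 1, 0]  # 1 = high income

pipeline: List[FeatureDefinition] = []
proposals = [
    FeatureDefinition("works_full_time", lambda r: int(r["hours"] >= 40)),
    FeatureDefinition("constant", lambda r: 1),                   # uninformative
    FeatureDefinition("broken", lambda r: int(r["salary"] > 0)),  # missing column
]
results = {p.name: validate_and_merge(pipeline, p, rows, target) for p in proposals}
print(results)
print([f.name for f in pipeline])
```

In this sketch only the informative, working feature is merged; the constant feature fails the information check and the broken one fails the software check. A real deployment would likely also compare each candidate against features already accepted, which this example omits.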

These are supplementary materials for the paper, "Enabling Collaborative Data Science Development with the Ballet Framework," by Micah J. Smith, Jürgen Cito, Kelvin Lu, and Kalyan Veeramachaneni.


Proceedings of the ACM on Human-Computer Interaction, Volume 5, Issue CSCW2
October 2021
5376 pages

  collaborative framework
  data science
  feature definition
  feature engineering
  feature validation
  machine learning
  mutual information
  streaming feature selection


