research-article

Data Lifecycle Challenges in Production Machine Learning: A Survey

Authors: Neoklis Polyzotis, Steven Euijong Whang, Martin Zinkevich

ACM SIGMOD Record, Volume 47, Issue 2

Pages 17 - 28

https://doi.org/10.1145/3299887.3299891

Published: 11 December 2018

Abstract

Machine learning has become an essential tool for gleaning knowledge from data and tackling a diverse set of computationally hard tasks. However, the accuracy of a machine-learned model is deeply tied to the data it is trained on. Designing and building robust processes and tools that make it easier to analyze, validate, and transform the data fed into large-scale machine learning systems poses data management challenges. Drawing on our experience developing data-centric infrastructure for a production machine learning platform at Google, we summarize some of the interesting research challenges that we encountered and survey relevant literature from the data management and machine learning communities. Specifically, we explore challenges in three main areas: data understanding, data validation and cleaning, and data preparation. In each area, we explore how the constraints on solutions vary depending on where in the model lifecycle the problems arise and who encounters them.





Published In

ACM SIGMOD Record Volume 47, Issue 2

June 2018

68 pages

ISSN:0163-5808

DOI:10.1145/3299887

Editors:
Yanlei Diao, University of Massachusetts Amherst
Vanessa Braganholo, Universidade Federal Fluminense
Marco Brambilla, Politecnico di Milano
Chee Yong Chan, National University of Singapore
Rada Chirkova, North Carolina State University
Zackary Ives, University of Pennsylvania
Anastasios Kementsietsidis, Google Research
Jeffrey Naughton, University of Wisconsin-Madison
Frank Neven, Hasselt University
Olga Papaemmanouil, Brandeis University
Aditya Parameswaran, University of Illinois
Alkis Simitsis, HP Labs
Wang-Chiew Tan, University of California Santa Cruz
Pınar Tözün, IBM Almaden Research Center
Marianne Winslett, University of Illinois
Jun Yang, Duke University

Copyright © 2018 Authors.

Publisher

Association for Computing Machinery

New York, NY, United States
