Data Lifecycle Challenges in Production Machine Learning: A Survey

Published: 11 December 2018
    Abstract

    Machine learning has become an essential tool for gleaning knowledge from data and tackling a diverse set of computationally hard tasks. However, the accuracy of a machine-learned model is deeply tied to the data on which it is trained. Designing and building robust processes and tools that make it easier to analyze, validate, and transform the data fed into large-scale machine learning systems poses data management challenges. Drawing on our experience developing data-centric infrastructure for a production machine learning platform at Google, we summarize some of the interesting research challenges that we encountered and survey relevant literature from the data management and machine learning communities. Specifically, we explore challenges in three main areas: data understanding, data validation and cleaning, and data preparation. In each area, we explore how the constraints imposed on solutions depend on where in the lifecycle of a model the problems are encountered and who encounters them.

      Published In

      ACM SIGMOD Record  Volume 47, Issue 2
      June 2018
      68 pages
      ISSN: 0163-5808
      DOI: 10.1145/3299887

      Publisher

      Association for Computing Machinery

      New York, NY, United States
