DOI: 10.1145/3394486.3403379 · Research article · KDD '20 Conference Proceedings

An Empirical Analysis of Backward Compatibility in Machine Learning Systems

Published: 20 August 2020

Abstract

In many applications of machine learning (ML), updates are performed with the goal of enhancing model performance. However, current practices for updating models rely solely on isolated, aggregate performance analyses, overlooking important dependencies, expectations, and needs in real-world deployments. We consider how updates, intended to improve ML models, can introduce new errors that significantly affect downstream systems and users. For example, updates in models used in cloud-based classification services, such as image recognition, can cause unexpected erroneous behavior in systems that make calls to the services. Prior work has shown the importance of "backward compatibility" for maintaining human trust. We study challenges with backward compatibility across different ML architectures and datasets, focusing on common settings including data shifts with structured noise and ML employed in inferential pipelines. Our results show that (i) compatibility issues arise even without data shift due to optimization stochasticity, (ii) training on large-scale noisy datasets often results in significant decreases in backward compatibility even when model accuracy increases, and (iii) distributions of incompatible points align with noise bias, motivating the need for compatibility-aware de-noising and robustness methods.
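One common formalization of the backward compatibility studied in prior work (e.g., Bansal et al., AAAI 2019) is the fraction of examples the old model classified correctly that the updated model still classifies correctly. A minimal sketch of that score; the function name and toy labels below are illustrative, not taken from the paper:

```python
def backward_trust_compatibility(y_true, old_pred, new_pred):
    """Fraction of examples the old model got right that the updated
    model also gets right (1.0 = fully backward compatible)."""
    # Indices the old model classified correctly.
    old_right = [i for i, (t, o) in enumerate(zip(y_true, old_pred)) if o == t]
    if not old_right:
        return 1.0  # vacuously compatible: old model was never right
    # Of those, indices the updated model still classifies correctly.
    still_right = [i for i in old_right if new_pred[i] == y_true[i]]
    return len(still_right) / len(old_right)

# The updated model is more accurate overall (4/5 vs. 3/5) but breaks
# one example the old model handled, so the score falls below 1.
y_true   = [0, 1, 1, 0, 1]
old_pred = [0, 1, 1, 1, 0]   # accuracy 3/5
new_pred = [0, 1, 0, 0, 1]   # accuracy 4/5
print(round(backward_trust_compatibility(y_true, old_pred, new_pred), 3))  # prints 0.667
```

The example illustrates the abstract's central point: aggregate accuracy can improve while compatibility degrades, because the two metrics count different sets of examples.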


Published In

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 2020, 3664 pages
ISBN: 9781450379984
DOI: 10.1145/3394486

Publisher

Association for Computing Machinery, New York, NY, United States

      Author Tags

      1. backward compatibility
      2. machine learning
      3. reliability
      4. responsible data science

Conference

KDD '20
Overall Acceptance Rate: 1,133 of 8,635 submissions (13%)
