DOI: 10.1145/3394486.3403379 · Research article · KDD '20 Conference Proceedings

An Empirical Analysis of Backward Compatibility in Machine Learning Systems

Published: 20 August 2020

Abstract

In many applications of machine learning (ML), updates are performed with the goal of enhancing model performance. However, current practices for updating models rely solely on isolated, aggregate performance analyses, overlooking important dependencies, expectations, and needs in real-world deployments. We consider how updates, intended to improve ML models, can introduce new errors that significantly affect downstream systems and users. For example, updates in models used in cloud-based classification services, such as image recognition, can cause unexpected erroneous behavior in systems that make calls to the services. Prior work has shown the importance of "backward compatibility" for maintaining human trust. We study challenges with backward compatibility across different ML architectures and datasets, focusing on common settings including data shifts with structured noise and ML employed in inferential pipelines. Our results show that (i) compatibility issues arise even without data shift due to optimization stochasticity, (ii) training on large-scale noisy datasets often results in significant decreases in backward compatibility even when model accuracy increases, and (iii) distributions of incompatible points align with noise bias, motivating the need for compatibility-aware de-noising and robustness methods.
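One common formalization of the backward compatibility studied in prior work (e.g., Bansal et al., AAAI 2019) is the fraction of examples the old model classified correctly that the updated model still classifies correctly. A minimal sketch of that score; the function name and toy labels below are illustrative, not taken from the paper:

```python
def backward_trust_compatibility(y_true, old_pred, new_pred):
    """Fraction of examples the old model got right that the updated
    model also gets right (1.0 = fully backward compatible)."""
    # Indices the old model classified correctly.
    old_right = [i for i, (t, o) in enumerate(zip(y_true, old_pred)) if o == t]
    if not old_right:
        return 1.0  # vacuously compatible: old model was never right
    # Of those, indices the updated model still classifies correctly.
    still_right = [i for i in old_right if new_pred[i] == y_true[i]]
    return len(still_right) / len(old_right)

# The updated model is more accurate overall (4/5 vs. 3/5) but breaks
# one example the old model handled, so the score falls below 1.
y_true   = [0, 1, 1, 0, 1]
old_pred = [0, 1, 1, 1, 0]   # accuracy 3/5
new_pred = [0, 1, 0, 0, 1]   # accuracy 4/5
print(round(backward_trust_compatibility(y_true, old_pred, new_pred), 3))  # prints 0.667
```

The example illustrates the abstract's central point: aggregate accuracy can improve while compatibility degrades, because the two metrics count different sets of examples.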


Published In

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 2020, 3664 pages
ISBN: 9781450379984
DOI: 10.1145/3394486

Publisher

Association for Computing Machinery, New York, NY, United States

      Author Tags

      1. backward compatibility
      2. machine learning
      3. reliability
      4. responsible data science

Conference

KDD '20
Overall Acceptance Rate: 1,133 of 8,635 submissions (13%)
