PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models

Authors:

Susan B. Davidson

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Pages 447 - 462

https://doi.org/10.1145/3318464.3380571

Published: 31 May 2020

Abstract

The ubiquitous use of machine learning algorithms brings new challenges to traditional database problems such as incremental view update. Much effort is going into better understanding and debugging machine learning models, as well as into identifying and repairing errors in training datasets. Our focus is on how to assist these activities when the machine learning model must be retrained, either after problematic training samples are removed during cleaning or after different subsets of the training data are selected for interpretability. This paper presents an efficient provenance-based approach, PrIU, and its optimized version, PrIU-opt, for incrementally updating model parameters without sacrificing prediction accuracy. We prove the correctness and convergence of the incrementally updated model parameters and validate them experimentally. Experimental results show that PrIU-opt achieves speed-ups of up to two orders of magnitude over retraining the model from scratch while producing highly similar models.
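
To make the setting concrete, the following is a minimal sketch of updating a regression model after training samples are deleted, without retraining on the full dataset. It is not PrIU or PrIU-opt: it swaps in a much simpler technique, cached sufficient statistics for one-feature least-squares regression, purely to illustrate the problem the abstract describes. The names `LineStats` and `retrain` and the toy data are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

Sample = Tuple[float, float]  # (x, y); hypothetical toy representation


@dataclass
class LineStats:
    """Cached sums that fully determine the least-squares line y = a*x + b."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def add(self, samples: Iterable[Sample]) -> None:
        for x, y in samples:
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y

    def remove(self, samples: Iterable[Sample]) -> None:
        # Subtract only the deleted samples' contributions. The cost depends
        # on the number of deletions, not on the size of the training set.
        for x, y in samples:
            self.n -= 1
            self.sx -= x
            self.sy -= y
            self.sxx -= x * x
            self.sxy -= x * y

    def fit(self) -> Tuple[float, float]:
        # Closed-form solution of the normal equations for one feature.
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        a = (self.n * self.sxy - self.sx * self.sy) / denom
        b = (self.sy - a * self.sx) / self.n
        return a, b


def retrain(samples: Iterable[Sample]) -> Tuple[float, float]:
    """Baseline: rebuild the statistics from scratch over all kept samples."""
    stats = LineStats()
    stats.add(samples)
    return stats.fit()


if __name__ == "__main__":
    # Four samples on y = 2x + 1 plus one problematic sample.
    data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 100.0)]
    stats = LineStats()
    stats.add(data)
    stats.remove([data[-1]])           # incremental update after cleaning
    print("incremental:", stats.fit())
    print("retrained:  ", retrain(data[:-1]))
```

In this toy case the incremental update and the retrained model agree because the model has a closed form over additive statistics. Iteratively trained models, such as the regression models the paper targets, do not reduce to a handful of sums, which is the gap a provenance-based approach is meant to address.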


Published In

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

June 2020

2925 pages

ISBN: 9781450367356

DOI: 10.1145/3318464

General Chairs: David Maier (Portland State University, USA), Rachel Pottinger (University of British Columbia, Canada)

Program Chairs: AnHai Doan (University of Wisconsin, USA), Wang-Chiew Tan (Megagon Labs, USA)

Publications Chairs: Abdussalam Alawini (University of Illinois at Urbana-Champaign, USA), Hung Q. Ngo (RelationalAI, USA)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

SIGMOD/PODS '20: International Conference on Management of Data

Sponsor: SIGMOD

June 14 - 19, 2020

Portland, OR, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
