PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models

Published: 31 May 2020

Abstract

The ubiquitous use of machine learning algorithms brings new challenges to traditional database problems such as incremental view update. Much effort is being put into better understanding and debugging machine learning models, as well as into identifying and repairing errors in training datasets. Our focus is on how to assist these activities when the machine learning model must be retrained, either after removing problematic training samples during data cleaning or after selecting different subsets of training data for interpretability. This paper presents an efficient provenance-based approach, PrIU, and its optimized version, PrIU-opt, for incrementally updating model parameters without sacrificing prediction accuracy. We prove the correctness and convergence of the incrementally updated model parameters and validate them experimentally. Experimental results show that PrIU-opt achieves speed-ups of up to two orders of magnitude over retraining the model from scratch, while obtaining highly similar models.
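
To make the deletion setting concrete, here is a minimal sketch of incremental model update for the simplest possible case: one-dimensional least-squares regression with a closed-form solution. It is not the PrIU algorithm, which the abstract describes only at a high level. The class `LinearStats` and the function `fit_from_scratch` are hypothetical names introduced for illustration. The idea shown is that when a model depends on the training data only through a few aggregate sums, removing a sample amounts to subtracting its contribution rather than re-reading the whole dataset.

```python
from dataclasses import dataclass


@dataclass
class LinearStats:
    """Running sums that determine the least-squares fit y ~ a*x + b.

    Illustrative only: a closed-form analogue of deletion propagation,
    not the provenance-based method of the paper.
    """
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def add(self, x: float, y: float) -> None:
        # Fold one training sample into the aggregates.
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def remove(self, x: float, y: float) -> None:
        # Subtract one sample's contribution; no pass over the remaining data.
        self.n -= 1
        self.sx -= x
        self.sy -= y
        self.sxx -= x * x
        self.sxy -= x * y

    def fit(self) -> tuple[float, float]:
        # Solve the 2x2 normal equations for slope a and intercept b.
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        a = (self.n * self.sxy - self.sx * self.sy) / denom
        b = (self.sy - a * self.sx) / self.n
        return a, b


def fit_from_scratch(samples: list[tuple[float, float]]) -> tuple[float, float]:
    # Baseline: rebuild the aggregates from all remaining samples.
    stats = LinearStats()
    for x, y in samples:
        stats.add(x, y)
    return stats.fit()
```

For example, after adding the points (0, 1), (1, 3), (2, 5), (3, 7) and an erroneous point (4, 100), calling `remove(4.0, 100.0)` and then `fit()` gives the same slope and intercept as `fit_from_scratch` on the four clean points. Models trained by iterative gradient methods have no such closed form, which is the harder case the paper addresses.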

Published In

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
June 2020
2925 pages
ISBN:9781450367356
DOI:10.1145/3318464

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data provenance
  2. deletion propagation
  3. machine learning

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '20

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
