DOI: 10.1145/3448016.3457239

HedgeCut: Maintaining Randomised Trees for Low-Latency Machine Unlearning

Published: 18 June 2021
    Abstract

    Software systems that learn from user data with machine learning (ML) have become ubiquitous in recent years. Recent legislation such as the "General Data Protection Regulation" (GDPR) requires organisations that process personal data to delete user data upon request (enacting the "right to be forgotten"). This regulation not only requires the deletion of user data from databases, but also applies to ML models that have been trained on the stored data. We therefore argue that ML applications should offer users the ability to unlearn their data from trained models in a timely manner. We explore how fast this unlearning can be done under the constraints imposed by real-world deployments, and introduce the problem of low-latency machine unlearning: maintaining a deployed ML model in place under the removal of a small fraction of its training samples, without retraining.
    We propose HedgeCut, a classification model based on an ensemble of randomised decision trees, which is designed to answer unlearning requests with low latency. We detail how to implement HedgeCut efficiently with vectorised operators for decision tree learning. In an experimental evaluation on five privacy-sensitive datasets, we find that HedgeCut can unlearn training samples with a latency of around 100 microseconds and answer up to 36,000 prediction requests per second, while providing a training time and predictive accuracy similar to widely used implementations of tree-based ML models such as Random Forests.
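
    To make the abstract's central idea concrete: once a decision tree's structure is fixed, each training sample influences only the statistics of the nodes on its single root-to-leaf path, so forgetting that sample amounts to O(depth) count updates instead of a retrain. Below is a minimal, self-contained Python sketch of this property for a toy extremely randomised tree. The class Node and its unlearn method are illustrative names rather than the paper's implementation, and the sketch deliberately ignores the hard case where a removal would change which split should have been chosen; handling that case efficiently is what HedgeCut itself is about. An ensemble model would hold many such trees, combine their predictions by voting, and forward every unlearning request to all trees.

        import random
        from collections import Counter

        class Node:
            """Toy extremely randomised decision tree node (illustrative only).

            Every node keeps per-class counts of the training samples that
            reached it, so removing one sample only requires decrementing
            counts along its root-to-leaf path."""

            def __init__(self, samples, labels, depth=0, max_depth=4):
                self.counts = Counter(labels)
                self.split = self.left = self.right = None
                if depth < max_depth and len(set(labels)) > 1:
                    # 'extremely randomised': pick attribute and threshold at random
                    attr = random.randrange(len(samples[0]))
                    thr = random.choice([s[attr] for s in samples])
                    goes_left = [s[attr] <= thr for s in samples]
                    if any(goes_left) and not all(goes_left):
                        self.split = (attr, thr)
                        self.left = Node([s for s, g in zip(samples, goes_left) if g],
                                         [c for c, g in zip(labels, goes_left) if g],
                                         depth + 1, max_depth)
                        self.right = Node([s for s, g in zip(samples, goes_left) if not g],
                                          [c for c, g in zip(labels, goes_left) if not g],
                                          depth + 1, max_depth)

            def predict(self, sample):
                if self.split is None:
                    return self.counts.most_common(1)[0][0]
                attr, thr = self.split
                return (self.left if sample[attr] <= thr else self.right).predict(sample)

            def unlearn(self, sample, label):
                """Forget one training sample in place: O(depth) count updates,
                no retraining and no access to the remaining training data."""
                self.counts[label] -= 1
                if self.split is not None:
                    attr, thr = self.split
                    (self.left if sample[attr] <= thr else self.right).unlearn(sample, label)

        X = [[0.1, 1.2], [0.9, 0.3], [0.4, 0.8], [0.7, 0.2]]
        y = ["a", "b", "a", "b"]
        tree = Node(X, y)
        tree.unlearn(X[0], y[0])  # the first sample no longer affects any node statistics
        print(tree.predict([0.5, 0.5]))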
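
    The abstract also credits vectorised operators for HedgeCut's competitive training time. The sketch below shows, assuming NumPy, the general flavour of such an operator: scoring a whole batch of candidate split thresholds for one feature under the Gini criterion with dense array arithmetic, instead of a Python-level loop over thresholds and samples. The function gini_scores and its signature are hypothetical and not taken from the paper.

        import numpy as np

        def gini_scores(feature, labels, thresholds):
            """Weighted Gini impurity of the split 'feature <= t' for every
            candidate threshold t at once (vectorised over thresholds).

            feature: (n,) floats, labels: (n,) ints in {0..k-1},
            thresholds: (t,) floats. Returns a (t,) array of scores."""
            n = feature.shape[0]
            k = int(labels.max()) + 1
            one_hot = np.eye(k)[labels]                                      # (n, k) label indicators
            left = (feature[None, :] <= thresholds[:, None]).astype(float)  # (t, n) routing matrix
            left_counts = left @ one_hot                                    # (t, k) class counts per side
            right_counts = one_hot.sum(axis=0)[None, :] - left_counts
            n_left = left_counts.sum(axis=1)
            n_right = n - n_left

            def gini(counts, size):
                p = counts / np.maximum(size, 1e-12)[:, None]  # avoid 0/0 on empty sides
                return 1.0 - (p ** 2).sum(axis=1)

            return (n_left * gini(left_counts, n_left)
                    + n_right * gini(right_counts, n_right)) / n

        # Extremely-randomised-style usage: draw a few random thresholds, keep the best.
        rng = np.random.default_rng(0)
        x = rng.random(1000)
        y = (x > 0.6).astype(int)
        candidates = rng.choice(x, size=8, replace=False)
        best = candidates[np.argmin(gini_scores(x, y, candidates))]

    Materialising the (t, n) comparison matrix turns the per-threshold scan into a single matrix product over one-hot labels, the same batching idea that motivates vectorised operators in analytical query engines.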

    Supplementary Material

    MP4 File (3448016.3457239.mp4)





      Information

      Published In

      SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
      June 2021
      2969 pages
      ISBN:9781450383431
      DOI:10.1145/3448016
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 18 June 2021


      Author Tags

      1. decision trees
      2. machine unlearning
      3. serving systems

      Qualifiers

      • Research-article

      Conference

      SIGMOD/PODS '21

      Acceptance Rates

      Overall Acceptance Rate 785 of 4,003 submissions, 20%


      Cited By

      • (2024) The Image Calculator: 10x Faster Image-AI Inference by Replacing JPEG with Self-designing Storage Format. Proceedings of the ACM on Management of Data 2(1), 1-31. https://doi.org/10.1145/3639307
      • (2024) Machine Unlearning: Solutions and Challenges. IEEE Transactions on Emerging Topics in Computational Intelligence 8(3), 2150-2168. https://doi.org/10.1109/TETCI.2024.3379240
      • (2024) An overview of machine unlearning. High-Confidence Computing, 100254. https://doi.org/10.1016/j.hcc.2024.100254
      • (2024) Mitigate noisy data for smart IoT via GAN based machine unlearning. Science China Information Sciences 67(3). https://doi.org/10.1007/s11432-022-3671-9
      • (2023) Certified minimax unlearning with generalization rates and deletion capacity. Proceedings of the 37th International Conference on Neural Information Processing Systems, 62821-62852. https://doi.org/10.5555/3666122.3668866
      • (2023) Equitable Data Valuation Meets the Right to Be Forgotten in Model Markets. Proceedings of the VLDB Endowment 16(11), 3349-3362. https://doi.org/10.14778/3611479.3611531
      • (2023) Machine Unlearning: A Survey. ACM Computing Surveys 56(1), 1-36. https://doi.org/10.1145/3603620
      • (2023) DeltaBoost: Gradient Boosting Decision Trees with Efficient Machine Unlearning. Proceedings of the ACM on Management of Data 1(2), 1-26. https://doi.org/10.1145/3589313
      • (2023) RUE: Realising Unlearning from the Perspective of Economics. 2023 IEEE 22nd International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), 1165-1172. https://doi.org/10.1109/TrustCom60117.2023.00159
      • (2023) QoSEraser: A Data Erasable Framework for Web Service QoS Prediction. 2023 IEEE International Conference on Software Services Engineering (SSE), 89-97. https://doi.org/10.1109/SSE60056.2023.00022
