DOI: 10.1145/3406325.3451131
research-article
Public Access

When is memorization of irrelevant training data necessary for high-accuracy learning?

Published: 15 June 2021

Abstract

Modern machine learning models are complex and frequently encode surprising amounts of information about individual inputs. In extreme cases, complex models appear to memorize entire input examples, including seemingly irrelevant information (social security numbers from text, for example). In this paper, we aim to understand whether this sort of memorization is necessary for accurate learning. We describe natural prediction problems in which every sufficiently accurate training algorithm must encode, in the prediction model, essentially all the information about a large subset of its training examples. This remains true even when the examples are high-dimensional and have entropy much higher than the sample size, and even when most of that information is ultimately irrelevant to the task at hand. Further, our results do not depend on the training algorithm or the class of models used for learning.
Our problems are simple and fairly natural variants of the next-symbol prediction and the cluster labeling tasks. These tasks can be seen as abstractions of text- and image-related prediction problems. To establish our results, we reduce from a family of one-way communication problems for which we prove new information complexity lower bounds.
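To make the flavor of these lower bounds concrete, here is a hedged sketch in my own notation (an illustration of the kind of statement the abstract describes, not the paper's exact theorem):

```latex
% Illustrative sketch, notation mine. Let S = (X_1, \dots, X_n) be the
% training sample, each example X_i carrying on the order of d bits of
% entropy, and let M = A(S) be the model produced by any training
% algorithm A that is sufficiently accurate on the task. The results
% say that M must retain essentially all the information about a
% constant fraction of the examples:
\[
  I(M ; S) \;=\; \Omega(n \cdot d),
\]
% even when only a vanishing fraction of each example's d bits is
% relevant to the prediction task, and regardless of the algorithm A
% or the model class from which M is drawn.
```

Here I(M; S) denotes mutual information between the model and the training sample; the constants and the precise accuracy threshold depend on the specific task (next-symbol prediction or cluster labeling) and are spelled out in the paper.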




Published In

STOC 2021: Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing
June 2021
1797 pages
ISBN:9781450380539
DOI:10.1145/3406325
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Information Complexity
  2. Memorization
  3. Overparameterization

Qualifiers

  • Research-article

Conference

STOC '21

Acceptance Rates

Overall acceptance rate: 1,469 of 4,586 submissions (32%)

Article Metrics

  • Downloads (last 12 months): 472
  • Downloads (last 6 weeks): 71
Reflects downloads up to 26 Sep 2024

Cited By

  • (2024) Forget me not: memorisation in generative sequence models trained on open source licensed code. SSRN Electronic Journal. DOI: 10.2139/ssrn.4720990.
  • (2024) SoK: Unintended Interactions among Machine Learning Defenses and Risks. 2024 IEEE Symposium on Security and Privacy (SP), pages 2996–3014. DOI: 10.1109/SP54263.2024.00243. Published 19 May 2024.
  • (2024) A Systematic Review of Adversarial Machine Learning Attacks, Defensive Controls, and Technologies. IEEE Access, 12, pages 99382–99421. DOI: 10.1109/ACCESS.2024.3423323.
  • (2024) AI model disgorgement: Methods and choices. Proceedings of the National Academy of Sciences, 121(18). DOI: 10.1073/pnas.2307304121. Published 19 April 2024.
  • (2024) Adversarial attacks and defenses for large language models (LLMs): methods, frameworks & challenges. International Journal of Multimedia Information Retrieval, 13(3). DOI: 10.1007/s13735-024-00334-8. Published 25 June 2024.
  • (2023) Deconstructing data reconstruction. Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 51515–51535. DOI: 10.5555/3666122.3668365. Published 10 December 2023.
  • (2023) Machine Unlearning: A Survey. ACM Computing Surveys, 56(1), pages 1–36. DOI: 10.1145/3603620. Published 28 August 2023.
  • (2023) Memory-Query Tradeoffs for Randomized Convex Optimization. 2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS), pages 1400–1413. DOI: 10.1109/FOCS57990.2023.00086. Published 6 November 2023.
  • (2023) Near Optimal Memory-Regret Tradeoff for Online Learning. 2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS), pages 1171–1194. DOI: 10.1109/FOCS57990.2023.00069. Published 6 November 2023.
  • (2023) Data Reconstruction Attack Against Principal Component Analysis. Security and Privacy in Social Networks and Big Data, pages 79–92. DOI: 10.1007/978-981-99-5177-2_5. Published 14 August 2023.
