DOI: 10.5555/3540261.3540943
Research article · NIPS Conference Proceedings

On empirical risk minimization with dependent and heavy-tailed data

Published: 06 December 2021

Abstract

In this work, we establish risk bounds for Empirical Risk Minimization (ERM) with both dependent and heavy-tailed data-generating processes. We do so by extending the seminal works [Men15, Men18] on the analysis of ERM with heavy-tailed but independent and identically distributed observations to the strictly stationary, exponentially β-mixing case. Our analysis is based on explicitly controlling the multiplier process arising from the interaction between the noise and the function evaluations on inputs. It allows this interaction to be even polynomially heavy-tailed, which covers a significantly larger class of heavy-tailed models than those analyzed in the learning theory literature. We illustrate our results by deriving rates of convergence for the high-dimensional linear regression problem with dependent and heavy-tailed data.
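The setting described above can be illustrated with a minimal simulation that is not from the paper itself: a strictly stationary, geometrically β-mixing covariate process (a Gaussian AR(1) chain), polynomially heavy-tailed noise (symmetric Pareto), and squared-loss ERM, which here reduces to ordinary least squares. All parameter values (`n`, `d`, `phi`, the tail index) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dependent covariates: a Gaussian AR(1) chain, which is strictly
# stationary (asymptotically) and exponentially beta-mixing.
n, d, phi = 2000, 5, 0.5
X = np.zeros((n, d))
for t in range(1, n):
    X[t] = phi * X[t - 1] + rng.standard_normal(d)

# Polynomially heavy-tailed noise: symmetric Pareto with tail index 3,
# so moments of order >= 3 are infinite but the variance is finite.
alpha = 3.0
noise = rng.pareto(alpha, n) * rng.choice([-1.0, 1.0], n)

beta_star = np.ones(d)
y = X @ beta_star + noise

# ERM with the squared loss over linear predictors = ordinary least squares.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.linalg.norm(beta_hat - beta_star))
```

Despite the dependence in the covariates and the heavy tails in the noise, the estimation error decays with the sample size, consistent with the kind of rates the paper establishes for linear regression.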

Supplementary Material

Additional material (3540261.3540943_supp.pdf)
Supplemental material.

References

[1]
Carlo Acerbi, Spectral measures of risk: A coherent representation of subjective risk aversion, Journal of Banking & Finance 26 (2002), no. 7, 1505–1518. (Cited on page 2.)
[2]
Pierre Alquier, Xiaoyin Li, and Olivier Wintenberger, Prediction of time series by statistical learning: General losses and fast rates, Dependence Modeling 1 (2013), no. 2013, 65–93. (Cited on page 2.)
[3]
David Aldous and Umesh Vazirani, A Markovian extension of Valiant's learning model, Proceedings [1990] 31st Annual Symposium on Foundations of Computer Science, IEEE, 1990, pp. 392–396. (Cited on page 2.)
[4]
Pierre Alquier and Olivier Wintenberger, Model selection for weakly dependent time series forecasting, Bernoulli 18 (2012), no. 3, 883–913. (Cited on page 3.)
[5]
Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson, Local Rademacher complexities, The Annals of Statistics 33 (2005), no. 4, 1497–1537. (Cited on page 1.)
[6]
Graciela Boente and Ricardo Fraiman, Robust nonparametric regression estimation for dependent observations, The Annals of Statistics (1989), 1242–1256. (Cited on page 2.)
[7]
Rakesh D Barve and Philip M Long, On the complexity of learning from drifting distributions, Information and Computation 138 (1997), no. 2, 170–193. (Cited on page 2.)
[8]
Daniel Bartl and Shahar Mendelson, On Monte-Carlo methods in convex stochastic optimization, arXiv preprint arXiv:2101.07794 (2021). (Cited on page 2.)
[9]
Milad Bakhshizadeh, Arian Maleki, and Victor H de la Pena, Sharp concentration results for heavy-tailed distributions, arXiv preprint arXiv:2003.13819 (2020). (Cited on pages 2 and 6.)
[10]
Patrizia Berti and Pietro Rigo, A Glivenko-Cantelli theorem for exchangeable random variables, Statistics & Probability Letters 32 (1997), no. 4, 385–391. (Cited on page 2.)
[11]
Richard C Bradley, Basic properties of strong mixing conditions. A survey and some open questions, arXiv preprint math/0511078 (2005). (Cited on page 9.)
[12]
Kacper Chwialkowski and Arthur Gretton, A kernel independence test for random processes, International Conference on Machine Learning, PMLR, 2014, pp. 1422–1430. (Cited on pages 5 and 16.)
[13]
Christophe Chesneau, A tail bound for sums of independent random variables: application to the symmetric Pareto distribution. (Cited on page 36.)
[14]
Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, and Siddhartha Jayanti, Learning from weakly dependent data under Dobrushin's condition, Conference on Learning Theory, PMLR, 2019, pp. 914–928. (Cited on page 2.)
[15]
Murat Dundar, Balaji Krishnapuram, Jinbo Bi, and R Bharat Rao, Learning classifiers when the training data is not IID, IJCAI, vol. 2007, 2007, pp. 756–761. (Cited on page 2.)
[16]
Jérôme Dedecker and Clémentine Prieur, Coupling for τ-dependent sequences and applications, Journal of Theoretical Probability 17 (2004), no. 4, 861–885. (Cited on page 15.)
[17]
A Philip Dawid and Ambuj Tewari, On learnability under general stochastic processes, arXiv preprint arXiv:2005.07605 (2020). (Cited on page 2.)
[18]
Amir-massoud Farahmand and Csaba Szepesvári, Regularized least-squares regression: Learning from a β-mixing sequence, Journal of Statistical Planning and Inference 142 (2012), no. 2, 493–505. (Cited on page 4.)
[19]
David Gamarnik, Extension of the PAC framework to finite and countable markov chains, IEEE Transactions on Information Theory 49 (2003), no. 1, 338–345. (Cited on page 2.)
[20]
Peter D Grünwald and Nishant A Mehta, Fast rates for general unbounded loss functions: From ERM to Generalized Bayes., Journal of Machine Learning Research 21 (2020), no. 56, 1–80. (Cited on page 2.)
[21]
Steve Hanneke, Learning whenever learning is possible: Universal learning under general stochastic processes, Journal of Machine Learning Research (to appear) (2021). (Cited on page 2.)
[22]
Matthew J Holland and El Mehdi Haress, Spectral risk-based learning using unbounded losses, arXiv preprint arXiv:2105.04816 (2021). (Cited on page 2.)
[23]
Hanyuan Hang and Ingo Steinwart, Fast learning from α-mixing observations, Journal of Multivariate Analysis 127 (2014), 184–199. (Cited on page 2.)
[24]
Peter J Huber, Robust estimation of a location parameter, Breakthroughs in statistics, Springer, 1992, pp. 492–518. (Cited on page 2.)
[25]
Qiyang Han and Jon A Wellner, Convergence rates of least squares regression estimators with heavy-tailed errors, Annals of Statistics 47 (2019), no. 4, 2286–2319. (Cited on pages 3 and 7.)
[26]
A Irle, On consistency in nonparametric estimation under mixing conditions, Journal of Multivariate Analysis 60 (1997), no. 1, 123–147. (Cited on page 2.)
[27]
Jiancheng Jiang and YP Mack, Robust local polynomial regression for dependent data, Statistica Sinica (2001), 705–722. (Cited on page 2.)
[28]
Vardis Kandiros, Yuval Dagan, Nishanth Dikkala, Surbhi Goel, and Constantinos Daskalakis, Statistical estimation from dependent data, International Conference on Machine Learning, PMLR, 2021, pp. 5269–5278. (Cited on page 2.)
[29]
Vitaly Kuznetsov and Mehryar Mohri, Generalization bounds for non-stationary mixing processes, Machine Learning 106 (2017), no. 1, 93–117. (Cited on pages 3 and 4.)
[30]
Vladimir Koltchinskii, Local Rademacher complexities and oracle inequalities in risk minimization, Annals of Statistics 34 (2006), no. 6, 2593–2656. (Cited on pages 1 and 3.)
[31]
Vladimir Koltchinskii, Oracle inequalities in empirical risk minimization and sparse recovery problems, vol. 2033, Springer Science & Business Media, 2011. (Cited on pages 1 and 2.)
[32]
Guillaume Lecué and Matthieu Lerasle, Robust machine learning by median-of-means: theory and practice, Annals of Statistics 48 (2020), no. 2, 906–931. (Cited on page 2.)
[33]
Guillaume Lecué and Shahar Mendelson, Regularization and the small-ball method I: Sparse recovery, The Annals of Statistics 46 (2018), no. 2, 611–641. (Cited on page 2.)
[34]
Gábor Lugosi and Shahar Mendelson, Regularization, sparse recovery, and median-of-means tournaments, Bernoulli 25 (2019), no. 3, 2075–2106. (Cited on pages 2 and 9.)
[35]
Po-Ling Loh, Statistical consistency and asymptotic normality for high-dimensional robust M-estimators, The Annals of Statistics 45 (2017), no. 2, 866–896. (Cited on page 2.)
[36]
Tengyuan Liang, Alexander Rakhlin, and Karthik Sridharan, Learning with square loss: Localization through offset Rademacher complexity, Conference on Learning Theory, PMLR, 2015, pp. 1260–1285. (Cited on page 2.)
[37]
Michel Ledoux and Michel Talagrand, Probability in Banach spaces: isoperimetry and processes, Springer Science & Business Media, 2013. (Cited on page 2.)
[38]
Ron Meir, Nonparametric time series prediction through adaptive model selection, Machine Learning 39 (2000), no. 1, 5–34. (Cited on page 3.)
[39]
Shahar Mendelson, Learning without concentration, Journal of the ACM (JACM) 62 (2015), no. 3, 1–25. (Cited on pages 1, 2, 4, 7, 9, 10, 24, 34, 35, and 39.)
[40]
Shahar Mendelson, Local vs. global parameters: Breaking the Gaussian complexity barrier, The Annals of Statistics 45 (2017), no. 5, 1835–1862. (Cited on page 2.)
[41]
Shahar Mendelson, On multiplier processes under weak moment assumptions, Geometric aspects of functional analysis, Springer, 2017, pp. 301–318. (Cited on page 2.)
[42]
Shahar Mendelson, Learning without concentration for general loss functions, Probability Theory and Related Fields 171 (2018), no. 1, 459–502. (Cited on pages 1, 2, 5, 10, 29, and 33.)
[43]
Zakaria Mhammedi, Benjamin Guedj, and Robert C Williamson, PAC-Bayesian bound for the conditional value at risk, arXiv preprint arXiv:2006.14763 (2020). (Cited on page 2.)
[44]
Stanislav Minsker and Timothée Mathieu, Excess risk bounds in robust empirical risk minimization, arXiv preprint arXiv:1910.07485 (2019). (Cited on page 2.)
[45]
Abdelkader Mokkadem, Mixing properties of ARMA processes, Stochastic Processes and their Applications 29 (1988), no. 2, 309–315. (Cited on page 4.)
[46]
Florence Merlevède, Magda Peligrad, and Emmanuel Rio, A Bernstein type inequality and moderate deviations for weakly dependent sequences, Probability Theory and Related Fields 151 (2011), no. 3-4, 435–474. (Cited on pages 2 and 5.)
[47]
Mehryar Mohri and Afshin Rostamizadeh, Rademacher complexity bounds for Non-IID processes, Proceedings of the 21st International Conference on Neural Information Processing Systems, 2008, pp. 1097–1104. (Cited on pages 2 and 4.)
[48]
Daniel J McDonald and Cosma Rohilla Shalizi, Rademacher complexity of stationary sequences, arXiv preprint arXiv:1106.0730 (2011). (Cited on page 2.)
[49]
Shahar Mendelson and Nikita Zhivotovskiy, Robust covariance estimation under l4-l2 norm equivalence, Annals of Statistics 48 (2020), no. 3, 1648–1664. (Cited on page 6.)
[50]
Andrew B Nobel, Limits to classification and regression estimation from ergodic processes, The Annals of Statistics 27 (1999), no. 1, 262–273. (Cited on page 2.)
[51]
Vladimir Pestov, Predictive PAC learnability: A paradigm for learning from exchangeable input data, 2010 IEEE International Conference on Granular Computing, IEEE, 2010, pp. 387–391. (Cited on page 2.)
[52]
RN Pillai, Semi-Pareto processes, Journal of Applied Probability (1991), 461–465. (Cited on page 8.)
[53]
Miroslav M Ristić, A generalized semi-Pareto minification process, Statistical Papers 49 (2008), no. 2, 343–351. (Cited on page 9.)
[54]
Andrzej Ruszczyński and Alexander Shapiro, Optimization of convex risk functions, Mathematics of operations research 31 (2006), no. 3, 433–452. (Cited on page 2.)
[55]
Alexander Rakhlin and Karthik Sridharan, Online non-parametric regression, Conference on Learning Theory, PMLR, 2014, pp. 1232–1264. (Cited on page 2.)
[56]
Liva Ralaivola, Marie Szafranski, and Guillaume Stempfel, Chromatic PAC-Bayes bounds for non-IID data: Applications to ranking and stationary β-mixing processes, The Journal of Machine Learning Research 11 (2010), 1927–1956. (Cited on page 2.)
[57]
Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari, Sequential complexities and uniform martingale laws of large numbers, Probability Theory and Related Fields 161 (2015), no. 1-2, 111–153. (Cited on page 2.)
[58]
R Tyrrell Rockafellar and Stanislav Uryasev, Conditional value-at-risk for general loss distributions, Journal of Banking & Finance 26 (2002), no. 7, 1443–1471. (Cited on page 2.)
[59]
Ingo Steinwart, Don Hush, and Clint Scovel, Learning from dependent observations, Journal of Multivariate Analysis 100 (2009), no. 1, 175–194. (Cited on page 2.)
[60]
Tasuku Soma and Yuichi Yoshida, Statistical learning with conditional value at risk, arXiv preprint arXiv:2002.05826 (2020). (Cited on page 2.)
[61]
Vladimir Vapnik and Aleksei Chervonenkis, On uniform convergence of the frequencies of events to their probabilities, Teoriya Veroyatnostei i ee Primeneniya 16 (1971), no. 2, 264–279. (Cited on page 1.)
[62]
Sara van de Geer, Empirical processes in M-estimation, vol. 6, Cambridge University Press, 2000. (Cited on page 1.)
[63]
Aad W Van Der Vaart and Jon A Wellner, Weak convergence, Weak convergence and empirical processes, Springer, 1996, pp. 16–28. (Cited on page 1.)
[64]
Mariia Vladimirova, Stéphane Girard, Hien Nguyen, and Julyan Arbel, Sub-Weibull distributions: Generalizing sub-Gaussian and sub-Exponential properties to heavier tailed distributions, Stat 9 (2020), no. 1, e318. (Cited on pages 10 and 34.)
[65]
Mathukumalli Vidyasagar, Learning and generalisation: With applications to neural networks, Springer Science & Business Media, 2013. (Cited on pages 3 and 4.)
[66]
Mathukumalli Vidyasagar and Rajeeva L Karandikar, A learning theory approach to system identification and stochastic adaptive control, Probabilistic and randomized methods for design under uncertainty, Springer, 2006, pp. 265–302. (Cited on page 4.)
[67]
Kam Chung Wong, Zifan Li, and Ambuj Tewari, Lasso guarantees for β-mixing heavy-tailed time series, Annals of Statistics 48 (2020), no. 2, 1124–1142. (Cited on pages 3, 4, 8, and 34.)
[68]
Di Wang, Yao Zheng, Heng Lian, and Guodong Li, High-dimensional vector autoregressive time series modeling via tensor decomposition, 2020. (Cited on pages 10 and 34.)
[69]
Bin Yu, Rates of convergence for empirical processes of stationary mixing sequences, The Annals of Probability (1994), 94–116. (Cited on pages 2, 3, 4, 18, and 25.)
[70]
Yongquan Zhang, Feilong Cao, and Canwei Yan, Learning rates of least-square regularized regression with strongly mixing observation, International Journal of Machine Learning and Cybernetics 3 (2012), no. 4, 277–283. (Cited on page 2.)
[71]
Lijun Zhang and Zhi-Hua Zhou, ℓ1-regression with heavy-tailed distributions, Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 1084–1094. (Cited on page 9.)


Published In

NIPS '21: Proceedings of the 35th International Conference on Neural Information Processing Systems
December 2021
30517 pages

Publisher

Curran Associates Inc.

Red Hook, NY, United States
