High-Probability Sample Complexities for Policy Evaluation With Linear Function Approximation

Published: 29 April 2024, in IEEE Transactions on Information Theory, vol. 70, no. 8, Aug. 2024 (IEEE Press).

Abstract

This paper is concerned with the problem of policy evaluation with linear function approximation in discounted infinite-horizon Markov decision processes. We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely used policy evaluation algorithms: the temporal difference (TD) learning algorithm and the two-timescale linear TD with gradient correction (TDC) algorithm. In both the on-policy setting, where observations are generated from the target policy, and the off-policy setting, where samples are drawn from a behavior policy potentially different from the target policy, we establish the first sample complexity bounds with high-probability convergence guarantees that attain the optimal dependence on the tolerance level. We also exhibit an explicit dependence on problem-related quantities, and show in the on-policy setting that our upper bound matches the minimax lower bound on crucial problem parameters, including the choice of the feature map and the problem dimension.
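
For readers unfamiliar with the two algorithms named in the abstract, the sketch below shows the standard TD(0) and two-timescale TDC update rules with linear features on a small synthetic MDP. The MDP, the feature map phi, and the step sizes eta, alpha, beta are illustrative assumptions for this sketch only; they do not reflect the paper's analysis, constants, or experimental setup.

# Minimal sketch of TD(0) and two-timescale TDC with linear function
# approximation. Everything below (MDP, features, step sizes) is an
# illustrative assumption, not the paper's setting.
import numpy as np

rng = np.random.default_rng(0)
n_states, dim = 10, 4
gamma = 0.9
P = rng.dirichlet(np.ones(n_states), size=n_states)  # transitions under the target policy
r = rng.uniform(size=n_states)                        # per-state rewards
phi = rng.normal(size=(n_states, dim))                # feature map phi(s) in R^dim


def td0(num_steps=50_000, eta=0.05):
    """Plain TD(0): theta <- theta + eta * delta_t * phi(s_t)."""
    theta, s = np.zeros(dim), 0
    for _ in range(num_steps):
        s_next = rng.choice(n_states, p=P[s])
        # TD error: delta_t = r(s_t) + gamma * phi(s_{t+1})^T theta - phi(s_t)^T theta
        delta = r[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
        theta += eta * delta * phi[s]
        s = s_next
    return theta


def tdc(num_steps=50_000, alpha=0.05, beta=0.1):
    """Two-timescale TDC: a fast auxiliary iterate w supplies the gradient correction."""
    theta, w, s = np.zeros(dim), np.zeros(dim), 0
    for _ in range(num_steps):
        s_next = rng.choice(n_states, p=P[s])
        delta = r[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
        # slow timescale: TD step minus the gradient-correction term
        theta += alpha * (delta * phi[s] - gamma * (phi[s] @ w) * phi[s_next])
        # fast timescale: w tracks the solution of E[phi phi^T] w = E[delta phi]
        w += beta * (delta - phi[s] @ w) * phi[s]
        s = s_next
    return theta


print("TD(0) estimate:", td0())
print("TDC   estimate:", tdc())

In the off-policy setting studied in the paper, each update would additionally be weighted by an importance ratio between the target and behavior policies; the on-policy sketch above omits this for brevity.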

