High-Probability Sample Complexities for Policy Evaluation With Linear Function Approximation

Published: 29 April 2024, in IEEE Transactions on Information Theory, vol. 70, no. 8, Aug. 2024 (IEEE Press).

Abstract

This paper is concerned with the problem of policy evaluation with linear function approximation in discounted infinite-horizon Markov decision processes. We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely used policy evaluation algorithms: the temporal difference (TD) learning algorithm and the two-timescale linear TD with gradient correction (TDC) algorithm. In both the on-policy setting, where observations are generated from the target policy, and the off-policy setting, where samples are drawn from a behavior policy potentially different from the target policy, we establish the first sample complexity bounds with high-probability convergence guarantees that attain the optimal dependence on the tolerance level. We also exhibit an explicit dependence on problem-related quantities, and show in the on-policy setting that our upper bound matches the minimax lower bound on crucial problem parameters, including the choice of the feature map and the problem dimension.
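
For readers unfamiliar with the two algorithms named in the abstract, the sketch below shows the standard TD(0) and two-timescale TDC update rules with linear features on a small synthetic MDP. The MDP, the feature map phi, and the step sizes eta, alpha, beta are illustrative assumptions for this sketch only; they do not reflect the paper's analysis, constants, or experimental setup.

# Minimal sketch of TD(0) and two-timescale TDC with linear function
# approximation. Everything below (MDP, features, step sizes) is an
# illustrative assumption, not the paper's setting.
import numpy as np

rng = np.random.default_rng(0)
n_states, dim = 10, 4
gamma = 0.9
P = rng.dirichlet(np.ones(n_states), size=n_states)  # transitions under the target policy
r = rng.uniform(size=n_states)                        # per-state rewards
phi = rng.normal(size=(n_states, dim))                # feature map phi(s) in R^dim


def td0(num_steps=50_000, eta=0.05):
    """Plain TD(0): theta <- theta + eta * delta_t * phi(s_t)."""
    theta, s = np.zeros(dim), 0
    for _ in range(num_steps):
        s_next = rng.choice(n_states, p=P[s])
        # TD error: delta_t = r(s_t) + gamma * phi(s_{t+1})^T theta - phi(s_t)^T theta
        delta = r[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
        theta += eta * delta * phi[s]
        s = s_next
    return theta


def tdc(num_steps=50_000, alpha=0.05, beta=0.1):
    """Two-timescale TDC: a fast auxiliary iterate w supplies the gradient correction."""
    theta, w, s = np.zeros(dim), np.zeros(dim), 0
    for _ in range(num_steps):
        s_next = rng.choice(n_states, p=P[s])
        delta = r[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
        # slow timescale: TD step minus the gradient-correction term
        theta += alpha * (delta * phi[s] - gamma * (phi[s] @ w) * phi[s_next])
        # fast timescale: w tracks the solution of E[phi phi^T] w = E[delta phi]
        w += beta * (delta - phi[s] @ w) * phi[s]
        s = s_next
    return theta


print("TD(0) estimate:", td0())
print("TDC   estimate:", tdc())

In the off-policy setting studied in the paper, each update would additionally be weighted by an importance ratio between the target and behavior policies; the on-policy sketch above omits this for brevity.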

