Abstract
Off-policy evaluation is a fundamental and challenging problem in reinforcement learning (RL). This paper focuses on value estimation of a target policy based on pre-collected data generated by a possibly different policy, under the framework of infinite-horizon Markov decision processes. Motivated by the recently developed marginal importance sampling method in RL and the covariate balancing idea in causal inference, we propose a novel estimator with approximately projected state-action balancing weights for policy value estimation. We obtain the convergence rate of these weights and show that the proposed value estimator is asymptotically normal under technical conditions. In terms of asymptotics, our results scale with both the number of trajectories and the number of decision points in each trajectory. As such, consistency can still be achieved with a limited number of subjects when the number of decision points diverges. In addition, we develop a necessary and sufficient condition for the well-posedness of the operator underlying nonparametric Q-function estimation in the off-policy setting; this condition characterizes the difficulty of Q-function estimation and may be of independent interest. Numerical experiments demonstrate the promising performance of the proposed estimator.
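For intuition, the marginal importance sampling idea underlying the abstract can be sketched as follows. The snippet below is a minimal illustration on synthetic data, not the paper's projected balancing estimator: the weights here are arbitrary placeholders standing in for the (estimated) ratio of the target policy's discounted state-action visitation density to the data distribution, and the value estimate is the correspondingly weighted average reward.

```python
import numpy as np

# Hypothetical offline dataset: rewards observed under a behavior policy.
rng = np.random.default_rng(0)
n = 1000
rewards = rng.normal(loc=1.0, scale=0.5, size=n)

# Placeholder marginal importance weights w(s, a): the ratio of the target
# policy's discounted visitation density to the data distribution. In the
# paper these are estimated via projected state-action balancing; here they
# are simply random and normalized to have mean one.
weights = rng.uniform(0.5, 1.5, size=n)
weights /= weights.mean()

gamma = 0.9  # discount factor

# Weighted value estimate: V(pi) ~ mean_i[w_i * r_i] / (1 - gamma)
value_estimate = weights @ rewards / (n * (1.0 - gamma))
print(value_estimate)
```

With mean reward near 1 and gamma = 0.9, the estimate lands near 10; the statistical content of the paper lies in how the weights are constructed and in the resulting convergence and normality guarantees, which this toy computation does not capture.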
Funding Statement
Wong’s research was partially supported by the National Science Foundation (DMS-1711952 and CCF-1934904). Portions of this research were conducted with the advanced computing resources provided by Texas A&M High Performance Research Computing.
Acknowledgments
The authors would like to thank the anonymous referees, the Associate Editor and the Editor for their constructive comments that improved the quality of this paper.
Citation
Jiayi Wang, Zhengling Qi, Raymond K. W. Wong. "Projected state-action balancing weights for offline reinforcement learning." Ann. Statist. 51(4): 1639–1665, August 2023. https://doi.org/10.1214/23-AOS2302