Abstract
Off-policy evaluation is a fundamental and challenging problem in reinforcement learning (RL). This paper focuses on value estimation of a target policy based on pre-collected data generated by a possibly different policy, under the framework of infinite-horizon Markov decision processes. Motivated by the recently developed marginal importance sampling method in RL and the covariate balancing idea in causal inference, we propose a novel estimator with approximately projected state-action balancing weights for policy value estimation. We obtain the convergence rate of these weights and show that the proposed value estimator is asymptotically normal under technical conditions. In terms of asymptotics, our results scale with both the number of trajectories and the number of decision points in each trajectory. As such, consistency can still be achieved with a limited number of subjects when the number of decision points diverges. In addition, we develop a necessary and sufficient condition for the well-posedness of the operator underlying nonparametric Q-function estimation in the off-policy setting; this condition characterizes the difficulty of Q-function estimation and may be of independent interest. Numerical experiments demonstrate the promising performance of the proposed estimator.
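For intuition, the marginal importance sampling idea underlying the abstract can be sketched as follows. The snippet below is a minimal illustration on synthetic data, not the paper's projected balancing estimator: the weights here are arbitrary placeholders standing in for the (estimated) ratio of the target policy's discounted state-action visitation density to the data distribution, and the value estimate is the correspondingly weighted average reward.

```python
import numpy as np

# Hypothetical offline dataset: rewards observed under a behavior policy.
rng = np.random.default_rng(0)
n = 1000
rewards = rng.normal(loc=1.0, scale=0.5, size=n)

# Placeholder marginal importance weights w(s, a): the ratio of the target
# policy's discounted visitation density to the data distribution. In the
# paper these are estimated via projected state-action balancing; here they
# are simply random and normalized to have mean one.
weights = rng.uniform(0.5, 1.5, size=n)
weights /= weights.mean()

gamma = 0.9  # discount factor

# Weighted value estimate: V(pi) ~ mean_i[w_i * r_i] / (1 - gamma)
value_estimate = weights @ rewards / (n * (1.0 - gamma))
print(value_estimate)
```

With mean reward near 1 and gamma = 0.9, the estimate lands near 10; the statistical content of the paper lies in how the weights are constructed and in the resulting convergence and normality guarantees, which this toy computation does not capture.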
Funding Statement
Wong’s research was partially supported by the National Science Foundation (DMS-1711952 and CCF-1934904). Portions of this research were conducted with the advanced computing resources provided by Texas A&M High Performance Research Computing.
Acknowledgments
The authors would like to thank the anonymous referees, the Associate Editor and the Editor for their constructive comments that improved the quality of this paper.
Citation
Jiayi Wang, Zhengling Qi, Raymond K. W. Wong. "Projected state-action balancing weights for offline reinforcement learning." Ann. Statist. 51(4): 1639–1665, August 2023. https://doi.org/10.1214/23-AOS2302