A Temporal-Difference Approach to Policy Gradient Estimation

Tosatto, Samuele; Patterson, Andrew; White, Martha; Mahmood, A. Rupam

Computer Science > Machine Learning

arXiv:2202.02396 (cs)

[Submitted on 4 Feb 2022 (v1), last revised 7 Jul 2022 (this version, v4)]

Abstract: The policy gradient theorem (Sutton et al., 2000) prescribes using a cumulative discounted state distribution under the target policy to approximate the gradient. In practice, most algorithms based on this theorem break this assumption, introducing a distribution shift that can cause convergence to poor solutions. In this paper, we propose a new approach that reconstructs the policy gradient from the start state without requiring a particular sampling strategy. In this form, the policy gradient calculation can be simplified in terms of a gradient critic, which can be estimated recursively thanks to a new Bellman equation of gradients. By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that sidesteps the distribution shift issue in a model-free way. We prove that, under certain realizability conditions, our estimator is unbiased regardless of the sampling strategy. We empirically show that our technique achieves a superior bias-variance trade-off and performance in the presence of off-policy samples.
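
The recursion the abstract refers to can be illustrated by differentiating the standard Bellman equation for the action-value function. The following is a sketch under assumed notation (policy \pi_\theta, discount \gamma, transition kernel p, start-state distribution \mu_0, and the symbol \Gamma for the gradient critic are all labels chosen here for illustration; the paper's own notation and exact statement may differ):

```latex
% Assumed notation: Q^{\pi}(s,a) is the action-value function of \pi_\theta,
% and J(\theta) = E_{s_0 \sim \mu_0,\, a_0 \sim \pi_\theta}[Q^{\pi}(s_0,a_0)].
\begin{align}
Q^{\pi}(s,a) &= r(s,a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s,a),\; a' \sim \pi_\theta(\cdot \mid s')}
  \big[ Q^{\pi}(s',a') \big] \\
% Differentiate both sides with respect to \theta; the reward and the
% transition kernel do not depend on \theta, and the product rule is applied
% to \pi_\theta(a' \mid s')\, Q^{\pi}(s',a').
\Gamma(s,a) := \nabla_\theta Q^{\pi}(s,a)
  &= \gamma\, \mathbb{E}_{s',\,a'}
  \big[ Q^{\pi}(s',a')\, \nabla_\theta \log \pi_\theta(a' \mid s') + \Gamma(s',a') \big] \\
% The policy gradient is then recovered from the start state alone.
\nabla_\theta J(\theta)
  &= \mathbb{E}_{s_0 \sim \mu_0,\; a_0 \sim \pi_\theta}
  \big[ Q^{\pi}(s_0,a_0)\, \nabla_\theta \log \pi_\theta(a_0 \mid s_0) + \Gamma(s_0,a_0) \big]
\end{align}
```

Under these assumptions, the second line has the same shape as a Bellman equation, with Q^{\pi}(s',a') \nabla_\theta \log \pi_\theta(a' \mid s') playing the role of a vector-valued reward, which is what makes a temporal-difference estimate of \Gamma plausible.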

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2202.02396 [cs.LG]
	(or arXiv:2202.02396v4 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2202.02396

Submission history

From: Samuele Tosatto Dr.
[v1] Fri, 4 Feb 2022 21:23:33 UTC (1,539 KB)
[v2] Sat, 11 Jun 2022 14:09:01 UTC (2,553 KB)
[v3] Tue, 28 Jun 2022 19:45:29 UTC (4,773 KB)
[v4] Thu, 7 Jul 2022 17:54:40 UTC (4,771 KB)
