Concentration bounds for temporal difference learning with linear function approximation: The case of batch data and uniform sampling

Prashanth, L. A.; Korda, Nathaniel; Munos, Rémi

arXiv:1306.2557 (cs)

[Submitted on 11 Jun 2013 (v1), last revised 24 Jan 2020 (this version, v6)]

Abstract: We propose a stochastic approximation (SA) based method with randomization of samples for policy evaluation using the least squares temporal difference (LSTD) algorithm. Our proposed scheme is equivalent to running regular temporal difference learning with linear function approximation, albeit with samples picked uniformly from a given dataset. Our method results in an $O(d)$ improvement in complexity in comparison to LSTD, where $d$ is the dimension of the data. We provide non-asymptotic bounds for our proposed method, both in high probability and in expectation, under the assumption that the matrix underlying the LSTD solution is positive definite. This assumption is easily satisfied for the pathwise LSTD variant proposed in [23]. Moreover, we establish that using our method in place of LSTD does not impact the rate of convergence of the approximate value function to the true value function. These rate results, coupled with the low computational complexity of our method, make it attractive for implementation in big data settings, where $d$ is large. A similar low-complexity alternative for least squares regression is the well-known stochastic gradient descent (SGD) algorithm, for which we provide finite-time bounds. We demonstrate empirically the practicality of our method as an efficient alternative to pathwise LSTD by combining it with the least squares policy iteration (LSPI) algorithm in a traffic signal control application. We also conduct another set of experiments that combines the SA-based low-complexity variant for least squares regression with the LinUCB algorithm for contextual bandits, using the large-scale news recommendation dataset from Yahoo.
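
The scheme described in the abstract, TD learning with linear features where each update uses a transition drawn uniformly from a fixed batch, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names `lstd_solution` and `batch_td_uniform`, the synthetic Gaussian features and rewards, the decaying step size `step0 * c / (c + k)`, and the choice to set the last next-state feature to zero (intended to mimic a pathwise-style dataset so that the LSTD matrix is positive definite). The sketch only shows that each update needs work linear in $d$, whereas the direct LSTD solve requires a $d \times d$ linear system.

```python
import numpy as np


def lstd_solution(phi, phi_next, rewards, discount):
    """Batch LSTD: build the d x d system A theta = b and solve it directly."""
    n = phi.shape[0]
    A = phi.T @ (phi - discount * phi_next) / n
    b = phi.T @ rewards / n
    return np.linalg.solve(A, b)


def batch_td_uniform(phi, phi_next, rewards, discount, n_iters, rng,
                     step0=0.2, c=100.0):
    """TD(0) with linear features; one uniformly sampled transition per update."""
    n, d = phi.shape
    theta = np.zeros(d)
    for k in range(n_iters):
        i = rng.integers(n)  # uniform draw from the fixed dataset
        td_error = rewards[i] + discount * (phi_next[i] @ theta) - phi[i] @ theta
        step = step0 * c / (c + k)  # illustrative decaying step size (assumption)
        theta += step * td_error * phi[i]  # O(d) work per iteration
    return theta


# Synthetic single-trajectory batch (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
n, d, discount = 500, 4, 0.9
phi = rng.normal(size=(n, d)) / np.sqrt(d)
# Next-state features are the features of the following sample; the last
# transition gets a zero next-state feature (assumed pathwise-style convention).
phi_next = np.vstack([phi[1:], np.zeros((1, d))])
rewards = phi @ np.ones(d) + 0.1 * rng.normal(size=n)

theta_lstd = lstd_solution(phi, phi_next, rewards, discount)
theta_td = batch_td_uniform(phi, phi_next, rewards, discount, 20000, rng)
gap = np.linalg.norm(theta_td - theta_lstd)
print(f"distance between TD iterate and LSTD solution: {gap:.4f}")
```

On this toy batch the randomized TD iterate should end up close to the LSTD solution; how close depends on the step-size constants and the number of iterations, which here are picked by hand rather than from the bounds in the paper.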

Subjects:	Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:1306.2557 [cs.LG]
	(or arXiv:1306.2557v6 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1306.2557

Submission history

From: L.A. Prashanth
[v1] Tue, 11 Jun 2013 15:42:00 UTC (32 KB)
[v2] Wed, 19 Feb 2014 00:01:53 UTC (2,053 KB)
[v3] Mon, 16 Jun 2014 13:39:28 UTC (580 KB)
[v4] Wed, 18 Jun 2014 17:13:38 UTC (580 KB)
[v5] Tue, 28 Nov 2017 14:16:23 UTC (191 KB)
[v6] Fri, 24 Jan 2020 16:44:09 UTC (188 KB)
