Greedy-GQ with linear function approximation, originally proposed in Maei et al. (in: Proceedings of the international conference on machine learning (ICML), 2010), is a value-based off-policy algorithm for optimal control in reinforcement learning. It has a non-linear two-timescale structure with a non-convex objective function. This paper develops its tightest finite-time error bounds. We show that the Greedy-GQ algorithm converges as fast as \(\mathcal {O}({1}/{\sqrt{T}})\) under the i.i.d. setting and \(\mathcal {O}({\log T}/{\sqrt{T}})\) under the Markovian setting. We further design a variant of the vanilla Greedy-GQ algorithm using the nested-loop approach, and show that its sample complexity is \(\mathcal {O}({\log (1/\epsilon )\epsilon ^{-2}})\), which matches that of the vanilla Greedy-GQ. Our finite-time error bounds match those of the stochastic gradient descent algorithm for general smooth non-convex optimization problems, despite the additional challenge posed by the two-timescale updates. Our finite-sample analysis provides theoretical guidance on choosing step-sizes for faster convergence in practice, and suggests a trade-off between the convergence rate and the quality of the obtained policy. Our techniques provide a general approach for finite-sample analysis of non-convex two-timescale value-based reinforcement learning algorithms.
Appendix 1: Analysis for vanilla Greedy-GQ
In the following proof, \(\Vert a\Vert\) denotes the \(\ell _2\) norm if \(a\) is a vector, and \(\Vert A\Vert\) denotes the operator norm if \(A\) is a matrix. For technical convenience, we impose a projection step on both the updates of \(\theta\) and \(\omega\) with radius R: for any t, \(\Vert \theta _t\Vert \le R\) and \(\Vert \omega _t\Vert \le R\). The projection step is necessary to guarantee the stability of the algorithm. The approach developed in Srikant and Ying (2019), which bounds the parameter using its retrospective copy several time steps back, is not applicable here due to the nonlinear structure of Greedy-GQ.
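The projection step described above can be sketched as follows. This is a minimal illustration, assuming the projection is the Euclidean projection onto the ball of radius R; the function name `project_l2_ball` and the example vectors are hypothetical and not taken from the paper.

```python
import math


def project_l2_ball(v, radius):
    """Euclidean projection of v onto {x : ||x||_2 <= radius}.

    Assumption: the paper's projection is this standard ball projection.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= radius:
        # Already inside the ball: the projection leaves v unchanged.
        return list(v)
    scale = radius / norm
    # Outside the ball: rescale v so that its norm equals the radius.
    return [scale * x for x in v]


# Hypothetical iterates after an unprojected update step.
theta = project_l2_ball([3.0, 4.0], radius=1.0)   # norm 5, gets rescaled
omega = project_l2_ball([0.1, -0.2], radius=1.0)  # already feasible
```

After this step both iterates satisfy the norm constraint that the subsequent lemmas rely on.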
We first show that the objective function \(J(\theta )\) is K-smooth for \(\theta \in \{\theta : \Vert \theta \Vert \le R\}\).
Lemma 2
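\(J(\theta )\) is K-smooth: for any \(\theta _1,\theta _2\in \{\theta :\Vert \theta \Vert \le R\}\), \(\Vert \nabla J(\theta _1)-\nabla J(\theta _2)\Vert \le K\Vert \theta _1-\theta _2\Vert ,\)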
where \(K=2\gamma {\lambda ^{-1}}\big ((k_1\vert \mathcal {A}\vert R+1)(1+\gamma +\gamma Rk_1\vert \mathcal {A}\vert )+\vert \mathcal {A}\vert (r_{\max }+R+\gamma R)( 2k_1+ k_2R) \big ).\)
It follows that
Since \(C^{-1}\) is positive definite, it suffices to show that both \(\nabla \left( \mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta \right) \phi _{S,A}\right] \right)\) and \(\mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta \right) \phi _{S,A}\right]\) are Lipschitz in \(\theta\) and bounded.
It is straightforward to see that
and \(\Vert \nabla \mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta \right) \phi _{S,A}\right] \Vert \le 1+\gamma (k_1\vert \mathcal {A}\vert R+1).\) We further have that
which is from Assumption 3.
Following similar steps, we can also show that \(\mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta \right) \phi _{S,A}\right]\) is Lipschitz in \(\theta\):
Combining (14) and (16) concludes the proof. \(\square\)
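To get a feel for how the smoothness constant in Lemma 2 scales with the problem parameters, the closed-form expression for K can be evaluated directly. The sketch below only transcribes the stated formula; the function name, argument names, and the numerical values are illustrative assumptions, with \(k_1\) and \(k_2\) standing for the constants from Assumption 3.

```python
def smoothness_constant(gamma, lam, k1, k2, num_actions, R, r_max):
    """Evaluate K from Lemma 2:

    K = 2*gamma/lam * ((k1*|A|*R + 1) * (1 + gamma + gamma*R*k1*|A|)
                       + |A| * (r_max + R + gamma*R) * (2*k1 + k2*R)).
    """
    first = (k1 * num_actions * R + 1.0) * (
        1.0 + gamma + gamma * R * k1 * num_actions
    )
    second = num_actions * (r_max + R + gamma * R) * (2.0 * k1 + k2 * R)
    return 2.0 * gamma / lam * (first + second)


# Hypothetical parameter values, chosen only to make the arithmetic easy.
K_small = smoothness_constant(0.5, 1.0, 1.0, 1.0, 2, 1.0, 1.0)
K_large = smoothness_constant(0.5, 1.0, 2.0, 1.0, 2, 1.0, 1.0)
```

With all other arguments fixed, increasing \(k_1\) increases K, since every term in the formula is nondecreasing in \(k_1\).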
Recall the definition of \(G_{t+1}(\theta , \omega )\) in Sect. 3.3. The following lemma shows that \(G_{t+1}(\theta , \omega )\) is Lipschitz in \(\omega\), and \(G_{t+1}(\theta , \omega ^*(\theta ))\) is Lipschitz in \(\theta\).
Lemma 3
For any \(\omega _1,\omega _2\), \(\Vert G_{t+1}(\theta ,\omega _1)-G_{t+1}(\theta ,\omega _2)\Vert \le \gamma (\vert \mathcal {A}\vert Rk_1+1)\Vert \omega _1-\omega _2\Vert ,\) and for any \(\theta _1,\theta _2\in \{\theta :\Vert \theta \Vert \le R\}\),
where \(k_3=(1+\gamma +\gamma R\vert \mathcal {A}\vert k_1+\gamma \frac{1}{\lambda }\vert \mathcal {A}\vert (2k_1+k_2R)(r_{\max }+\gamma R+R)+\gamma \frac{1}{\lambda }(1+\vert \mathcal {A}\vert Rk_1)(1+\gamma +\gamma R\vert \mathcal {A}\vert k_1)).\)
Under Assumption 3, it can be easily shown that
It then follows that for any \(\omega _1\) and \(\omega _2\),
To show that \(G_{t+1}(\theta ,\omega ^*(\theta ))\) is Lipschitz in \(\theta\), we first show that \(\hat{\phi }_{t+1}(\theta )\) is Lipschitz in \(\theta\) following similar steps as those in (14):
We have that
where (a) can be shown following steps similar to those in (16), while (b) can be shown using
and \(\Vert \omega ^*(\theta )\Vert \le \frac{1}{\lambda }(r_{\max }+\gamma R+R).\) \(\square\)
Since \(J(\theta )\) is K-smooth, by Taylor expansion we have that
where the last inequality follows from Lemma 3.
Re-arranging the terms in (22), summing up w.r.t. t from 0 to \(T-1\), taking the expectation and applying Cauchy’s inequality implies that
We then provide the bounds on \(\mathbb {E}[\Vert \omega ^*(\theta _t)-\omega _t\Vert ^2]\) and \(\mathbb {E}\left[ \left\langle \nabla J(\theta _t),{\nabla J(\theta _t)}/{2}+G_{t+1}(\theta _t, \omega ^*(\theta _t)) \right\rangle \right]\), which we refer to as the “tracking error” and the “stochastic bias”, respectively. We define \(\zeta (\theta , O_t)\triangleq \langle \nabla J(\theta ), \frac{\nabla J(\theta )}{2}+G_{t+1}(\theta , \omega ^*(\theta )) \rangle\); then \(\mathbb {E}_{\mu }[\zeta (\theta , O_t)]=0\) for any fixed \(\theta\) when \(O_t\sim \mu\) (which does not hold under the Markovian setting). In the following lemma, we provide an upper bound on \(\mathbb {E}[\zeta (\theta , O_t)]\).
Lemma 4
Stochastic Bias. Let \(\tau _{\alpha }\triangleq \min \left\{ k: m\rho ^k \le \alpha \right\}\). If \(t \le \tau _{\alpha }\), then \(\mathbb {E}[\zeta (\theta _t,O_t)] \le k_{\zeta },\) and if \(t > \tau _{\alpha }\), then
where \(c_{\zeta }=2\gamma (1+k_1\vert \mathcal {A}\vert R)\frac{1}{\lambda }(r_{\max }+R+\gamma R)(\frac{K}{2}+k_3)+K(r_{\max }+R+\gamma R)( \frac{2\gamma }{\lambda }(1+k_1\vert \mathcal {A}\vert R)+1)\) and \(k_{\zeta }=4\gamma (1+k_1R\vert \mathcal {A}\vert )\frac{1}{\lambda }(r_{\max }+R+\gamma R)^2(2\gamma (1+k_1\vert \mathcal {A}\vert R)\frac{1}{\lambda }+1)\).
For any \(\theta _1\) and \(\theta _2\), it follows that
By Lemma 2, \(\zeta (\theta ,O_t)\) is also Lipschitz in \(\theta\): \(\vert \zeta (\theta _1,O_t)-\zeta (\theta _2,O_t)\vert \le c_{\zeta } \Vert \theta _1-\theta _2 \Vert ,\) where \(c_{\zeta }=2\gamma (1+k_1\vert \mathcal {A}\vert R)\frac{1}{\lambda }(r_{\max }+R+\gamma R)(\frac{K}{2}+k_3) +K(r_{\max }+R+\gamma R)(\gamma \frac{1}{\lambda }(1+k_1\vert \mathcal {A}\vert R)+1 +\gamma \frac{1}{\lambda }(1+Rk_1\vert \mathcal {A}\vert )).\) Thus from (25), it follows that for any \(\tau \ge 0\),
where \(c_{f_1}=r_{\max }+(1+\gamma )R+\frac{\gamma }{\lambda }(r_{\max }+(1 + \gamma )R)(1+R\vert \mathcal {A}\vert k_1)\), \(c_{g_1}=2\gamma R(1+R\vert \mathcal {A}\vert k_1)\) and \(\Vert G_{k+1}(\theta _k,\omega _k)\Vert \le c_{f_1}+c_{g_1}\).
We define an independent random variable \({\hat{O}}=({\hat{S}},{\hat{A}},{\hat{R}},{\hat{S}}')\), where \(({\hat{S}},{\hat{A}})\sim \mu\), \({\hat{S}}'\) is the subsequent state and \({\hat{R}}\) is the reward. Then \(\mathbb {E}[\zeta (\theta _{t-\tau },{\hat{O}})]=0\) by the fact that \(\mathbb {E}_{\mu }[{G_{t+1}(\theta ,\omega ^*(\theta ))}]=-\frac{1}{2}\nabla J(\theta )\). Thus for any \(\tau \le t\),
which follows from Assumption 4, and \(k_{\zeta }=4\gamma (1+k_1R\vert \mathcal {A}\vert )\frac{1}{\lambda }(r_{\max }+R+\gamma R)^2(2\gamma (1+k_1\vert \mathcal {A}\vert R)\frac{1}{\lambda }+1)\).
If \(t \le \tau _{\alpha }\), the conclusion follows from the fact that \(\vert \zeta (\theta ,O_t)\vert \le k_{\zeta }\).
If \(t > \tau _{\alpha }\), we choose \(\tau =\tau _{\alpha }\), and then \(\mathbb {E}[\zeta (\theta _t, O_t)]\le \mathbb {E}[\zeta (\theta _{t-\tau _{\alpha }},O_t)]+c_{\zeta }(c_{f_1}+c_{g_1})\sum ^{t-1}_{k=t-\tau _{\alpha }}\alpha \le k_{\zeta }\alpha +c_{\zeta }(c_{f_1}+c_{g_1})\tau _{\alpha }\alpha .\) \(\square\)
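The two cases of Lemma 4 can be made concrete with a small numerical sketch. It computes the mixing time \(\tau _{\alpha }=\min \{k: m\rho ^k\le \alpha \}\) and then evaluates the piecewise bound stated in the lemma. The function names and the constants passed in are hypothetical; \(m\) and \(\rho\) are assumed to satisfy \(m>0\) and \(0<\rho <1\).

```python
def mixing_time(step_size, m, rho):
    """Smallest integer k >= 0 such that m * rho**k <= step_size."""
    if not (0.0 < rho < 1.0) or m <= 0.0 or step_size <= 0.0:
        raise ValueError("need m > 0, 0 < rho < 1 and step_size > 0")
    k = 0
    bound = m
    while bound > step_size:
        bound *= rho
        k += 1
    return k


def stochastic_bias_bound(t, alpha, m, rho, k_zeta, c_zeta, c_f1, c_g1):
    """Piecewise upper bound on E[zeta(theta_t, O_t)] from Lemma 4."""
    tau = mixing_time(alpha, m, rho)
    if t <= tau:
        # Early iterations: only the uniform bound k_zeta is available.
        return k_zeta
    # Later iterations: the bound shrinks proportionally to alpha.
    return k_zeta * alpha + c_zeta * (c_f1 + c_g1) * tau * alpha


# Hypothetical constants, for illustration only.
tau_alpha = mixing_time(0.01, m=1.0, rho=0.5)
early = stochastic_bias_bound(3, 0.01, 1.0, 0.5, 2.0, 1.0, 1.0, 1.0)
late = stochastic_bias_bound(10, 0.01, 1.0, 0.5, 2.0, 1.0, 1.0, 1.0)
```

Because \(\tau _{\alpha }\) grows only logarithmically in \(1/\alpha\), the second case is of order \(\alpha \log (1/\alpha )\).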
The tracking error can be bounded in the following lemma.
Lemma 5
Tracking error. (proof in Appendix 1.1)
where \(Q_T=\frac{\Vert z_0\Vert ^2}{1-e^{-2\lambda \beta }}+\frac{\left( 4Rc_{f_2}\beta +2b_{g_2}\beta +4Rb_{\eta }\alpha \right) }{\left( 1-e^{-2\lambda \beta }\right) ^2} +\frac{\tau _{\beta }+1}{1-e^{-2\lambda \beta }} \left( 4Rc_{f_2}\beta +2b_{g_2}\beta +4Rb_{\eta }\alpha +c_{z}\beta ^2 \right) +c_z\beta ^2 +\frac{T}{1-e^{-2\lambda \beta }} \big (2\beta \left( 4Rc_{f_2}\beta +b_{f_2}\beta \tau _{\beta }\right) + 2\beta \left( b_{g_2}\beta +b'_{g_2}\beta \tau _{\beta }\right) +2\alpha \left( 4Rb_{\eta }\beta +b'_{\eta }\beta \tau _{\beta }\right) \big )\), and \(b_{g_2}, b'_{g_2}, b_{\eta }, b'_{\eta } \text { and } c_z\) are some constants defined in Lemmas 9 and 10.
Now we have the bounds on the stochastic bias and the tracking error.
From (23), we first have that
where \(J^*=\min _{\theta } J(\theta )\) is positive and finite, and the inequality is from \(\Vert G_{t+1}(\theta ,\omega )\Vert \le r_{\max }+\gamma R+R+\gamma R(1+\vert \mathcal {A}\vert Rk_1)\). From Lemma 4, it follows that \(\sum _{t=0}^{T-1} \alpha \mathbb {E}[\zeta (\theta _t,O_t)] \le \sum ^{\tau _{\alpha }}_{t=0}\alpha k_{\zeta } +\sum ^{T-1}_{t=\tau _{\alpha }+1} (k_{\zeta }\alpha ^2+c_{\zeta }(c_{f_1}+c_{g_1})\tau _{\alpha }\alpha ^2).\) Hence, we have that
where \(\Omega \triangleq k_{\zeta }\frac{\tau _{\alpha }+1}{T}+c_{\zeta }(c_{f_1}+c_{g_1})\tau _{\alpha }\alpha +k_{\zeta }{\alpha }+\frac{J(\theta _0)-J^*}{T\alpha }+K\alpha \left( r_{\max }+\gamma R+R+\gamma R(1+\vert \mathcal {A}\vert Rk_1) \right) ^2\). We then plug in the tracking error in Lemma 5:
where (a) is from \(\sqrt{x+y}\le \sqrt{x}+\sqrt{y}\) for any \(x,y \ge 0\). Rearranging the terms, and choosing \(\alpha\) and \(\beta\) such that \(\gamma (1+\vert \mathcal {A}\vert Rk_1)\sqrt{\frac{8\alpha ^2}{\beta (1-e^{-2\lambda \beta })}\frac{1}{\lambda ^3}(1+\gamma +\gamma \vert \mathcal {A}\vert Rk_1)^2}<\frac{1}{4}\), then
where \(V=4\gamma (1+\vert \mathcal {A}\vert Rk_1)\left( \sqrt{\frac{2Q_T}{T}+\frac{32}{1-e^{-2\lambda \beta }}\frac{\Vert R_2\Vert ^2}{\lambda \beta }}\right)\) and \(U=4\Omega\). Hence, we have that
where (a) is from \((x+y)^2\le 2x^2+2y^2\) for any \(x,y \ge 0\), and the last step is due to the fact that \(\alpha =\mathcal {O}\left( T^{-a} \right)\), \(\beta =\mathcal {O}\left( T^{-b}\right)\), \(1-e^{-2\lambda \beta }=\mathcal {O}\left( T^{-b}\right)\), \(\frac{Q_T}{T}=\mathcal {O}\left( \frac{1}{T^{1-b}}+\frac{\log T}{T^b}\right)\), \(\frac{\Vert R_2\Vert ^2}{\beta }=\mathcal {O}\left( \frac{\alpha ^4}{\beta ^2}\right) =\mathcal {O}(T^{-2a})\) which is from \(a\ge b \ge 0\). This completes the proof of Theorem 1.
Appendix 1.1: Proof of Lemma 5
Recall that \(z_t=\omega _t-\omega ^*(\theta _t)\), then
where \(f_1(\theta _t, O_t) \triangleq \delta _{t+1}(\theta _t)\phi _t-\gamma \phi _t^\top \omega ^*(\theta _t)\hat{\phi }_{t+1}(\theta _t),\) \(g_1(\theta _t, z_t, O_t) \triangleq -\gamma \phi _t^\top z_t\hat{\phi }_{t+1}(\theta _t),\) \(f_2(\theta _t,O_t) \triangleq (\delta _{t+1}(\theta _t)-\phi _t^\top \omega ^*(\theta _t))\phi _t,\) and \(g_2(z_t,O_t) \triangleq -\phi _t^\top z_t\phi _t.\) We then develop upper bounds on functions \(f_1,g_1,f_2,g_2\) as follows.
Lemma 6
For \(\Vert \theta \Vert \le R\) and \(\Vert z\Vert \le 2R\), \(\Vert f_1(\theta ,O_t)\Vert \le c_{f_1},\) \(\Vert g_1(\theta ,z,O_t)\Vert \le c_{g_1},\) \(\Vert f_2(\theta ,O_t)\Vert \le c_{f_2}\) and \(\Vert g_2(z,O_t)\Vert \le c_{g_2}\), where \(c_{f_2}=r_{\max }+(1+\gamma )R+\frac{1}{\lambda }(r_{\max }+(1 + \gamma )R)\), and \(c_{g_2}=2R\).
This lemma follows from (13), (18) and (21). \(\square\)
We then decompose the tracking error as follows
where \(\bar{g}_2(z)\triangleq -Cz\), and the inequality follows from Lemmas 6 and 3.
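The relationship between \(g_2\) and its mean \(\bar{g}_2\) can be checked numerically. The sketch below assumes that \(C=\mathbb {E}_{\mu }[\phi \phi ^\top ]\) and that features satisfy \(\Vert \phi \Vert \le 1\) (which is consistent with \(c_{g_2}=2R\)); it replaces the expectation with an average over a small, made-up set of feature vectors. All names and data are hypothetical.

```python
import math


def g2(z, phi):
    """g_2(z, O_t) = -(phi^T z) * phi for a single feature vector phi."""
    inner = sum(p * zi for p, zi in zip(phi, z))
    return [-inner * p for p in phi]


def second_moment(features):
    """Empirical stand-in for C = E[phi phi^T]."""
    n, d = len(features), len(features[0])
    return [
        [sum(f[i] * f[j] for f in features) / n for j in range(d)]
        for i in range(d)
    ]


def g2_bar(z, C):
    """bar{g}_2(z) = -C z."""
    return [-sum(C[i][j] * z[j] for j in range(len(z))) for i in range(len(z))]


def empirical_mean_g2(z, features):
    """Average of g_2(z, .) over the given feature vectors."""
    n, d = len(features), len(z)
    values = [g2(z, f) for f in features]
    return [sum(v[i] for v in values) / n for i in range(d)]


# Made-up unit-norm features and an arbitrary error vector z.
features = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
z = [0.5, -1.0]
mean_g2 = empirical_mean_g2(z, features)
expected = g2_bar(z, second_moment(features))
z_norm = math.sqrt(sum(zi * zi for zi in z))
```

The average of \(g_2\) coincides with \(-Cz\) for the empirical \(C\), and each individual \(g_2\) has norm at most \(\Vert z\Vert\) when \(\Vert \phi \Vert \le 1\), which gives the bound \(2R\) for \(\Vert z\Vert \le 2R\).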
Define \(\zeta _{f_2}(\theta ,z,O_t)\triangleq \langle z, f_2(\theta ,O_t) \rangle\), and \(\zeta _{g_2}(z,O_t)\triangleq \langle z,g_2(z,O_t)-\bar{g}_2(z)\rangle\). We then characterize the bounds on and the Lipschitz continuity of \(\zeta _{f_2}\) and \(\zeta _{g_2}\).
Lemma 7
For any \(\theta ,\theta _1,\theta _2 \in \{\theta :\Vert \theta \Vert \le R\}\) and any \(z,z_1,z_2\in \{z:\Vert z\Vert \le 2R\}\), 1) \(\vert \zeta _{f_2}(\theta ,z,O_t) \vert \le 2Rc_{f_2}\); 2) \(\vert \zeta _{f_2}(\theta _1,z_1,O_t)-\zeta _{f_2}(\theta _2,z_2,O_t) \vert \le k_{f_2}\Vert \theta _1-\theta _2 \Vert +c_{f_2}\Vert z_1-z_2\Vert\), where \(k_{f_2}=2R(1+\gamma +\gamma Rk_1\vert \mathcal {A}\vert )(1+\frac{1}{\lambda })\); 3) \(\vert \zeta _{g_2}(z,O_t) \vert \le 8R^2\); and 4) \(\vert \zeta _{g_2}(z_1,O_t)-\zeta _{g_2}(z_2,O_t) \vert \le 8R\Vert z_1-z_2\Vert\).
1) and 3) follow directly from the definition and Lemma 6. For 2), it can be shown that
where the last inequality is from the fact that both \(\delta (\theta )\) and \(\omega ^*(\theta )\) are Lipschitz.
To prove 4), we have that
Now we are ready to bound the tracking error. Note that \(\langle z_t,\bar{g}_2(z_t)\rangle =-z_t^\top C z_t\), then (31) can be bounded as follows
Taking expectation on both sides of (34), applying it recursively and using the fact that \(1-2\beta \lambda \le e^{-2\beta \lambda }\), we obtain
and \(c_z=3\left( c_{f_2}^2+c_{g_2}^2+\frac{2}{\lambda ^2}(1+\gamma +\gamma R\vert \mathcal {A}\vert k_1)^2(c_{f_1}^2+c_{g_1}^2)\right)\).
To bound (35), we provide the following lemmas.
Lemma 8
Define \(\tau _{\beta }=\min \left\{ k: m\rho ^k \le \beta \right\}\). If \(t\le \tau _{\beta }\), then \(\mathbb {E}[\zeta _{f_2}(\theta _t,z_t,O_t)]\le 2Rc_{f_2};\) and if \(t> \tau _{\beta }\), then \(\mathbb {E}[\zeta _{f_2}(\theta _t,z_t,O_t)]\le 4Rc_{f_2}\beta +b_{f_2}\tau _{\beta }\beta ,\) where \(b_{f_2}=( c_{f_2}(c_{f_2}+c_{g_2})+ (k_{f_2}(c_{f_1}+c_{g_1})+c_{f_2}\frac{1}{\lambda }(1+\gamma +\gamma R\vert \mathcal {A}\vert k_1)(c_{f_1}+c_{g_1})))\).
We first note that
where the last step is due to (21). Furthermore, due to part 2) in Lemma 7, \(\zeta _{f_2}\) is Lipschitz in both \(\theta\) and z, then we have that for any \(\tau \le t\)
where in (a), we apply (21) and Lemma 6.
Define an independent random variable \({\hat{O}}=({\hat{S}},{\hat{A}},{\hat{R}},{\hat{S}}')\), where \(({\hat{S}},{\hat{A}})\sim \mu\), \({\hat{S}}'\sim \mathsf P(\cdot \vert {\hat{S}},{\hat{A}})\) is the subsequent state, and \({\hat{R}}\) is the reward. Then it can be shown that
where (a) is due to the fact that \(\mathbb {E}[\zeta _{f_2}(\theta _{t-\tau },z_{t-\tau },{\hat{O}})]=0\), and the last inequality follows from Assumption 4.
If \(t\le \tau _{\beta }\), the result follows due to \(\vert \zeta _{f_2}(\theta ,z_t,O_t)\vert \le 2Rc_{f_2}\).
If \(t> \tau _{\beta }\), we choose \(\tau =\tau _{\beta }\) in (37). Then,
where in the last step we upper bound \(\alpha\) using \(\beta\). Note that this will not change the order of the bound. \(\square\)
Define the following constants:
Lemma 9
Let \(\eta (\theta ,z,O_t)=\langle z, -\nabla \omega ^*(\theta )^\top (G_{t+1}(\theta ,\omega ^*(\theta ))+{\nabla J(\theta )}/{2})\rangle\), then if \(t\le \tau _{\beta }\), \(\mathbb {E}[\eta (\theta _t,z_t,O_t)]\le 2Rb_{\eta }\); and if \(t> \tau _{\beta }\), then \(\mathbb {E}[\eta (\theta _t,z_t,O_t)]\le 4Rb_{\eta }\beta +b'_{\eta }\tau _{\beta }\beta\).
From the update of \(z_t\) in (30), we first have
where the last step is due to the fact that \(\Vert f_2(\theta ,O_t)\Vert \le c_{f_2}\), \(\Vert g_2(z,O_t)\Vert \le c_{g_2}\) and \(\omega ^*(\theta )\) is Lipschitz in \(\theta\) (Lemma 3).
Recall that both \({\nabla J(\theta )}/{2}\), and \(G_{t+1}(\theta ,\omega ^*(\theta ))\) are Lipschitz in \(\theta\) from (2) and (17). Also note that \(\nabla \omega ^*(\theta )=C^{-1} \nabla \mathbb {E}[\delta _{S,A,S'}(\theta )\phi _{S,A}]\), which implies that \(\Vert \nabla \omega ^*(\theta )\Vert ^2\le \frac{1}{\lambda ^2}(1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )^2\). Then \(\nabla \omega ^*(\theta )\) is Lipschitz in \(\theta\):
Consider the last term in (41). We know that \(\nabla \omega ^*(\theta )\) and \(G_{t+1}(\theta ,\omega ^*(\theta ))+\frac{\nabla J(\theta )}{2}\) are both Lipschitz in \(\theta\) from (2), (17) and (40). It can then be shown that \(\nabla \omega ^*(\theta )\left( G_{t+1}(\theta ,\omega ^*(\theta ))+\frac{\nabla J(\theta )}{2}\right)\) is also Lipschitz with constant \(\frac{1}{\lambda }(1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )\left( k_3+\frac{K}{2} \right) +\left( 1+\frac{2\gamma (1+k_1R\vert \mathcal {A}\vert )}{\lambda }+\frac{1}{\lambda }\right) (r_{\max }+\gamma R+R)\frac{2}{\lambda }(\gamma \vert \mathcal {A}\vert ( k_1+k_2R))\). Plugging this into (41), we obtain that
Then for any \(\tau \ge 0\),
Define an independent random variable \({\hat{O}}=({\hat{S}},{\hat{A}},{\hat{R}},{\hat{S}}')\), where \(({\hat{S}},{\hat{A}})\sim \mu\), \({\hat{S}}'\sim \mathsf P(\cdot \vert {\hat{S}},{\hat{A}})\) is the subsequent state, and \({\hat{R}}\) is the reward. Then it can be shown that
where (a) is due to the fact that \(\mathbb {E}[\eta (\theta _{t-\tau },z_{t-\tau },{\hat{O}})]=0\), and \(b_{\eta }\triangleq \sup _{\Vert \theta \Vert \le R} \left\| \nabla \omega ^*(\theta )^\top \left( G_{t+1}(\theta ,\omega ^*(\theta ))+{\nabla J(\theta )}/{2}\right) \right\| = {1}/{\lambda } (1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )\left( 1+{1}/{\lambda }+{2\gamma (1+k_1R\vert \mathcal {A}\vert )}/{\lambda }\right) (r_{\max }+(1+\gamma )R)\).
If \(t\le \tau _{\beta }\), the conclusion is straightforward by noting that \(\vert \eta (\theta ,z,O_t)\vert \le 2Rb_{\eta }\) for any \(\Vert \theta \Vert \le R\) and \(\Vert z\Vert \le 2R\). If \(t> \tau _{\beta }\), we choose \(\tau =\tau _{\beta }\) in (42) and (43). Then, it can be shown that
The next lemma provides a bound on \(\mathbb {E}[\zeta _{g_2}(z_t,O_t)]\).
Lemma 10
If \(t\le \tau _{\beta }\), then \(\mathbb {E}[\zeta _{g_2}(z_t,O_t)] \le b_{g_2}\); and if \(t> \tau _{\beta }\), then \(\mathbb {E}[\zeta _{g_2}(z_t,O_t)] \le b_{g_2}\beta +b'_{g_2}\tau _{\beta }\beta\), where \(b'_{g_2}=8R(c_{f_2}+c_{g_2})+\frac{1}{\lambda }(1+\gamma +\gamma R\vert \mathcal {A}\vert k_1)(c_{f_1}+c_{g_1})\) and \(b_{g_2}=16R^2\).
The proof is similar to the one for Lemma 8. \(\square\)
Now we bound the terms in (35) as follows. If \(t\le \tau _{\beta }\),
If \(t>\tau _{\beta }\), we have that
Similarly, using Lemma 10, we can bound the third term in (35) as follows. If \(t\le \tau _{\beta }\), we have that
If \(t>\tau _{\beta }\), we have that
The last step to bound the tracking error is to bound \(\sum ^t_{i=0} D_{it}\), which is shown in the following lemma.
Lemma 11
If \(t\le \tau _{\beta }\), \(\sum ^t_{i=0} D_{it}\le P_t+ \frac{2Rb_{\eta }\alpha }{1-e^{-2\lambda \beta }}\); and if \(t>\tau _{\beta }\), \(\sum ^t_{i=0} D_{it}\le P_t+2Rb_{\eta }\alpha \frac{e^{-2\lambda (t-\tau _{\beta })\beta }}{1-e^{-2\lambda \beta }}+\alpha (4Rb_{\eta }\beta +b'_{\eta }\beta \tau _{\beta })\frac{1}{1-e^{-2\lambda \beta }}\), where
We first have that
where \((*)\) follows from the Taylor expansion, and \(R_2\) denotes higher order terms with \(\Vert R_2\Vert =\mathcal {O}(\alpha ^2)\).
The second expectation on the RHS of (50) can be bounded as follows
where (a) follows from \(\langle x,y\rangle \le \frac{\lambda \beta }{8}\Vert x\Vert ^2+\frac{2}{\lambda \beta }\Vert y\Vert ^2\) for any \(x,y \in \mathbb {R}^N\), \(\Vert x+y+z\Vert ^2\le 4\Vert x\Vert ^2+4\Vert y\Vert ^2+4\Vert z\Vert ^2\) for any \(x,y,z \in \mathbb {R}^N\), and Lemma 3.
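For completeness, the first inequality used in (a) is an instance of Young's inequality. A short derivation, stated for a generic constant \(c>0\) (the symbol \(c\) is introduced here only for illustration):

\[ 0\le \left\Vert \sqrt{c}\,x-\tfrac{1}{\sqrt{c}}\,y\right\Vert ^2 = c\Vert x\Vert ^2-2\langle x,y\rangle +\tfrac{1}{c}\Vert y\Vert ^2 \quad \Longrightarrow \quad \langle x,y\rangle \le \tfrac{c}{2}\Vert x\Vert ^2+\tfrac{1}{2c}\Vert y\Vert ^2 . \]

Taking \(c=\lambda \beta /4\) gives the coefficients \(\lambda \beta /8\) and \(2/(\lambda \beta )\). Likewise, the Cauchy-Schwarz inequality gives \(\Vert x+y+z\Vert ^2\le 3\Vert x\Vert ^2+3\Vert y\Vert ^2+3\Vert z\Vert ^2\), so the constant 4 is a valid, slightly looser choice.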
Thus, we have that
With Lemma 9, this concludes the proof. \(\square\)
We then consider the tracking error \(\mathbb {E}[\Vert z_{t} \Vert ^2]\) in (35). Combining all the bounds in (45), (46), (47), (48) and Lemma 11, we have that if \(t\le \tau _{\beta }\),
where \(\Omega _1\triangleq \frac{1}{1-e^{-2\lambda \beta }} ( 4Rc_{f_2}\beta +2b_{g_2}\beta +4Rb_{\eta }\alpha +c_z\beta ^2)\); and if \(t> \tau _{\beta }\),
where \(\Omega _2\triangleq \frac{1}{1-e^{-2\lambda \beta }} (2\beta (4Rc_{f_2}\beta +b_{f_2}\beta \tau _{\beta })+ 2\beta (b_{g_2}\beta +b'_{g_2}\beta \tau _{\beta })+2\alpha (4Rb_{\eta }\beta +b'_{\eta }\beta \tau _{\beta })+c_z\beta ^2 )\). We then bound \(\sum ^{T-1}_{t=0}\mathbb {E}[\Vert z_t\Vert ^2]\). The sum is divided into two parts \(\sum ^{\tau _{\beta }}_{t=0}\mathbb {E}[\Vert z_t\Vert ^2]\) and \(\sum ^{T-1}_{t=\tau _{\beta }+1}\mathbb {E}[\Vert z_t\Vert ^2]\) as follows
Let \(Q_T\triangleq \frac{\Vert z_0\Vert ^2}{1-e^{-2\lambda \beta }}+\frac{\left( 4Rc_{f_2}\beta +2b_{g_2}\beta +4Rb_{\eta }\alpha \right) }{\left( 1-e^{-2\lambda \beta }\right) ^2}+(1+\tau _\beta )\Omega _1+(T-\tau _\beta )\Omega _2.\) Then, \(\sum ^{T-1}_{t=0}\mathbb {E}[\Vert z_t\Vert ^2]\le 2\sum ^{T-1}_{t=0}P_t+Q_T.\)
Now we plug in the exact definition of \(P_t\),
where the last step is from the double sum trick: for any \(x_t\ge 0\), \(\sum ^{T-1}_{t=0}\sum ^t_{i=0} e^{-2\lambda (t-i)\beta }x_i \le \frac{1}{1-e^{-2\lambda \beta }}\sum ^{T-1}_{t=0}x_t\). Choose \(\beta\) such that \(\big (\frac{\lambda \beta }{8}+\frac{8\alpha ^2}{\beta }\frac{1}{\lambda ^3}\gamma ^2\left( 1+k_1R\vert \mathcal {A}\vert \right) ^2(1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )^2\big )\frac{1}{1-e^{-2\lambda \beta }}<\frac{1}{4}\). Then it follows that
where the last step is from \(1-e^{-2\lambda \beta }=\mathcal {O}(\beta )\) and \(\Vert R_2\Vert ^2=\mathcal {O}(\alpha ^4)\). This hence completes the proof of Lemma 5.
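The double sum trick used in this proof can be verified numerically: swapping the order of summation attaches to each \(x_i\) a geometric series in \(e^{-2\lambda \beta }\), which is bounded by \(1/(1-e^{-2\lambda \beta })\). The sketch below checks this on a made-up nonnegative sequence; the function names and the values of \(\lambda\) and \(\beta\) are illustrative assumptions.

```python
import math


def double_sum(xs, lam, beta):
    """Left-hand side: sum_{t} sum_{i<=t} exp(-2*lam*(t-i)*beta) * x_i."""
    total = 0.0
    for t in range(len(xs)):
        for i in range(t + 1):
            total += math.exp(-2.0 * lam * (t - i) * beta) * xs[i]
    return total


def double_sum_bound(xs, lam, beta):
    """Right-hand side: (1 / (1 - exp(-2*lam*beta))) * sum_t x_t."""
    return sum(xs) / (1.0 - math.exp(-2.0 * lam * beta))


# Made-up nonnegative sequence and step-size parameters.
xs = [1.0, 0.5, 2.0, 0.0, 3.0]
lhs = double_sum(xs, lam=0.5, beta=0.1)
rhs = double_sum_bound(xs, lam=0.5, beta=0.1)
```

Since \(1-e^{-x}\le x\) for \(x\ge 0\), the prefactor \(1/(1-e^{-2\lambda \beta })\) is at least \(1/(2\lambda \beta )\), so each application of the trick costs a factor of order \(1/\beta\).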
Appendix 2: Analysis for nested-loop Greedy-GQ
Appendix 2.1: Proof of Theorem 2
Define \(\hat{G}_t(\theta ,\omega )=\frac{1}{M}\sum ^M_{i=1} G_{(BT_c+M)t+BT_c+i}(\theta ,\omega ).\) By the K-smoothness of \(J(\theta )\), and following steps similar to those in the proof of Theorem 1, we have that
By the definition, we have that
For any \(\Vert \theta \Vert \le R\) and any \(\omega _1, \omega _2\), \(\Vert G_{(BT_c+M)t+BT_c+i}(\theta ,\omega _1)-G_{(BT_c+M)t+BT_c+i}(\theta ,\omega _2) \Vert \le \gamma (1+\vert \mathcal {A}\vert Rk_1)\Vert \omega _1-\omega _2\Vert\). Hence we have that
Thus \(\Vert \hat{G}_t(\theta _t,\omega _t)-\hat{G}_t(\theta _t,\omega ^*(\theta _t))\Vert ^2\le \gamma ^2(1+\vert \mathcal {A}\vert Rk_1)^2\Vert z_t\Vert ^2.\) Plugging this in (57), we have that
The following lemma provides upper bounds on the two terms (proof in Appendix 2.2).
Lemma 12
For any \(t\ge 1\),
where \(k_h=\frac{32R^2(1+\rho m-\rho )}{1-\rho }\), \(k_G=8(r_{\max }+\gamma R+R)^2\left( 1+\frac{1}{\lambda }+\frac{2\gamma }{\lambda }(1+Rk_1\vert \mathcal {A}\vert )\right) ^2(1+\rho (m-1))\) and \(k_l=\frac{8(1+\lambda )^2(r_{\max }+R+\gamma R)^2(1+\rho m-\rho )}{1-\rho }\). If we further let \(T_c=\mathcal {O}\left( \log \frac{1}{\epsilon }\right)\) and \(B=\mathcal {O}\left( \frac{1}{\epsilon }\right)\), then \(\mathbb {E}[\Vert z_{T_c}\Vert ^2]\le \mathcal {O}\left( \epsilon \right)\).
With the bounds above in hand, we plug them into (60) and sum over t from 0 to \(T-1\). Then
which implies that
where \(L=\gamma (1+\vert \mathcal {A}\vert Rk_1)\). Now let \(T, M, B=\mathcal {O}\left( \frac{1}{\epsilon }\right)\) and \(T_c=\mathcal {O}(\log (\epsilon ^{-1}))\), we have \(\mathbb {E}[\Vert \nabla J(\theta _W)\Vert ^2] \le \epsilon ,\) with the sample complexity \(\left( M+T_cB\right) T=\mathcal {O}\left( {\epsilon ^{-2}}{\log {\epsilon }^{-1}}\right) .\)
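The sample-complexity count \((M+T_cB)T\) can be tabulated for concrete accuracy levels. The sketch below sets every hidden constant in the \(\mathcal {O}(\cdot )\) choices of \(T\), \(M\), \(B\) and \(T_c\) to 1, which is purely an assumption for illustration; the function name is hypothetical.

```python
import math


def nested_loop_sample_count(epsilon):
    """Total samples (M + T_c * B) * T with T = M = B = ceil(1/epsilon)
    and T_c = ceil(log(1/epsilon)).

    Assumption: all constants hidden in the O(.) notation are set to 1.
    """
    if not (0.0 < epsilon < 1.0):
        raise ValueError("epsilon must lie in (0, 1)")
    T = M = B = math.ceil(1.0 / epsilon)
    T_c = max(1, math.ceil(math.log(1.0 / epsilon)))
    return (M + T_c * B) * T


# Accuracy levels chosen as powers of two so that 1/epsilon is exact.
count_coarse = nested_loop_sample_count(0.25)
count_fine = nested_loop_sample_count(2.0 ** -10)
```

Each outer iteration uses \(T_c\) inner loops of \(B\) samples plus one batch of \(M\) samples, which is where the extra \(\log (1/\epsilon )\) factor over \(\epsilon ^{-2}\) comes from.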
Appendix 2.2: Proof of Lemma 12
Define \(z_{t,t_c}=\omega _{t,t_c}-\omega ^*(\theta _t)\). Then by the update of \(\omega _{t,t_c}\), we have that for any \(t\ge 0\),
where \(l_{t,t_c,i}(\theta _t)=(\delta _{(BT_c+M)t+Bt_c+i}(\theta _t)-\phi ^\top _{(BT_c+M)t+Bt_c+i-1}\omega ^*(\theta _t))\phi _{(BT_c+M)t+Bt_c+i-1}\), and \(h_{t,t_c,i}(z_{t,t_c})=\phi _{(BT_c+M)t+Bt_c+i-1}^\top z_{t,t_c}\phi _{(BT_c+M)t+Bt_c+i-1}\). We also define the expectation of the above two functions under the stationary distribution for any fixed \(\theta\) and z: \(\bar{l}(\theta )=\mathbb {E}_{\mu }[l_{t,t_c,i}(\theta )]=0\) and \(\bar{h}(z)=\mathbb {E}_{\mu }[h_{t,t_c,i}(z)]=Cz\). We then have that
where (a) is from \(\langle z, \bar{h}(z)\rangle =z^\top C z\ge \lambda \Vert z\Vert ^2\), \(\Vert \bar{h}(z)\Vert ^2=z^\top C^\top C z \le \Vert z\Vert ^2\) for any \(z\in \mathbb {R}^N\), and \(\langle x, y\rangle \le \frac{\lambda }{4}\Vert x\Vert ^2+ \frac{1}{\lambda }\Vert y\Vert ^2\) for any \(x,y \in \mathbb {R}^N\). Recall that \(\mathcal {F}_t\) is the \(\sigma\)-field generated by the randomness until \(\theta _t\) and \(\omega _t\), hence taking expectation conditioned on \(\mathcal {F}_t\) on both sides implies that
From Lemma 13, it follows that \(\mathbb {E}[\Vert z_{t,t_c+1}\Vert ^2\vert \mathcal {F}_t] \le (1+4\beta ^2-\beta \lambda )\mathbb {E}[\Vert z_{t,t_c} \Vert ^2\vert \mathcal {F}_t]+\left( 4\beta ^2+\frac{2\beta }{\lambda }\right) \frac{k_l+k_h}{B}.\) Choosing \(\beta <\frac{\lambda }{4}\) and applying the inequality recursively, it follows that
which is from \(1-x\le e^{-x}\) for any \(x>0\) and \(\Vert z_{t,0} \Vert ^2\le 4R^2\). Thus, let \(T_c=\mathcal {O}\left( \log \frac{1}{\epsilon }\right) ,B=\mathcal {O}\left( \frac{1}{\epsilon }\right)\), then \(\mathbb {E}[\Vert z_{t}\Vert ^2]\le \mathcal {O}\left( \epsilon \right)\). This completes the proof of (61).
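The effect of \(T_c\) and \(B\) on the inner-loop tracking error can be seen by unrolling the one-step inequality from Lemma 13, starting from the initial bound \(4R^2\). The sketch below iterates the scalar upper bound only (not the algorithm itself); the function names and all numerical values are hypothetical.

```python
import math


def tracking_recursion_bound(beta, lam, R, k_l, k_h, B, T_c):
    """Unroll b <- (1 + 4*beta^2 - beta*lam) * b
                    + (4*beta^2 + 2*beta/lam) * (k_l + k_h) / B
    for T_c steps, starting from b = 4 * R^2."""
    if not (0.0 < beta < lam / 4.0):
        raise ValueError("need 0 < beta < lam / 4 for a contraction")
    q = 1.0 + 4.0 * beta ** 2 - beta * lam
    noise = (4.0 * beta ** 2 + 2.0 * beta / lam) * (k_l + k_h) / B
    bound = 4.0 * R ** 2
    for _ in range(T_c):
        bound = q * bound + noise
    return bound


def closed_form_bound(beta, lam, R, k_l, k_h, B, T_c):
    """Geometric-series closed form of the same recursion."""
    q = 1.0 + 4.0 * beta ** 2 - beta * lam
    noise = (4.0 * beta ** 2 + 2.0 * beta / lam) * (k_l + k_h) / B
    return q ** T_c * 4.0 * R ** 2 + noise * (1.0 - q ** T_c) / (1.0 - q)


# Hypothetical constants: lam = 0.5, beta = 0.1 < lam / 4.
args = dict(beta=0.1, lam=0.5, R=1.0, k_l=1.0, k_h=1.0)
small_batch = tracking_recursion_bound(B=100, T_c=2000, **args)
large_batch = tracking_recursion_bound(B=1000, T_c=2000, **args)
```

The contraction term decays geometrically in \(T_c\), while the remaining term scales as \(1/B\); this is why \(T_c=\mathcal {O}(\log (1/\epsilon ))\) and \(B=\mathcal {O}(1/\epsilon )\) suffice for an \(\mathcal {O}(\epsilon )\) tracking error.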
Appendix 2.3: Lemma 13 and its proof
We now present bounds on the “variance terms” in (66).
Lemma 13
Consider the Markovian setting, then
Note that \(\bar{l}(\theta )=\mathbb {E}_{\mu }[l_{t,t_c,i}(\theta )]=0\), thus
which is due to the fact that \(\vert l_{s,a,s'}(\theta )\vert \le (1+\lambda )(r_{\max }+R+\gamma R)\) for any \((s,a,s')\) and \(\Vert \theta \Vert \le R.\)
For the second part, we first consider the case \(i>j\). Let \(X_j\) be the \((BT_ct+Mt+Bt_c+j)\)-th sample and \(X_i\) be the \((BT_ct+Mt+Bt_c+i)\)-th sample, and we denote the \(\sigma\)-field generated by all the randomness until \(X_j\) by \(\mathcal {F}_{t,t_c,j}\), then
where the last inequality is from the geometric uniform ergodicity of the MDP. Thus we have that
Similarly we can show the other two inequalities. \(\square\)
