
Stochastic Optimization Algorithms for

Instrumental Variable Regression with Streaming Data

Xuxing Chen♯ * Abhishek Roy♯ † Yifan Hu‡ Krishnakumar Balasubramanian§

May 31, 2024



Abstract
We develop and analyze algorithms for instrumental variable regression by viewing the problem
as a conditional stochastic optimization problem. In the context of least-squares instrumental variable
regression, our algorithms neither require matrix inversions nor mini-batches, and provide a fully online
approach for performing instrumental variable regression with streaming data. When the true model is
linear, we derive rates of convergence in expectation that are of order O(log T/T) and O(1/T^{1-ι}) for
any ι > 0, under the availability of two-sample and one-sample oracles, respectively, where
T is the number of iterations. Importantly, under the availability of the two-sample oracle, our procedure
avoids explicitly modeling and estimating the relationship between the confounder and the instrumental
variables, demonstrating the benefit of the proposed approach over recent works based on reformulating
the problem as a minimax optimization problem. Numerical experiments are provided to corroborate the
theoretical results.

1 Introduction
Instrumental variable analysis is widely used in fields like econometrics, health care, social science, and
online advertisement to estimate the causal effect of a random variable X on an outcome variable Y when
an unobservable confounder influences both. By identifying an instrumental variable correlated with X
but unrelated to the confounders, researchers can isolate the exogenous variation in X and estimate
a causal relationship between X and Y. In the context of regression, Instrumental Variable Regression
(IVaR) addresses endogeneity issues that arise when an independent variable is correlated with the error term in the
regression model, by leveraging an instrumental variable Z that is correlated with X but independent of the error. In this paper, we
focus on the following statistical model:

Y = g_{θ_*}(X) + ϵ_1  with  X = h_{γ_*}(Z) + ϵ_2,   (1)

where X ∈ R^{d_x} and ϵ_1 are correlated, and ϵ_2 is a centered unobserved noise (independent of Z ∈ R^{d_z}),
leading to confounding between X and Y ∈ R. Here ϵ_1 and ϵ_2 are dependent, and θ_* and
γ_* are the true parameters of the respective functions g and h. Our goal is to design efficient algorithms that
recover θ_* from the data.
*Department of Mathematics, University of California, Davis. Email: xuxchen@ucdavis.edu.
†Halıcıoğlu Data Science Institute, University of California, San Diego. Email: a2roy@ucsd.edu.
‡College of Management of Technology, EPFL, and Department of Computer Science, ETH Zurich, Switzerland. Email: yifan.hu@epfl.ch. YH is supported by NCCR Automation of the Swiss National Science Foundation.
§Department of Statistics, University of California, Davis. Email: kbala@ucdavis.edu. KB is supported by NSF grant DMS-2053918.
♯XC and AR contributed equally to this work.

Traditionally, IVaR algorithms are based on two-stage estimation procedures, where we first regress X on Z
to obtain an estimator X̂, and then regress Y on X̂; the essence is that X̂ is independent of the noise ϵ_1, and
thus the aforementioned endogeneity due to the unknown confounder is eliminated. A vast literature has been devoted
to understanding two-stage approaches (Hall and Horowitz, 2005; Darolles et al., 2011; Hartford et al.,
2017), with the parametric two-stage least-squares (2SLS) procedure being the most canonical one (Angrist
and Imbens, 1995). The main drawback of this approach is that the second-stage regression problem is
affected by the estimation error from the first-stage regression. In fact, Angrist
and Pischke (2009) call the first-stage regression the "forbidden regression", due to the concerns in estimating
a nuisance parameter.
Considering the squared loss function, Muandet et al. (2020) formulate the IVaR problem as a conditional stochastic optimization problem (Hu et al., 2020b):

min_{g ∈ G} F(g) := E_Z[E_{Y|Z}[(Y − E_{X|Z}[g(X)])^2]].   (2)

However, Muandet et al. (2020) did not solve problem (2) directly; instead, they resorted to reformulating (2)
as a minimax optimization problem. Indeed, they mention explicitly in their work that "it remains cumbersome
to solve (2) directly because of the inner expectation". They then leverage the Fenchel conjugate of
the squared loss, leading to a minimax optimization problem with maximization over a continuous functional space.
Following Dai et al. (2017), Muandet et al. (2020) propose to use a reproducing kernel Hilbert space (RKHS)
to handle the maximization over the continuous functional space. See also Lewis and Syrgkanis (2018); Bennett
et al. (2019); Dikkala et al. (2020); Liao et al. (2020); Bennett et al. (2023) for similar minimax approaches.
The issue with such an approach is that approximating the dual variable via maximization over a continuous
functional space inevitably introduces approximation error. Hence, although there is no explicit nuisance-parameter
estimation step as in the two-stage approach, there is an implicit one, which makes the minimax
approach less appealing as an alternative to two-stage procedures.
In this work, contrary to the claim made in Muandet et al. (2020) that problem (2) is cumbersome
to solve, we design and analyze efficient streaming algorithms that directly solve the conditional stochastic
optimization problem in (2). A direct application of the methods of Hu et al. (2020b) for solving (2) is possible,
yet their approach utilizes nested sampling, i.e., for each sample of Z, Hu et al. (2020b) generate a batch
of samples of X from P(X | Z) to reduce the bias in estimating the composition of the non-linear loss
with the conditional expectation. Thus, their methods are not suitable for the streaming setting that we are
interested in. Considering (2), we first parameterize the function class G := {g(θ; X) | θ ∈ R^{d_θ}}. Writing
F(θ) for F(g(θ; ·)) with a slight abuse of notation, we observe that the gradient ∇F(θ) admits the following form:

∇F(θ) = E_Z[(E_{X|Z}[g(θ; X)] − E_{Y|Z}[Y]) ∇_θ E_{X|Z}[g(θ; X)]],   (3)

which implies that one does not need the nested sampling technique to reduce the bias. However, the
presence of a product of two conditional expectations still causes significant challenges in developing
stochastic estimators of the above gradient in the streaming setting. In this work, we overcome this
challenge and develop two algorithms that are applicable in the streaming data setting, avoiding the need to
generate batches of samples of X from P(X | Z).

Contributions. We make the following contributions in this work.

• Two-sample oracles: Our first algorithm leverages the observation that if we have access to a two-sample
oracle that outputs two samples X and X′ that are independent conditioned on the instrument Z,
we can immediately construct an unbiased estimator of the gradient in (3). Based on
this crucial observation, we propose the Two-Sample One-stage Stochastic Gradient IVaR (TOSG-IVaR)
method (Algorithm 1), which avoids having to explicitly estimate or model the relationship between Z and X,
thereby overcoming the "forbidden regression" problem. Under standard statistical model assumptions,
for the case when g is a linear model, we establish a rate of convergence of order O(log T/T) for the
proposed method, where T is the overall number of iterations; see Theorem 1.

• One-sample oracles: In the case when we do not have the aforementioned two-sample oracle, we estimate
the stochastic gradient in (3) by using the streaming data to estimate one of the conditional expectations,
and the corresponding prediction to estimate the other, resulting in the One-Sample Two-stage
Stochastic Gradient IVaR (OTSG-IVaR) method (Algorithm 2). Assuming further that X depends
linearly on the instrument Z, we establish a rate of convergence of order O(1/T^{1-ι}) for any ι > 0; see
Theorem 2.

1.1 Literature Review


IVaR analysis. Instrumental variable analysis has a long history, starting from the early works by Wright
(1928) and Reiersøl (1945). Several works considered the aforementioned two-stage procedure for IVaR; a
summary can be found in the work by Angrist and Pischke (2009). Nonparametric approaches based on
wavelets, splines, reproducing kernels, and deep neural networks can be found, for example, in the works
by Hartford et al. (2017); Singh et al. (2019); Bennett et al. (2019); Muandet et al. (2020); Mastouri et al.
(2021); Xu et al. (2021); Zhu et al. (2022); Peixoto et al. (2024). Another popular approach for IVaR is via
the Generalized Method of Moments (GMM); see, for example, Chen and Pouzo (2012); Bennett et al. (2019);
Dikkala et al. (2020) for an overview. Such approaches essentially reformulate the problem as a minimax
problem and hence suffer from the aforementioned "forbidden regression" problem.

Identifiability conditions for IVaR. Several works in the literature have also focused on establishing the
identifiability conditions for IVaR in the parametric and the nonparametric setting. Regardless of the proce-
dure used, they are invariably based on certain source conditions motivated by the inverse problems literature
(see, for example, Carrasco et al. (2007); Chen and Reiss (2011); Bennett et al. (2023)) or the related prob-
lem of completeness conditions, which posits that the conditional expectation operator is one-to-one (Babii
and Florens, 2017; Liao et al., 2020). Semi-parametric identifiability has also been considered recently in the work
of Cui et al. (2023). Our focus in this work is not on identifiability; for the formulation (2) that
we consider, we adopt the necessary conditions for identifiability provided by Muandet et al. (2020).

Stochastic optimization with nested expectations. Recently, much attention in the stochastic optimization
literature has focused on optimizing a nested composition of T expectation functions. Sample average approximation
algorithms in this context are considered in the works of Ermoliev and Norkin (2013) and Hu
et al. (2020a). Optimal iterative stochastic optimization algorithms for the case of T = 2 were derived
by Ghadimi et al. (2020). For the general T ≥ 1 case, Wang et al. (2017) provided sub-optimal rates,
whereas Balasubramanian et al. (2022) derived optimal rates; see also Zhang and Xiao (2021) and Chen
et al. (2021) for related works under stronger assumptions, and Ruszczynski (2021) for similar asymptotic
results. While the above works require certain independence assumptions on the randomness across
the different compositions, Hu et al. (2020b, 2024) studied the case of T = 2 where the randomness
is generically dependent. They termed this problem setting conditional stochastic optimization, which
is the framework that the IVaR problem in (2) falls in. Compared to prior works, e.g., Ghadimi et al.
(2020) and Balasubramanian et al. (2022), in order to handle the dependency between the levels, Hu et al.
(2020b) require mini-batches in each iteration, making their algorithm not immediately applicable to the
purely streaming setting. In this work, we show that despite (2) being a conditional stochastic
optimization problem, mini-batches are not required, due to the additional favorable quadratic structure available
in IVaR.

Streaming IVaR. Venkatraman et al. (2016) and Della Vecchia and Basu (2024) analyzed streaming versions
of 2SLS in the online¹ and adversarial settings. Focusing on linear models, Venkatraman et al. (2016) provide
a preliminary asymptotic analysis assuming access to efficient no-regret learners, while Della Vecchia
and Basu (2024) provide regret bounds under the strong assumption that the instrument is almost surely
bounded. Furthermore, our algorithms have significantly improved per-iteration and memory complexity
compared to Della Vecchia and Basu (2024); see Sections A and B for details. Chen et al. (2023) developed
stochastic optimization algorithms for the GMM formulation and provided an asymptotic analysis. Their algorithm
requires access to an offline dataset for initialization and is hence not fully online. The above works
(i) do not focus on avoiding the forbidden regression problem and (ii) do not view IVaR through the conditional
stochastic optimization lens, as we do.

2 Two-sample One-stage Stochastic Gradient Method for IVaR


Recall that our goal is to solve the objective function given in (2). By Muandet et al. (2020, Theorem 4), the
optimal solution of (2) gives the true underlying causal relationship under the following assumption.
Assumption 2.1. (Identifiability Assumption)
• The conditional distribution PZ|X is continuous in Z for any value of X.
• The function class G := {g(θ; X) | θ ∈ Rdθ } is correctly specified, i.e., it includes the true underlying
relationship between X and Y .
Notice that both assumptions are standard in the IVaR literature (Newey and Powell, 2003; Chen and
Pouzo, 2012; Muandet et al., 2020), and they make the objective in (2) meaningful for IVaR. However,
Muandet et al. (2020) resort to reformulating the objective in (2) as a minimax optimization problem,
as described in Section 1. While their original motivation was to avoid the two-stage estimation procedure
and the "forbidden regression", their minimax reformulation ends up solving a complicated
approximation of the original objective, which in turn requires characterizing a non-trivial approximation error.

Algorithm and Analysis. Our aim in this work is to directly solve the original problem in (2), leveraging
the structure provided by the quadratic loss. Given the gradient formulation in (3), a natural way to
build an unbiased gradient estimator is to generate X and X′, two independent samples from the conditional
distribution P_{X|Z} for a given realization of Z, and to generate one sample of Y from the conditional
distribution P_{Y|X}. Then, an unbiased gradient estimator is

v(θ) = (g(θ; X) − Y) ∇_θ g(θ; X′).   (4)
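To see why (4) is unbiased, note that conditioned on Z, the pair (X, Y) is independent of X′, so the conditional expectation of the product factors:

E[v(θ)] = E_Z[E[(g(θ; X) − Y) ∇_θ g(θ; X′) | Z]]
        = E_Z[(E_{X|Z}[g(θ; X)] − E_{Y|Z}[Y]) ∇_θ E_{X|Z}[g(θ; X)]] = ∇F(θ).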

This can be plugged into the standard stochastic gradient descent algorithm, which gives us the Two-Sample
One-stage Stochastic Gradient IVaR (TOSG-IVaR) method illustrated in Algorithm 1. In particular, the
algorithm never requires estimating (or modeling) the relationship between X and Z, as needed in the two-stage
procedure (Angrist and Pischke, 2009) and the minimax-formulation-based procedures (Muandet et al.,
2020; Lewis and Syrgkanis, 2018; Bennett et al., 2019; Dikkala et al., 2020; Liao et al., 2020; Bennett et al.,
2023). Furthermore, this viewpoint not only provides a novel algorithm for performing IV regression, but
also suggests a novel data collection mechanism for the practical implementation of IVaR. In addition, such
a two-sample gradient method is not very restrictive when the instrumental variable Z takes values in a
discrete set. In this case, to implement the two-sample oracle, it is enough to simply pick two sets of samples
(X, Y, Z) and (X′, Y′, Z) for which Z has repeated observations (which is possible when Z is a discrete
random variable) from a pre-collected dataset.

¹Their notion of online is from the literature on online learning (Shalev-Shwartz et al., 2012).

Algorithm 1 Two-sample One-stage Stochastic Gradient-IVaR (TOSG-IVaR)
Input: number of iterations T, stepsizes {α_t}_{t=1}^T, initial iterate θ_1.
1: for t = 1 to T do
2:   Sample Z_t, sample X_t and X_t′ independently from P_{X|Z_t}, and sample Y_t from P_{Y|X_t}.
3:   Update θ_t:
       θ_{t+1} = θ_t − α_{t+1}(g(θ_t; X_t) − Y_t) ∇_θ g(θ_t; X_t′).
4: end for
Output: θ_T.
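For concreteness, the following is a minimal Python sketch of Algorithm 1 for the linear case g(θ; X) = X^⊤θ; the oracle interface and the names `two_sample_oracle`, `d_x`, and `alpha` are illustrative, with the constant step size chosen as in Theorem 1 below when μ is known.

```python
import numpy as np

def tosg_ivar(two_sample_oracle, d_x, T, alpha):
    """Sketch of TOSG-IVaR (Algorithm 1) for g(theta; X) = X^T theta.

    `two_sample_oracle` is a hypothetical callable returning (X, X_prime, Y),
    where X and X_prime are drawn independently from P(X | Z) for a fresh Z,
    and Y is drawn from P(Y | X)."""
    theta = np.zeros(d_x)  # initial iterate theta_1
    for _ in range(T):
        X, X_prime, Y = two_sample_oracle()
        # unbiased stochastic gradient v(theta) = (g(theta; X) - Y) * X'
        theta = theta - alpha * (X @ theta - Y) * X_prime
    return theta
```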

To demonstrate the convergence rate of Algorithm 1, we first
consider the case when g is a linear function, i.e., g(θ; X) = X^⊤θ. We make the following assumptions.
Assumption 2.2. Suppose there exists μ > 0 such that E_Z[E_{X|Z}[X] · E_{X|Z}[X]^⊤] ⪰ μI.

Assumption 2.3. Let (ϑ_1, ϑ_2, ϑ_3, ϑ_4) ∈ R_+^4. For any Z, for X and X′ i.i.d. generated from P_{X|Z}, and Y
generated from P_{Y|X}, and for constants C_x, C_y, C_xx, C_yx > 0, we have

E[∥X′X^⊤ − E_{X|Z}[X] E_{X|Z}[X]^⊤∥^2] ≤ C_x d_x^{ϑ_1},   (5)
E[∥Y X′ − E_{Y|Z}[Y] E_{X|Z}[X]∥^2] ≤ C_y d_x^{ϑ_2},   (6)
E[∥E_{X|Z}[X] · E_{X|Z}[X]^⊤ − E_Z[E_{X|Z}[X] · E_{X|Z}[X]^⊤]∥^2] ≤ C_xx d_z^{ϑ_3},   (7)
E[∥E_{Y|Z}[Y] · E_{X|Z}[X] − E_Z[E_{Y|Z}[Y] · E_{X|Z}[X]]∥^2] ≤ C_yx d_z^{ϑ_4},   (8)

where ∥·∥ denotes the Euclidean norm for a vector and the operator norm for a matrix.
The above assumptions are mild moment conditions on the involved random variables. The
following result demonstrates that Assumptions 2.2 and 2.3 are naturally satisfied even under a non-linear
modeling assumption in (1). We defer its proof to Section D.3.
Lemma 1. Suppose there exist θ_* ∈ R^{d_x}, γ_* ∈ R^{d_z×d_x}, a non-linear map ϕ : R^{d_x} → R^{d_x}, and a positive
semi-definite matrix Σ ∈ R^{d_z×d_z} such that

E_Z[ϕ(γ_*^⊤ Z) · ϕ(γ_*^⊤ Z)^⊤] ⪰ μI,   E[∥ϕ(γ_*^⊤ Z)∥^2] = O(d_x),
Z ∼ N(0, Σ),  X = ϕ(γ_*^⊤ Z) + ϵ_2,  Y = θ_*^⊤ X + ϵ_1,  ϵ_2 ∼ N(0, σ_{ϵ_2}^2 I_{d_x}),  ϵ_1 ∼ N(0, σ_{ϵ_1}^2),   (9)

where ϵ_1, ϵ_2 are independent of Z and

E[ϵ_1^2 ∥ϵ_2∥^2] ≤ σ_{ϵ_1,ϵ_2}^2 d_x,   E[∥ϕ(γ_*^⊤ Z) · ϕ(γ_*^⊤ Z)^⊤ − E[ϕ(γ_*^⊤ Z) · ϕ(γ_*^⊤ Z)^⊤]∥^2] ≤ C d_z,   (10)

then Assumptions 2.2 and 2.3 hold with ϑ_1 = ϑ_2 = 2 and ϑ_3 = ϑ_4 = 1. If ϕ is the identity map, then the
conditions involving ϕ become γ_*^⊤ Σ γ_* ⪰ μI, tr(γ_*^⊤ Σ γ_*) = O(d_x), and E[∥ZZ^⊤ − Σ∥^2] ≤ C d_z.


Assumption 2.4. The tuples (Z_t, X_t, X_t′, Y_t) are independent and identically distributed across t.
The above assumption is standard in the stochastic approximation, statistics and econometrics literature.
It could be further relaxed to Markovian-type dependency assumptions, following techniques in the works
of Duchi et al. (2012); Sun et al. (2018); Even (2023); Roy et al. (2022); we leave a detailed examination of
the Markovian streaming setup as future work. Under the above assumptions, we have the following result
demonstrating the last-iterate global convergence of Algorithm 1.

Theorem 1. Suppose Assumptions 2.2, 2.3, and 2.4 hold. In Algorithm 1, define σ_1^2 := 2C_x d_x^{ϑ_1} + 2C_xx d_z^{ϑ_3}
and σ_2^2 := C_y d_x^{ϑ_2} + C_yx d_z^{ϑ_4}, and set α_t ≡ α = log T/(μT) ≤ μ/(μ^2 + 3σ_1^2). Then, we have

E[∥θ_T − θ_*∥^2] ≤ E[∥θ_0 − θ_*∥^2]/T + 3∥θ_*∥^2(σ_1^2 + σ_2^2) log T/(μ^2 T).
Proof techniques. In the analysis of Theorem 1, the following decomposition (see (18) for the derivation)
plays a crucial role:

θ_{t+1} − θ_* = A_t + α_{t+1} B_t,
A_t = θ_t − α_{t+1} E_Z[E_{X|Z}[X] · E_{X|Z}[X]^⊤] θ_t + α_{t+1} E_Z[E_{Y|Z}[Y] · E_{X|Z}[X]] − θ_*,
B_t = −(X_t′ X_t^⊤ − E_Z[E_{X|Z}[X] · E_{X|Z}[X]^⊤]) θ_t + (Y_t X_t′ − E_Z[E_{Y|Z}[Y] · E_{X|Z}[X]]),

where A_t corresponds to the deterministic component, and B_t corresponds to the stochastic component arising
from the use of stochastic gradients. Standard assumptions on the variance of the stochastic gradient
made in the stochastic optimization literature include the uniformly bounded variance assumption (Lan,
2020) and the expected smoothness condition (Khaled and Richtárik, 2020). In the IVaR setup, such standard
assumptions do not hold, as θ_t can potentially be unbounded and thus the gradient estimator can be
unbounded. Hence, we establish our results under natural statistical assumptions arising in the context of
the IVaR problem, which forms the main novelty in our analysis. Furthermore, compared to Muandet et al.
(2020), notice that we use two samples of X from the conditional distribution P_{X|Z} and achieve an Õ(1/T)
last-iterate convergence rate to the global optimal solution, which is the true underlying causal relationship
under Assumption 2.1. In comparison, Muandet et al. (2020) only provide an asymptotic convergence result to
the optimal solution of an approximation problem.
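Concretely, the decomposition yields the one-step recursion used in the proof (see (23) in Appendix D.1),

E[∥θ_{t+1} − θ_*∥^2] ≤ (1 − α_{t+1}μ) E[∥θ_t − θ_*∥^2] + 3α_{t+1}^2 (σ_1^2 + σ_2^2) ∥θ_*∥^2,

which, unrolled with the constant step size α = log T/(μT), gives the O(log T/T) rate of Theorem 1.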

Additional discussion. It is interesting to explore losses beyond the squared loss (for example, to
handle the classification setting (Centorrino and Florens, 2021)), potentially using Multilevel Monte Carlo
(MLMC) based stochastic gradient estimators. While Hu et al. (2021) develop such algorithms, the main
challenge is how to avoid the mini-batches required in their work by leveraging the problem structure in
instrumental variable analysis. Furthermore, in the case when g(θ; X) is parametrized by a non-linear
model, for instance a neural network, we provide local convergence guarantees under additional stronger
conditions typically made in the stochastic optimization literature.

Assumption 2.5. Let the following assumptions hold:

• The function F(θ) is ℓ-smooth.
• The iterates {θ_t}_{t=1}^{T+1} generated by Algorithm 1 lie in a compact set A.
• The conditional random variables X|Z and Y|Z have bounded variance for any Z, i.e., there exists σ > 0 such that

  E[∥X − E[X | Z]∥^2 | Z] ≤ σ^2,   E[∥Y − E[Y | Z]∥^2 | Z] ≤ σ^2.

 
Proposition 1. Suppose Assumptions 2.1, 2.4, and 2.5 hold. Choosing α_t ≡ α = O(1/√T), for Algorithm 1
we have

min_{1≤t≤T} E[∥∇F(θ_t)∥^2] = O(1/√T).

The proof of the proposition is immediate. Note that under Assumption 2.5, we can deduce that the
unbiased gradient estimator v(θ) = (g(θ; X) − Y)∇_θ g(θ; X′) has bounded variance, since

Var(v(θ)) = Var(g(θ; X) − Y) Var(∇_θ g(θ; X′))
  + Var(g(θ; X) − Y) (E[∇_θ g(θ; X′)])^2 + Var(∇_θ g(θ; X′)) (E[g(θ; X) − Y])^2 ≤ σ_v^2,

where the variances and expectations are taken conditionally on Z and θ, and σ_v > 0 is a constant that depends
only on σ, the function g, and the compact set A in Assumption 2.5. One can then directly follow the analysis
of non-convex stochastic optimization (see, for example, Ghadimi and Lan (2013, Theorem 2.1)) to obtain
Proposition 1. Relaxing Assumption 2.5 (typically made in the stochastic optimization literature) to
more natural assumptions on the statistical model, and obtaining a result as in Theorem 1 for the non-convex
setting, is left as future work.

3 One-sample Two-stage Stochastic Gradient Method for IVaR


We now examine designing a streaming IVaR algorithm with access to the classical one-sample oracle, i.e.,
at each time point t we observe one sample (X_t, Y_t, Z_t) from the stream. Note that in this case, using the
same X_t (instead of X_t′) in (4) makes the stochastic gradient estimator biased.

Intuition. Consider the case of linear models, i.e., Y = θ_*^⊤ X + ϵ_1 with X = γ_*^⊤ Z + ϵ_2, where θ_* ∈ R^{d_x}
and γ_* ∈ R^{d_z×d_x}, as also considered in Lemma 1. Recall the true gradient in (3) and the stochastic gradient
estimator of Algorithm 1 in (4). Since we no longer have X_t′, we replace the term X_t′ with the predicted mean
of X_t given Z_t. Suppose that γ_* is known. We specifically replace ∇_θ g(θ_t; X_t′) = X_t′ by E[X_t | Z_t] = γ_*^⊤ Z_t.
In such a case, we indeed have an unbiased gradient estimator:

E_t[γ_*^⊤ Z_t (X_t^⊤ θ_t − Y_t)] = E_t[E[X_t | Z_t](E[X_t | Z_t]^⊤ θ_t − E[Y_t | Z_t])]
  = E_t[γ_*^⊤ Z_t Z_t^⊤ γ_* (θ_t − θ_*)] = γ_*^⊤ Σ_Z γ_* (θ_t − θ_*) = ∇_θ F(θ_t),

where E_t[·] denotes the conditional expectation w.r.t. the filtration generated by {γ_1, θ_1, γ_2, θ_2, ..., γ_t, θ_t}.
In reality, γ_* is unknown beforehand. Hence, we estimate γ_* using an online procedure and replace
∇_θ g(θ_t; X_t′) by γ_t^⊤ Z_t instead of γ_*^⊤ Z_t. This leads to the following updates:

θ_{t+1} = θ_t − α_{t+1} γ_t^⊤ Z_t (X_t^⊤ θ_t − Y_t),   γ_{t+1} = γ_t − β_{t+1} Z_t (Z_t^⊤ γ_t − X_t^⊤).   (11)

A closer inspection reveals that the updates in (11) can diverge until γ_t is close enough to γ_*. This fact is
easy to see from the following expansion of θ_{t+1} − θ_*:

θ_{t+1} − θ_* = Q̂_t (θ_t − θ_*) + α_{t+1} (γ_t − γ_*)^⊤ Σ_{ZY} + α_{t+1} D_t θ_* + α_{t+1} γ_t^⊤ ξ_{Z_t} γ_* (θ_t − θ_*)
  + α_{t+1} γ_t^⊤ ξ_{Z_t} γ_* θ_* + α_{t+1} γ_t^⊤ ξ_{Z_t Y_t} − α_{t+1} γ_t^⊤ Z_t ϵ_{2,t}^⊤ θ_t,   (12)

where

ξ_{Z_t} = Σ_Z − Z_t Z_t^⊤,   ξ_{Z_t Y_t} = Σ_{ZY} − Z_t Y_t,   Q̂_t := I − α_{t+1} γ_t^⊤ Σ_Z γ_*.

However, the matrix γ_t^⊤ Σ_Z γ_* may not be positive semi-definite, even if Σ_Z is positive definite. Thus
the negative eigenvalues of γ_t^⊤ Σ_Z γ_* might cause the θ_t iterates to first diverge, before eventually
converging as γ_t gets closer to γ_*. We illustrate this intuition in a simple experiment in Figure 1.
To resolve this issue, we propose Algorithm 2, where we replace g(θ_t; X_t) = X_t^⊤ θ_t in (11) with the
first-stage prediction Z_t^⊤ γ_t θ_t.
[Figure 1 plot: log(∥θ_t − θ_*∥^2) vs. log(t) for OTSG-IVaR and CSO -- Eq. (11).]

Figure 1: The updates in (11) can initially diverge before eventually converging, leading to worse performance
in practical settings compared to Algorithm 2. See Appendix C.2 for the experimental setup.

Algorithm 2 One-Sample Two-stage Stochastic Gradient-IVaR (OTSG-IVaR)
Input: stepsizes {α_t}_t, {β_t}_t, initial iterates γ_1, θ_1.
1: for t = 1, 2, ... do
2:   Sample Z_t, sample X_t from P_{X|Z_t}, and sample Y_t from P_{Y|X_t}.
3:   Update
       θ_{t+1} = θ_t − α_{t+1} γ_t^⊤ Z_t (Z_t^⊤ γ_t θ_t − Y_t),   (13)
       γ_{t+1} = γ_t − β_{t+1} Z_t (Z_t^⊤ γ_t − X_t^⊤).   (14)
4: end for
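The following is a minimal Python sketch of Algorithm 2; `one_sample_oracle` is an illustrative name for a routine returning one (Z_t, X_t, Y_t) triple, and the step-size schedules follow Theorem 2 below with the constants C_α, C_β left as inputs.

```python
import numpy as np

def otsg_ivar(one_sample_oracle, d_x, d_z, T, C_alpha, C_beta, iota=0.1):
    """Sketch of OTSG-IVaR (Algorithm 2) with the Theorem 2 schedules
    alpha_t = C_alpha * t^(-1 + iota/2), beta_t = C_beta * t^(-1 + iota/2).

    `one_sample_oracle` is a hypothetical callable returning one (Z, X, Y)."""
    theta = np.zeros(d_x)         # theta_1
    gamma = np.zeros((d_z, d_x))  # gamma_1
    for t in range(1, T + 1):
        Z, X, Y = one_sample_oracle()
        alpha = C_alpha * t ** (-1.0 + iota / 2)
        beta = C_beta * t ** (-1.0 + iota / 2)
        # (13): both occurrences of X_t are replaced by the prediction gamma_t^T Z_t
        theta = theta - alpha * (gamma.T @ Z) * (Z @ gamma @ theta - Y)
        # (14): first-stage SGD step for gamma
        gamma = gamma - beta * np.outer(Z, Z @ gamma - X)
    return theta, gamma
```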

With such a modification, in the corresponding decomposition of θ_{t+1} − θ_* (see (40)), we have
Q̂_t = I − α_{t+1} γ_t^⊤ Σ_Z γ_t, where the matrix product γ_t^⊤ Σ_Z γ_t is always positive semi-definite. Hence,
with a properly chosen stepsize α_t we can quantify the convergence of θ_t to θ_* non-asymptotically. Nevertheless,
assuming a warm-start condition on θ_0, we also show the convergence of (11) in Appendix E.3,
for completeness.

Algorithm and Analysis. Based on this intuition, we present Algorithm 2. One can interpret the
algorithm as an SGD analogue of the offline 2SLS algorithm (Angrist and Imbens, 1995). It is also related
to the framework of non-linear two-stage stochastic approximation algorithms (Doan and Romberg, 2020;
Dalal et al., 2018; Mokkadem and Pelletier, 2006), albeit the updates of θ_t and γ_t are coupled, since both
updates use Z_t. Furthermore, the dependence between the randomness in the two stages of the IVaR
problem makes the analysis significantly different from, and more challenging than, the classical analysis of two-stage
algorithms (see below Theorem 2 for additional details). Finally, while Algorithm 2 is designed for
linear models, the intuition behind the method is also applicable to non-linear models (i.e., between Z and
X, and between X and Y). We focus on linear models in this work in order to derive our theoretical results; a
detailed treatment of the nonlinear case (for which the analysis is significantly nontrivial) is left for future
work. We make the following additional assumptions for the convergence analysis of Algorithm 2.

Assumption 3.1. For some constants C_z, C_zy > 0, we have the following bounds on the fourth moments:

E[∥Σ_Z − ZZ^⊤∥^4] ≤ C_z d_z^{ϑ_5},   E[∥Σ_{ZY} − ZY∥^4] ≤ C_zy d_z^{ϑ_6},   ϑ := max{ϑ_5, ϑ_6}.   (15)

Assumption 3.2. There exist constants 0 < μ_Z ≤ λ_Z < ∞ such that μ_Z I_{d_z} ⪯ Σ_Z ⪯ λ_Z I_{d_z}.

The above conditions are rather mild moment conditions, similar to Assumption 2.3, and can be easily
verified for the linear model setting we consider.
Assumption 3.3. The sequence {γ_t}_t lies within a compact set of diameter C_γ d_z^κ for some constants C_γ > 0 and κ ≥ 0.

We emphasize that Assumption 3.3 concerns only the uncoupled sequence γ_t, which is an SGD sequence
for solving a strongly convex problem. It holds easily in various cases, for example by projecting the
iterates onto a compact set or a sufficiently large ball containing γ_*. It is also well known that, without
any projection operations, the sequence {γ_t}_t is almost surely bounded (Polyak and Juditsky, 1992) under our
assumptions. Finally, similar assumptions routinely appear in the analysis of SGD algorithms in various
related settings; see, for example, Tseng (1998); Gurbuzbalaban et al. (2019); Haochen and Sra (2019);
Nagaraj et al. (2019); Ahn et al. (2020); Rajput et al. (2020).
We now present our result on the convergence of {θ_t}_t in Theorem 2 below (see Appendix E.1 for the
proof). In comparison to Theorem 1 (regarding Algorithm 1), we highlight that Theorem 2 provides an
any-time guarantee, as the total number of iterations is not required in advance by Algorithm 2.

Theorem 2. Suppose Assumptions 2.2, 2.4 (without X_t′), 3.1, 3.2, and 3.3 hold. In Algorithm 2, for any
ι > 0, set α_t = C_α t^{−1+ι/2} and β_t = C_β t^{−1+ι/2}, where C_α = min{0.5 d_z^{−4κ−ϑ/2} λ_Z^{−1} C_γ^{−2}, 0.5 (∥γ_*∥ λ_Z)^{−2}}
and C_β = μ^2 d_z^{−1−2κ}/128. Then, we have

E[∥θ_t − θ_*∥^2] = O(1/t^{1−ι}).
Remark 1. In Theorem 2, we present the step-size choices yielding the fastest rate of convergence. In the proof
of Theorem 2 (see Appendix E.1), we show that convergence can be guaranteed for a range of step-sizes
given by α_t = C_α t^{−a}, β_t = C_β t^{−b}, where 1/2 < a, b < 1 and b > 2 − 2a, with the corresponding rate being
E[∥θ_t − θ_*∥^2] = O(max{t^{−b(2 − (1 − ι/2)^{−1})}, t^{−a} log(2/ι − 1)}). In particular, one requires a, b < 1 to
ensure (α_t − α_{t+1})/α_t = o(α_t) and (β_t − β_{t+1})/β_t = o(β_t), as is standard in the stochastic approximation
literature (see, for example, Chen et al. (2020); Polyak and Juditsky (1992)).
Proof Techniques. The major challenge in the convergence analysis of {θ_t}_t lies in the interaction
term γ_t^⊤ Z_t Z_t^⊤ γ_t θ_t between γ_t and θ_t in (13). This multiplicative interaction leads to an involved
dependence between the noise in the stochastic gradient updates of the two stages. Such a dependence
has not been considered in existing analyses of non-linear two-time-scale algorithms (Mokkadem and Pelletier,
2006; Maei et al., 2009; Dalal et al., 2018; Doan and Romberg, 2020; Xu and Liang, 2021; Wang
et al., 2021; Doan, 2022). In addition, Doan (2022) considers the case when the noise sequence is not only
independent across iterations but also independent of the iterate locations. Furthermore, they assume (see their
Assumption 3) that the condition in Assumption 2.2 holds for all γ, whereas our Assumption 2.2 needs to
hold only for γ_*, which is much milder. Similarly, many works (for example, Assumption 1 in Wang et al. (2021),
Assumption 2 in Xu and Liang (2021), and Theorem 2 in Maei et al. (2009)) assume that the iterates of both
stages are bounded in a compact set, and consequently that the variance of the stochastic gradients is
also uniformly bounded.
In our setting, firstly, the stochastic gradient in (13), evaluated at (θ_t, γ_t), is biased:

E_{t,Z_t}[γ_t^⊤ Z_t (Z_t^⊤ γ_t θ_t − Y_t)] = E_{t,Z_t}[γ_t^⊤ Z_t (Z_t^⊤ γ_t θ_t − Z_t^⊤ γ_* θ_*)] = E_t[γ_t^⊤ Σ_Z (γ_t θ_t − γ_* θ_*)]
  = γ_t^⊤ Σ_Z γ_t (θ_t − θ_*) + γ_t^⊤ Σ_Z (γ_t − γ_*) θ_* ≠ γ_*^⊤ Σ_Z γ_* (θ_t − θ_*) = ∇_θ F(θ_t).
[Figure 2: eight panels (a)-(h) plotting ∥θ_t − θ_*∥^2 vs. t for TOSG-IVaR.]

Figure 2: E[∥θ_t − θ_*∥^2] of Algorithm 1 under the different settings detailed in Section 4.

Furthermore, even under Assumption 3.3, the variance of the stochastic gradient in (13) is not uniformly
bounded. Overcoming these issues, in addition to the aforementioned dependence between the noise in
the stochastic gradient updates of the two stages, forms the major novelty of our analysis. We proceed
by noting that if γ_*, Σ_Z, and Σ_{ZY} were known beforehand, one could conduct deterministic gradient updates,
i.e., θ̃_{t+1} = θ̃_t − α_{t+1}(γ_*^⊤ Σ_Z γ_* θ̃_t − γ_*^⊤ Σ_{ZY}), to obtain θ_*. By standard results on gradient descent for
strongly convex functions (see, for example, Nesterov (2013)), {θ̃_t}_t converges exponentially fast, as stated
in Lemma 4. Hence, it remains to show that the trajectory of θ_t converges to the trajectory of θ̃_t. That is,
defining the sequence δ_t := θ_t − θ̃_t, our goal is to establish the convergence rate of E[∥δ_t∥^2]. We first provide
an intermediate bound (see Lemma 6) and then progressively sharpen it to a tighter bound (see Lemma 7).
In doing so, it is also required to show that E[∥θ_t∥^4] is bounded, which we prove in Lemma 5. The proof
of Lemma 5 is non-trivial and requires carefully chosen stepsizes satisfying Σ_{t=1}^∞ (α_t^2 + α_t √β_t) < ∞.

4 Numerical Experiments
Experiments for Algorithm 1 (TOSG-IVaR). We first consider the following problem, in which (Z, X, Y)
is generated via

Z ∼ N(0, I_{d_z}),   X = ϕ(γ_*^⊤ Z) + c·(h + ϵ_x),   Y = θ_*^⊤ X + c·(h_1 + ϵ_y),

where c > 0 is a scalar controlling the variance of the noise, and h_1 is the first coordinate of h. The
noise vectors (or scalars) h, ϵ_x, ϵ_y are independent of Z, and we have h ∼ N(1_{d_x}, I_{d_x}), ϵ_x ∼ N(0, I_{d_x}), and
ϵ_y ∼ N(0, 1). In each iteration, one tuple (X, X′, Y) is generated and used to update θ_t according to Algorithm
1. We set (d_x, d_z) ∈ {(4, 8), (8, 16)}, c ∈ {0.1, 1.0}, and ϕ(s) ∈ {s, s^2}. We repeat each setting 50 times
and report the curves of E[∥θ_t − θ_*∥^2] in Figure 2, where the expectation is computed as the average of
∥θ_t − θ_*∥^2 over all trials, and the shaded region represents the standard deviation. The first and second
rows correspond to ϕ(s) = s and ϕ(s) = s^2, respectively. Here, c = 0.1 for odd columns and c = 1.0 for
even columns, and (d_x, d_z) = (4, 8) for the first two columns and (d_x, d_z) = (8, 16) for the last two.
Empirically, we observe that Algorithm 1 performs well across all settings.
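As a sketch of the data-generating oracle used above (under our reading that X′ must be drawn with fresh confounder and noise draws so that X and X′ are conditionally independent given Z, while Y shares the confounder h with X), one could write:

```python
import numpy as np

def make_two_sample_oracle(theta_star, gamma_star, phi, c, rng):
    """Generates one (X, X', Y) tuple per call for the model above: X and X'
    share the same Z but use independent draws of (h, eps_x), so they are
    conditionally independent given Z; h_1 confounds both X and Y."""
    d_z, d_x = gamma_star.shape
    def oracle():
        Z = rng.standard_normal(d_z)
        mean = phi(gamma_star.T @ Z)
        h = 1.0 + rng.standard_normal(d_x)        # h ~ N(1_{d_x}, I)
        X = mean + c * (h + rng.standard_normal(d_x))
        h_prime = 1.0 + rng.standard_normal(d_x)  # fresh confounder for X'
        X_prime = mean + c * (h_prime + rng.standard_normal(d_x))
        Y = theta_star @ X + c * (h[0] + rng.standard_normal())
        return X, X_prime, Y
    # e.g., pass this oracle to the tosg_ivar sketch from Section 2
    return oracle
```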

[Figure 3: eight panels (a)-(h) plotting log(∥θ_t − θ_*∥^2) vs. log(t) for OTSG-IVaR, CSO -- Eq. (11), and [DVB23].]

Figure 3: Comparison of E[∥θ_t − θ_*∥^2] (log-log scale) for Algorithm 2, Eq. (11), and Della Vecchia and Basu
(2024).

Experiments for Algorithm 2 (OTSG-IVaR). Next, we compare our Algorithm 2, its variant, and
Algorithm 1 of Della Vecchia and Basu (2024). We write "OTSG-IVaR", "CSO -- Eq. (11)", and "[DVB23]"
to denote Algorithm 2, Algorithm 2 with the updates replaced by (11), and Algorithm 1 of Della Vecchia
and Basu (2024) (see Appendix A), respectively. We follow simulation settings similar to Della Vecchia and Basu (2024):

Y = θ_*^⊤ X + ν,   X = γ_*^⊤ Z + ϵ,   ϵ = σ_ϵ N(0, I_{d_x}),   ν = ρϵ_1 + N(0, 0.25),   (16)

where ϵ_1 is the first coordinate of ϵ, θ_* ∈ R^{d_x} is a unit vector chosen uniformly at random, and γ_* ∈ R^{d_z×d_x}
with γ_{ij} = 0 for i ≠ j and γ_{ij} = 1 for i = j, i = 1, 2, ..., d_x, j = 1, 2, ..., d_z. Here ρ controls
the level of endogeneity in the model. We compare the performance of Algorithm 2 with (11) and O2SLS
(Della Vecchia and Basu, 2024) for ρ = 1, 4 and σ_ϵ = 0.5, 1. By varying σ_ϵ we control the correlation
between X and Z. We consider two settings, (d_x, d_z) = (1, 1) and (d_x, d_z) = (8, 16). As a performance
metric, in Figure 3 we plot E[∥θ_t − θ_*∥^2], where E[·] is approximated by averaging over 50 trials,
and both axes are in log scale (base 10). We also show, in Figure 4, the convergence of the test Mean
Squared Error (MSE) evaluated over 400 test samples to the best possible test MSE, where θ_* and γ_* are
known beforehand. For Figures 3 and 4, the first and second rows correspond to (d_x, d_z) = (1, 1) and
(d_x, d_z) = (8, 16), respectively; σ_ϵ = 0.5 in odd columns and σ_ϵ = 1.0 in even columns; and
ρ = 1.0 for the first two columns and ρ = 4.0 for the last two. We observe that O2SLS has
much larger variance across settings, while our algorithms perform consistently well in all settings.

5 Conclusion
We presented streaming algorithms for least-squares IVaR based on directly solving the associated conditional
stochastic optimization formulation in (2). Our algorithms have several benefits, including the avoidance
of mini-batches and matrix inversions. We show that the expected rates of convergence of the proposed
algorithms are of order O(log T/T) and O(1/T^{1−ι}), for any ι > 0, under the availability of two-sample
and one-sample oracles, respectively. As future work, it is interesting to develop streaming inferential methods
for IVaR. Leveraging related works for vanilla SGD (Polyak and Juditsky, 1992; Anastasiou et al.,
2019; Shao and Zhang, 2022; Chen et al., 2020; Zhu et al., 2023) in the setting of Algorithms 1 and 2 provides
a concrete direction to establish Central Limit Theorems and to develop limiting covariance estimation
procedures.
[Figure 4: eight panels (a)-(h) plotting log(Test MSE) vs. log(t) for OTSG-IVaR, CSO -- Eq. (11), [DVB23], and the offline baseline.]

Figure 4: Comparison of test MSE (log-log scale) for Algorithm 2, Eq. (11), and Della Vecchia and Basu
(2024).


References
K. Ahn, C. Yun, and S. Sra. SGD with shuffling: optimal rates without component convexity and large
epoch requirements. Advances in Neural Information Processing Systems, 33:17526–17535, 2020. (Cited
on page 9.)

A. Anastasiou, K. Balasubramanian, and M. A. Erdogdu. Normal approximation for stochastic gradient
descent via non-asymptotic rates of martingale CLT. In Conference on Learning Theory, pages 115–137.
PMLR, 2019. (Cited on page 11.)

J. D. Angrist and G. W. Imbens. Two-stage least squares estimation of average causal effects in models with
variable treatment intensity. Journal of the American statistical Association, 90(430):431–442, 1995.
(Cited on pages 2 and 8.)

J. D. Angrist and J.-S. Pischke. Mostly harmless econometrics: An empiricist’s companion. Princeton
university press, 2009. (Cited on pages 2, 3, and 4.)

A. Babii and J.-P. Florens. Is completeness necessary? estimation in nonidentified linear models. arXiv
preprint arXiv:1709.03473, 2017. (Cited on page 3.)

K. Balasubramanian, S. Ghadimi, and A. Nguyen. Stochastic multilevel composition optimization algorithms
with level-independent convergence rates. SIAM Journal on Optimization, 32(2):519–544, 2022.
(Cited on page 3.)

A. Bennett, N. Kallus, and T. Schnabel. Deep generalized method of moments for instrumental variable
analysis. Advances in neural information processing systems, 32, 2019. (Cited on pages 2, 3, and 4.)

A. Bennett, N. Kallus, X. Mao, W. Newey, V. Syrgkanis, and M. Uehara. Minimax instrumental variable
regression and L2 convergence guarantees without identification or closedness. arXiv preprint
arXiv:2302.05404, 2023. (Cited on pages 2, 3, and 4.)

M. Carrasco, J.-P. Florens, and E. Renault. Linear inverse problems in structural econometrics estimation
based on spectral decomposition and regularization. Handbook of econometrics, 6:5633–5751, 2007.
(Cited on page 3.)

S. Centorrino and J.-P. Florens. Nonparametric instrumental variable estimation of binary response models
with continuous endogenous regressors. Econometrics and Statistics, 17:35–63, 2021. (Cited on page 6.)

T. Chen, Y. Sun, and W. Yin. Solving stochastic compositional optimization is nearly as easy as solving
stochastic optimization. IEEE Transactions on Signal Processing, 69:4937–4948, 2021. (Cited on page 3.)

X. Chen and D. Pouzo. Estimation of nonparametric conditional moment models with possibly nonsmooth
generalized residuals. Econometrica, 80(1):277–321, 2012. (Cited on pages 3 and 4.)

X. Chen and M. Reiss. On rate optimality for ill-posed inverse problems in econometrics. Econometric
Theory, 27(3):497–521, 2011. (Cited on page 3.)

X. Chen, J. D. Lee, X. T. Tong, and Y. Zhang. Statistical inference for model parameters in stochastic
gradient descent. Annals of Statistics, 48(1):251–273, 2020. (Cited on pages 9, 11, and 21.)

X. Chen, S. Lee, Y. Liao, M. H. Seo, Y. Shin, and M. Song. SGMM: Stochastic approximation to generalized
method of moments. arXiv preprint arXiv:2308.13564, 2023. (Cited on page 4.)

Y. Cui, H. Pu, X. Shi, W. Miao, and E. Tchetgen Tchetgen. Semiparametric proximal causal inference.
Journal of the American Statistical Association, pages 1–12, 2023. (Cited on page 3.)

B. Dai, N. He, Y. Pan, B. Boots, and L. Song. Learning from conditional distributions via dual embeddings.
In Artificial Intelligence and Statistics, pages 1458–1467. PMLR, 2017. (Cited on page 2.)

G. Dalal, G. Thoppe, B. Szörényi, and S. Mannor. Finite sample analysis of two-timescale stochastic
approximation with applications to reinforcement learning. In Conference On Learning Theory, pages
1199–1233. PMLR, 2018. (Cited on pages 8 and 9.)

S. Darolles, Y. Fan, J.-P. Florens, and E. Renault. Nonparametric instrumental regression. Econometrica,
79(5):1541–1565, 2011. (Cited on page 2.)

R. Della Vecchia and D. Basu. Online instrumental variable regression: Regret analysis and bandit feedback.
arXiv preprint arXiv:2302.09357v1, 2023. (Cited on pages 17 and 18.)

R. Della Vecchia and D. Basu. Stochastic online instrumental variable regression: Regrets for endogeneity
and bandit feedback. arXiv preprint arXiv:2302.09357v3, 2024. (Cited on pages 4, 11, 12, 17, and 18.)

N. Dikkala, G. Lewis, L. Mackey, and V. Syrgkanis. Minimax estimation of conditional moment models.
Advances in Neural Information Processing Systems, 33:12248–12262, 2020. (Cited on pages 2, 3, and 4.)

T. Doan and J. Romberg. Finite-time performance of distributed two-time-scale stochastic approximation.
In Learning for Dynamics and Control, pages 26–36. PMLR, 2020. (Cited on pages 8 and 9.)

T. T. Doan. Nonlinear two-time-scale stochastic approximation convergence and finite-time performance.
IEEE Transactions on Automatic Control, 2022. (Cited on page 9.)

J. C. Duchi, A. Agarwal, M. Johansson, and M. I. Jordan. Ergodic mirror descent. SIAM Journal on
Optimization, 22(4):1549–1578, 2012. (Cited on page 5.)

Y. M. Ermoliev and V. I. Norkin. Sample average approximation method for compound stochastic optimiza-
tion problems. SIAM Journal on Optimization, 23(4):2231–2263, 2013. (Cited on page 3.)

M. Even. Stochastic gradient descent under Markovian sampling schemes. In International Conference on
Machine Learning, pages 9412–9439. PMLR, 2023. (Cited on page 5.)

S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming.
SIAM journal on optimization, 23(4):2341–2368, 2013. (Cited on page 7.)

S. Ghadimi, A. Ruszczynski, and M. Wang. A single timescale stochastic approximation method for nested
stochastic optimization. SIAM Journal on Optimization, 30(1):960–979, 2020. (Cited on page 3.)

M. Gurbuzbalaban, A. Ozdaglar, and P. A. Parrilo. Convergence rate of incremental gradient and incremental
newton methods. SIAM Journal on Optimization, 29(4):2542–2565, 2019. (Cited on page 9.)

P. Hall and J. L. Horowitz. Nonparametric methods for inference in the presence of instrumental variables.
Annals of statistics, 33(6):2904–2929, 2005. (Cited on page 2.)

J. Haochen and S. Sra. Random shuffling beats SGD after finite epochs. In International Conference on
Machine Learning, pages 2624–2633. PMLR, 2019. (Cited on page 9.)

J. Hartford, G. Lewis, K. Leyton-Brown, and M. Taddy. Deep iv: A flexible approach for counterfactual
prediction. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages
1414–1423. JMLR, 2017. (Cited on pages 2 and 3.)

Y. Hu, X. Chen, and N. He. Sample complexity of sample average approximation for conditional stochastic
optimization. SIAM Journal on Optimization, 30(3):2103–2133, 2020a. (Cited on page 3.)

Y. Hu, S. Zhang, X. Chen, and N. He. Biased stochastic first-order methods for conditional stochastic
optimization and applications in meta learning. Advances in Neural Information Processing Systems, 33:
2759–2770, 2020b. (Cited on pages 2 and 3.)

Y. Hu, X. Chen, and N. He. On the bias-variance-cost tradeoff of stochastic optimization. Advances in
Neural Information Processing Systems, 34:22119–22131, 2021. (Cited on page 6.)

Y. Hu, J. Wang, Y. Xie, A. Krause, and D. Kuhn. Contextual stochastic bilevel optimization. Advances in
Neural Information Processing Systems, 36, 2024. (Cited on page 3.)

A. Khaled and P. Richtárik. Better theory for SGD in the nonconvex world. arXiv preprint
arXiv:2002.03329, 2020. (Cited on page 6.)

G. Lan. First-order and stochastic optimization methods for machine learning, volume 1. Springer, 2020.
(Cited on page 6.)

G. Lewis and V. Syrgkanis. Adversarial generalized method of moments. arXiv preprint arXiv:1803.07164,
2018. (Cited on pages 2 and 4.)

L. Liao, Y.-L. Chen, Z. Yang, B. Dai, M. Kolar, and Z. Wang. Provably efficient neural estimation of struc-
tural equation models: An adversarial approach. Advances in Neural Information Processing Systems,
33:8947–8958, 2020. (Cited on pages 2, 3, and 4.)

H. Maei, C. Szepesvari, S. Bhatnagar, D. Precup, D. Silver, and R. S. Sutton. Convergent temporal-
difference learning with arbitrary smooth function approximation. Advances in neural information pro-
cessing systems, 22, 2009. (Cited on page 9.)

A. Mastouri, Y. Zhu, L. Gultchin, A. Korba, R. Silva, M. Kusner, A. Gretton, and K. Muandet. Proximal
causal learning with kernels: Two-stage estimation and moment restriction. In International conference
on machine learning, pages 7512–7523. PMLR, 2021. (Cited on page 3.)

A. Mokkadem and M. Pelletier. Convergence rate and averaging of nonlinear two-time-scale stochastic
approximation algorithms. Annals of Applied Probability, 16(3):1671–1702, 2006. (Cited on pages 8 and 9.)

K. Muandet, A. Mehrjou, S. K. Lee, and A. Raj. Dual instrumental variable regression. Advances in Neural
Information Processing Systems, 33:2710–2721, 2020. (Cited on pages 2, 3, 4, and 6.)

D. Nagaraj, P. Jain, and P. Netrapalli. SGD without replacement: Sharper rates for general smooth convex
functions. In International Conference on Machine Learning, pages 4703–4711. PMLR, 2019. (Cited on
page 9.)

Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science &
Business Media, 2013. (Cited on page 10.)

W. K. Newey and J. L. Powell. Instrumental variable estimation of nonparametric models. Econometrica,
71(5):1565–1578, 2003. (Cited on page 4.)

C. H. Papadimitriou. Computational complexity. In Encyclopedia of Computer Science, pages 260–265.
2003. (Cited on page 17.)

C. Peixoto, Y. Saporito, and Y. Fonseca. Nonparametric instrumental variable regression through stochastic
approximate gradients. arXiv preprint arXiv:2402.05639, 2024. (Cited on page 3.)

B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on
Control and Optimization, 30(4):838–855, 1992. (Cited on pages 9, 11, and 29.)

S. Rajput, A. Gupta, and D. Papailiopoulos. Closing the convergence gap of SGD without replacement. In
International Conference on Machine Learning, pages 7964–7973. PMLR, 2020. (Cited on page 9.)

O. Reiersøl. Confluence analysis by means of instrumental sets of variables. PhD thesis, Almqvist &
Wiksell, 1945. (Cited on page 3.)

A. Roy, K. Balasubramanian, and S. Ghadimi. Constrained stochastic nonconvex optimization with state-
dependent Markov data. Advances in Neural Information Processing Systems, 35:23256–23270, 2022.
(Cited on page 5.)

A. Ruszczynski. A stochastic subgradient method for nonsmooth nonconvex multilevel composition opti-
mization. SIAM Journal on Control and Optimization, 59(3):2301–2320, 2021. (Cited on page 3.)

S. Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends® in
Machine Learning, 4(2):107–194, 2012. (Cited on page 4.)

Q.-M. Shao and Z.-S. Zhang. Berry–esseen bounds for multivariate nonlinear statistics with applications
to m-estimators and stochastic gradient descent algorithms. Bernoulli, 28(3):1548–1576, 2022. (Cited on
page 11.)

R. Singh, M. Sahani, and A. Gretton. Kernel instrumental variable regression. Advances in Neural Infor-
mation Processing Systems, 32, 2019. (Cited on page 3.)

T. Sun, Y. Sun, and W. Yin. On Markov chain gradient descent. Advances in neural information processing
systems, 31, 2018. (Cited on page 5.)

P. Tseng. An incremental gradient (-projection) method with momentum term and adaptive stepsize rule.
SIAM Journal on Optimization, 8(2):506–531, 1998. (Cited on page 9.)

A. Venkatraman, W. Sun, M. Hebert, J. Bagnell, and B. Boots. Online instrumental variable regression with
applications to online linear system identification. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 30, 2016. (Cited on page 4.)

M. Wang, E. X. Fang, and B. Liu. Stochastic compositional gradient descent: Algorithms for minimizing
compositions of expected-value functions. Mathematical Programming, 161(1-2):419–449, 2017. (Cited
on page 3.)

Y. Wang, S. Zou, and Y. Zhou. Non-asymptotic analysis for two time-scale tdc with general smooth function
approximation. Advances in Neural Information Processing Systems, 34:9747–9758, 2021. (Cited on
page 9.)

P. G. Wright. The tariff on animal and vegetable oils. Number 26. Macmillan, 1928. (Cited on page 3.)

L. Xu, Y. Chen, S. Srinivasan, N. de Freitas, A. Doucet, and A. Gretton. Learning deep features in in-
strumental variable regression. In International Conference on Learning Representations, 2021. URL
https://openreview.net/forum?id=sy4Kg_ZQmS7. (Cited on page 3.)

T. Xu and Y. Liang. Sample complexity bounds for two timescale value-based reinforcement learning
algorithms. In International Conference on Artificial Intelligence and Statistics, pages 811–819. PMLR,
2021. (Cited on page 9.)

J. Zhang and L. Xiao. Multilevel composite stochastic optimization via nested variance reduction. SIAM
Journal on Optimization, 31(2):1131–1157, 2021. (Cited on page 3.)

W. Zhu, X. Chen, and W. B. Wu. Online covariance matrix estimation in stochastic gradient descent. Journal
of the American Statistical Association, 118(541):393–404, 2023. (Cited on page 11.)

Y. Zhu, L. Gultchin, A. Gretton, M. J. Kusner, and R. Silva. Causal inference with treatment measurement
error: a nonparametric instrumental variable approach. In Uncertainty in Artificial Intelligence, pages
2414–2424. PMLR, 2022. (Cited on page 3.)

A Online updates of Della Vecchia and Basu (2024)

For the sake of clarity, we present the O2SLS algorithm proposed in (Della Vecchia and Basu, 2024, v3)¹ in
the streaming format, without any explicit matrix inversions, as used in our experiments:

θ_{t+1} = (I − U_t γ_t^⊤ Z_t Z_t^⊤ γ_t) θ_t + U_t γ_t^⊤ Z_t Y_t,
γ_{t+1} = (I − V_t Z_t Z_t^⊤) γ_t + V_t Z_t X_t^⊤,
U_{t+1} = U_t − (U_t γ_t^⊤ Z_t Z_t^⊤ γ_t U_t)/(1 + Z_t^⊤ γ_t U_t γ_t^⊤ Z_t),
V_{t+1} = V_t − (V_t Z_t Z_t^⊤ V_t)/(1 + Z_t^⊤ V_t Z_t),   V_0 = λ^{−1} I_{d_z},

where U_t and V_t are two additional matrix sequences that track the inverses of Σ_{i=1}^t γ_i^⊤ Z_i Z_i^⊤ γ_i
and (λ I_{d_z} + Σ_{i=1}^t Z_i Z_i^⊤), respectively, for a user-defined parameter λ. As mentioned in Della Vecchia
and Basu (2024), we choose λ = 0.1. The major difference between O2SLS and Algorithm 2 is that
O2SLS takes an online two-stage regression approach to minimize a suitably defined regret, whereas we
take a conditional stochastic optimization point of view, which requires carefully chosen step-sizes. In our
Algorithm 2, we need neither explicit nor implicit matrix inversions, which can potentially cause stability
issues. Furthermore, unlike Della Vecchia and Basu (2024), we neither assume that Σ_{i=1}^t Z_i Z_i^⊤ is invertible for
all t, nor that Z is a bounded random variable. Finally, the per-iteration
computational complexity and memory requirement of Algorithm 2 are significantly better than those of O2SLS; see
Section B.

¹Note that the streaming algorithm was not present in version 1, i.e., (Della Vecchia and Basu, 2023, v1).
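For reference, one step of the updates above can be implemented as follows; this is a sketch of our reading of the display, where we additionally assume U_0 = λ^{−1} I_{d_x} by analogy with V_0 (the initialization of U_0 is not specified above).

```python
import numpy as np

def o2sls_step(theta, gamma, U, V, Z, X, Y):
    """One streaming O2SLS iteration via rank-one (Sherman-Morrison) updates
    of the running inverses U_t and V_t, matching the display above."""
    g = gamma.T @ Z                                 # gamma_t^T Z_t
    theta = theta + (U @ g) * (Y - g @ theta)       # theta_{t+1}
    gamma = gamma + np.outer(V @ Z, X - Z @ gamma)  # gamma_{t+1}
    Ug = U @ g
    U = U - np.outer(Ug, Ug) / (1.0 + g @ Ug)       # U_{t+1}
    VZ = V @ Z
    V = V - np.outer(VZ, VZ) / (1.0 + Z @ VZ)       # V_{t+1}
    return theta, gamma, U, V
```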

B Per-iteration Complexities

For the linear case, i.e., when the underlying relationships between Z and X and between X and Y are linear,
Table 1 summarizes the per-iteration memory costs and numbers of arithmetic operations, at the t-th iteration,
of the original O2SLS (Della Vecchia and Basu, 2023), the updated O2SLS (Della Vecchia and Basu, 2024) for which we provide
a matrix-form update in Appendix A, TOSG-IVaR (Alg 1), and OTSG-IVaR (Alg 2).
Notice that the original version of O2SLS (Della Vecchia and Basu, 2023) has per-iteration and memory
costs that depend on the iteration number t, as it needs to use all samples accumulated up to iteration
t to conduct an offline 2SLS at each iteration. The updated O2SLS (Della Vecchia and Basu, 2024) (the
algorithm that we compare to) uses only the samples obtained at iteration t to perform the update. Although the
updated O2SLS avoids explicit matrix inversion, its arithmetic operations and memory cost
per iteration are clearly larger than those of our TOSG-IVaR and OTSG-IVaR.
We highlight that TOSG-IVaR, which uses two samples X and X′ from the conditional distribution
P(X | Z), requires only O(d_x) memory and arithmetic operations at each iteration.
For a fair comparison, we assume that multiplying two n × n matrices admits an O(n^3) complexity,
i.e., using normal textbook matrix multiplication, and that inverting an n × n matrix also admits an O(n^3)
complexity. Interested readers may refer to Papadimitriou (2003) for more details on faster algorithms with
better complexities for matrix operations.

Table 1: Memory cost and number of arithmetic operations at iteration t.

Algorithm                                          | Memory cost                  | Arithmetic operations
O2SLS (Della Vecchia and Basu, 2023, v1)           | t(d_x + d_z) + d_z d_x + d_x | O(d_x^3 + t d_x^2 + t d_x d_z)
O2SLS (Della Vecchia and Basu, 2024, v3) (Sec. A)  | d_x^2 + d_z^2 + d_z d_x + d_x| O(d_x^2 + d_z^2 + d_z d_x)
TOSG-IVaR (our Alg 1)                              | d_x                          | O(d_x)
OTSG-IVaR (our Alg 2)                              | d_x d_z + d_x                | O(min(d_x^2, d_z^2) + d_z d_x)

C Experimental Details
C.1 Compute Resources
All experiments in Section 4 were conducted on a computer with an 11th-generation Intel(R) Core(TM) i7-11370H
CPU. The time and space required to run our experiments are negligible, and we anticipate they can be
reproduced on almost any computer.

C.2 Experimental Details for Figure 1


In Figure 1, we show an example where the updates (11) may first diverge before eventually converging, and
the finite-time performance can be much worse compared to Algorithm 2. For this experiment, we choose the
model presented in (16) with d_x = d_z = 1, θ_* = 1, γ_* = −1, ρ = 4, and σ_ϵ = 1. When initialized at
γ_0 = 10 and θ_0 = 0, the updates in (11) keep diverging rapidly at first, whereas Algorithm 2 is much more
stable. By the end of 100,000 iterations, while Algorithm 2 achieves an error of ≈ 10^{−5}, (11) achieves
≈ 10^4, which is worse than at initialization, because (11) has not yet recovered from the initial divergence
phase. However, once (11) starts converging, its convergence rate is similar to that of Algorithm 2, as
one can see from Figure 1 (also see our discussion on the convergence of (11) in Section E.3).
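A self-contained script reproducing this comparison qualitatively is sketched below; the step-size constants are illustrative stand-ins rather than the exact constants prescribed by Theorem 2, so the final error magnitudes will differ from those reported above.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star, gamma_star, rho, sigma_eps = 1.0, -1.0, 4.0, 1.0
th_cso = th_alg2 = 0.0   # theta_0 = 0 for both methods
g_cso = g_alg2 = 10.0    # gamma_0 = 10, far from gamma_* = -1
for t in range(1, 100_001):
    Z = rng.standard_normal()
    eps = sigma_eps * rng.standard_normal()
    X = gamma_star * Z + eps
    Y = theta_star * X + rho * eps + 0.5 * rng.standard_normal()  # nu
    alpha = beta = 0.05 * t ** -0.75  # illustrative schedules
    # updates (11): while g_cso * gamma_* < 0, th_cso tends to diverge
    th_cso -= alpha * g_cso * Z * (X * th_cso - Y)
    g_cso -= beta * Z * (Z * g_cso - X)
    # Algorithm 2: uses the prediction Z * gamma_t in place of X
    th_alg2 -= alpha * g_alg2 * Z * (Z * g_alg2 * th_alg2 - Y)
    g_alg2 -= beta * Z * (Z * g_alg2 - X)
print(f"(11): {abs(th_cso - theta_star):.2e}, Alg 2: {abs(th_alg2 - theta_star):.2e}")
```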

D Proofs for Section 2


D.1 Proof of Theorem 1

Proof. We aim to find the optimal θ_*. According to (2), we know

E_Z[E_{X|Z}[X] · E_{X|Z}[X]^⊤] θ_* = E_Z[E_{Y|Z}[Y] · E_{X|Z}[X]].   (17)

The updates in Algorithm 1 can be written as

θ_{t+1} = θ_t − α_{t+1} (X_t^⊤ θ_t − Y_t) X_t′.

Hence we have

θ_{t+1} − θ_* = θ_t − α_{t+1} E_Z[E_{X|Z}[X] · E_{X|Z}[X]^⊤] θ_t + α_{t+1} E_Z[E_{Y|Z}[Y] · E_{X|Z}[X]] − θ_*
  − α_{t+1} (X_t′ X_t^⊤ − E_Z[E_{X|Z}[X] · E_{X|Z}[X]^⊤]) θ_t + α_{t+1} (Y_t X_t′ − E_Z[E_{Y|Z}[Y] · E_{X|Z}[X]]).   (18)

Now we analyze the convergence and variance separately. For the convergence part, we have
h i h i
θt − αt+1 EZ EX|Z [X] · EX|Z [X]⊤ θt + αt+1 EZ EY |Z [Y ] · EX|Z [X] − θ∗
 h i
= I − αt+1 EZ EX|Z [X] · EX|Z [X]⊤ (θt − θ∗ ). (19)

For the variance part we have

E[ ∥Xt′Xt⊤ − EZ[ EX|Z[X] · EX|Z[X]⊤ ]∥² ]
  = E[ ∥Xt′Xt⊤ − EX|Zt[X] · EX|Zt[X]⊤ + EX|Zt[X] · EX|Zt[X]⊤ − EZ[ EX|Z[X] · EX|Z[X]⊤ ]∥² ]
  ≤ 2E[ ∥Xt′Xt⊤ − EX|Zt[X] · EX|Zt[X]⊤∥² + ∥EX|Zt[X] · EX|Zt[X]⊤ − EZ[ EX|Z[X] · EX|Z[X]⊤ ]∥² ]
  ≤ 2Cx dx^{ϑ1} + 2Cxx dz^{ϑ3} =: σ1².   (20)

Similarly, we have

E[ ∥YtXt′ − EZ[ EY|Z[Y] · EX|Z[X] ]∥² ]
  = E[ ∥YtXt′ − EY|Zt[Y] · EX|Zt[X]∥² + ∥EY|Zt[Y] · EX|Zt[X] − EZ[ EY|Z[Y] · EX|Z[X] ]∥² ]
  ≤ Cy dx^{ϑ2} + Cyx dz^{ϑ4} =: σ2².   (21)
Now we know from (18), (19), (20), and (21) that

∥θt+1 − θ∗∥² = ∥At∥² + 2αt+1⟨At, Bt⟩ + αt+1²∥Bt∥²,   (22)

where

At = ( I − αt+1 EZ[ EX|Z[X] · EX|Z[X]⊤ ] ) (θt − θ∗),
Bt = −( Xt′Xt⊤ − EZ[ EX|Z[X] · EX|Z[X]⊤ ] ) θt + ( YtXt′ − EZ[ EY|Z[Y] · EX|Z[X] ] ).

This implies

E_{θt+1|θt}[ ∥θt+1 − θ∗∥² ]
  = ∥( I − αt+1 EZ[ EX|Z[X] · EX|Z[X]⊤ ] )(θt − θ∗)∥²
    + αt+1² E_{Xt,Xt′,Yt,Zt|θt}[ ∥( Xt′Xt⊤ − EZ[ EX|Z[X] · EX|Z[X]⊤ ] )θt − ( YtXt′ − EZ[ EY|Z[Y] · EX|Z[X] ] )∥² ]
  ≤ (1 − αt+1µ)²∥θt − θ∗∥² + 3αt+1²( σ1²∥θt − θ∗∥² + σ1²∥θ∗∥² + σ2²∥θ∗∥² )
  ≤ ( (1 − αt+1µ)² + 3αt+1²σ1² )∥θt − θ∗∥² + 3αt+1²σ1²∥θ∗∥² + 3αt+1²σ2²∥θ∗∥²,   (23)

where the first inequality uses the Cauchy–Schwarz inequality, the definitions of σ1 and σ2, and Assumption 2.3.
Choosing αt+1 such that

( (1 − αt+1µ)² + 3αt+1²σ1² ) ≤ 1 − αt+1µ  ⇔  αt+1 ≤ µ/(µ² + 3σ1²),

and taking expectation on both sides of (23), we have

E[ ∥θt+1 − θ∗∥² ] ≤ (1 − αt+1µ) E[ ∥θt − θ∗∥² ] + 3αt+1²σ1²∥θ∗∥² + 3αt+1²σ2²∥θ∗∥².

Now, we use the following result.

Lemma 2. Suppose we have three sequences {at}_{t=0}^∞, {bt}_{t=0}^∞, {rt}_{t=0}^∞ satisfying

at+1 ≤ rt at + bt,  rt > 0,   (24)

for any t ≥ 0. Defining Rt+1 = ∏_{i=0}^t ri, we have

at+1 ≤ Rt+1 a0 + Σ_{i=0}^t (Rt+1/Ri+1) bi.

By Lemma 2, we know

E[ ∥θt+1 − θ∗∥² ] ≤ ∏_{i=0}^t (1 − αiµ) E[ ∥θ0 − θ∗∥² ] + ( 3σ1²∥θ∗∥² + 3σ2²∥θ∗∥² ) Σ_{i=0}^t αi² ∏_{j=i+1}^t (1 − αjµ).

Now if we set αi = α, we know

E[ ∥θt − θ∗∥² ] ≤ (1 − αµ)^t E[ ∥θ0 − θ∗∥² ] + α²( Σ_{i=0}^t (1 − αµ)^i )( 3σ1²∥θ∗∥² + 3σ2²∥θ∗∥² )
  ≤ e^{−tαµ} E[ ∥θ0 − θ∗∥² ] + (α/µ)( 3σ1²∥θ∗∥² + 3σ2²∥θ∗∥² ).

Choosing α and T such that α = log T/(µT) ≤ µ/(µ² + 3σ1²), we know

E[ ∥θT − θ∗∥² ] ≤ E[ ∥θ0 − θ∗∥² ]/T + 3∥θ∗∥²(σ1² + σ2²) log T/(µ²T).

D.2 Proof of Lemma 2


Proof. We notice from (24) that for any 0 ≤ i ≤ t, we have

ai+1/Ri+1 ≤ ai/Ri + bi/Ri+1.

Taking summation on both sides, we have

at+1/Rt+1 ≤ a0/R0 + Σ_{i=0}^t bi/Ri+1,

which completes the proof after multiplying both sides by Rt+1.
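As a quick sanity check of Lemma 2 (running the recursion at equality, its worst case), one can verify the bound numerically; this snippet is a verification sketch, not part of any algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
r = rng.uniform(0.5, 1.5, size=T)        # arbitrary positive r_t
b = rng.uniform(0.0, 1.0, size=T)
a = np.empty(T + 1)
a[0] = 1.0
for t in range(T):
    a[t + 1] = r[t] * a[t] + b[t]        # recursion (24) taken at equality

R = np.cumprod(r)                        # R[t] = R_{t+1} = r_0 * ... * r_t
bound = R * a[0] + R * np.cumsum(b / R)  # R_{t+1} a_0 + sum_i (R_{t+1}/R_{i+1}) b_i
assert np.allclose(a[1:], bound)         # the bound is tight in the equality case
```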

D.3 Proof of Lemma 1


Proof. We first notice that Assumption 2.2 holds since

EZ[ EX|Z[X] · EX|Z[X]⊤ ] = EZ[ ϕ(γ∗⊤Z) · ϕ(γ∗⊤Z)⊤ ] ⪰ µI.

For (5) and (6), we have

E[ ∥X′X⊤ − EX|Z[X]EX|Z[X]⊤∥² ]
  = E[ ∥ϵ2′ϕ(γ∗⊤Z)⊤ + ϕ(γ∗⊤Z)ϵ2⊤ + ϵ2′ϵ2⊤∥² ]
  ≤ 3E[ ∥ϵ2′ϕ(γ∗⊤Z)⊤∥² + ∥ϕ(γ∗⊤Z)ϵ2⊤∥² + ∥ϵ2′ϵ2⊤∥² ]
  = 3E[ ∥ϕ(γ∗⊤Z)ϵ2′⊤ϵ2′ϕ(γ∗⊤Z)⊤∥ + ∥ϕ(γ∗⊤Z)ϵ2⊤ϵ2ϕ(γ∗⊤Z)⊤∥ + |ϵ2⊤ϵ2′|² ] = O(dx²),   (25)

and

E[ ∥YX′ − EY|Z[Y]EX|Z[X]∥² ]
  = E[ ∥X′X⊤θ∗ + ϵ1X′ − EX|Z[X]EX|Z[X]⊤θ∗∥² ]
  ≤ 2E[ ∥X′X⊤θ∗ − EX|Z[X]EX|Z[X]⊤θ∗∥² ] + 2E[ ϵ1²∥ϕ(γ∗⊤Z) + ϵ2∥² ]
  = O( ∥θ∗∥²σϵ2²dx² + σϵ1²dx + σϵ1,ϵ2²dx ),

where the first inequality uses the Cauchy–Schwarz inequality, and the second equality uses (9), (10), and (25).
For (7) we have

E[ ∥EX|Z[X] · EX|Z[X]⊤ − EZ[ EX|Z[X] · EX|Z[X]⊤ ]∥² ]
  = E[ ∥ϕ(γ∗⊤Z)ϕ(γ∗⊤Z)⊤ − E[ ϕ(γ∗⊤Z)ϕ(γ∗⊤Z)⊤ ]∥² ] = O(dz),

where the last equality uses (10). Using the above conclusion in (8), we have

E[ ∥EY|Z[Y] · EX|Z[X] − EZ[ EY|Z[Y] · EX|Z[X] ]∥² ]
  = E[ ∥EX|Z[X] · EX|Z[X]⊤θ∗ − EZ[ EX|Z[X] · EX|Z[X]⊤ ]θ∗∥² ] = O(∥θ∗∥²dz).

E Proofs for Section 3


E.1 Proof of Theorem 2
Proof of Theorem 2. Recall that ξZt and ξZtYt are the i.i.d. noise sequences

ξZt = ΣZ − ZtZt⊤,
ξZtYt = ΣZY − ZtYt.

Note that γ∗ and θ∗ can be written as γ∗ = ΣZ^{−1}ΣZX ∈ R^{dz×dx} and θ∗ = (γ∗⊤ΣZγ∗)^{−1}γ∗⊤ΣZY ∈ R^{dx}, which we are going to use throughout the proof.
To quantify the bias, we use the following bound on E[ ∥γt − γ∗∥^k ], k = 1, 2, 4, proved in Lemma 3.2 of Chen et al. (2020).


Lemma 3. Suppose Assumption 2.4 and Assumption 3.2 hold. Then we have

E[ ∥γt − γ∗∥^k ] = O( (dzβt)^{k/2} )  for k = 1, 2, 4.   (26)

We proceed by noting that if γ∗, ΣZ, and ΣZY were known beforehand, one could use the following
deterministic gradient updates to obtain θ∗:

θ̃t+1 = θ̃t − αt+1( γ∗⊤ΣZγ∗θ̃t − γ∗⊤ΣZY ).   (27)

Lemma 4. Let Assumption 2.2 be true. Then, choosing αk = O(k^{−a}) with 1/2 < a < 1, we have
∥θ̃t − θ∗∥ = O( exp(−t^{1−a}) ).
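For intuition, Lemma 4 follows by unrolling (27): subtracting θ∗ from both sides (and using θ∗ = (γ∗⊤ΣZγ∗)^{−1}γ∗⊤ΣZY together with γ∗⊤ΣZγ∗ ⪰ µI from Assumption 2.2) gives

θ̃t+1 − θ∗ = ( I − αt+1γ∗⊤ΣZγ∗ )(θ̃t − θ∗),

so that

∥θ̃t − θ∗∥ ≤ ∏_{k=1}^t (1 − αkµ) ∥θ̃0 − θ∗∥ ≤ exp( −µ Σ_{k=1}^t αk ) ∥θ̃0 − θ∗∥,

and since Σ_{k=1}^t k^{−a} ≳ t^{1−a} for a < 1, the right-hand side decays as exp(−c t^{1−a}) for a constant c > 0 depending on µ and the step-size constant.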
Define the sequence δt := θt − θ̃t. We will establish the convergence rate of E[∥δt∥²]. From (13) and (27), we have the following expansion of δt+1:

δt+1 = Qtδt + αt+1Dtθt + αt+1(γt − γ∗)⊤ΣZY − αt+1γt⊤ξZtYt + αt+1γt⊤ξZtγtθt,   (28)

where

Qt := I − αt+1γ∗⊤ΣZγ∗,
Dt := γ∗⊤ΣZγ∗ − γt⊤ΣZγt.
First we will establish an intermediate bound on E[∥δt∥²]. To do so, we will need the following result, which shows that E[∥θt − θ∗∥⁴] is bounded for all t; we prove it in Section E.2.


Lemma 5 (Boundedness of fourth moment of ∥θt − θ∗∥). Let the conditions in Theorem 2 be true. Then, choosing αt, βt such that αt ≤ dz^{−4κ−ϑ/2} and Σ_{t=1}^∞ ( αt² + αt√βt ) < ∞, we have that E[∥θt − θ∗∥⁴] is bounded by some constant M > 0.
Lemma 6 (Intermediate bound on E[∥δt∥²]). Let the conditions in Theorem 2 be true. We have the following intermediate bound on E[∥δt∥²]:

E[∥δt∥²] = O( βt dz^{1+2κ} + αt+1 dz^{4κ+ϑ/2} + √(dzβt) ).   (29)

Proof of Lemma 6. Recall the update for δt+1 obtained in (28):

δt+1 = Qtδt + αt+1Dtθt + αt+1(γt − γ∗)⊤ΣZY − αt+1γt⊤ξZtYt + αt+1γt⊤ξZtγtθt.

Then,

∥δt+1∥² = δt⊤Qt²δt + αt+1²∥Dtθt + (γt − γ∗)⊤ΣZY − γt⊤ξZtYt + γt⊤ξZtγtθt∥²
  + 2αt+1δt⊤Qt( Dtθt + (γt − γ∗)⊤ΣZY )   (30)
  + 2αt+1δt⊤Qt( γt⊤ξZtγtθt − γt⊤ξZtYt ).

Then, choosing α1(∥γ∗∥₂λZ)² < 1, using Young's inequality and Assumption 2.4, from (30) we get

Et[∥δt+1∥²] ≤ (1 − αt+1µ)∥δt∥² + 4αt+1²( ∥Dtθt∥² + ∥(γt − γ∗)⊤ΣZY∥² )
  + 4αt+1²( ∥γt∥₂² E[∥ξZtYt∥²] + ∥γt∥₂⁴ E[∥ξZt∥²]∥θt∥² )
  + 2αt+1δt⊤Qt( Dtθt + (γt − γ∗)⊤ΣZY )
≲ (1 − αt+1µ)∥δt∥² + 4αt+1²( ∥Dt∥²∥θt∥² + ∥(γt − γ∗)⊤ΣZY∥² )
  + 4Cαt+1²( dz^{2κ+ϑ/2} + dz^{4κ+ϑ/2}∥θt∥² )
  + 2αt+1δt⊤Qt( Dtθt + (γt − γ∗)⊤ΣZY ),

where the last inequality follows by Assumption 3.1 and Assumption 3.3.
Now, taking expectation on both sides, we obtain

E[∥δt+1∥²] ≲ (1 − αt+1µ)E[∥δt∥²] + 4αt+1²( E[∥Dt∥²∥θt∥²] + E[∥(γt − γ∗)⊤ΣZY∥²] )
  + 4Cαt+1²( dz^{2κ+ϑ/2} + dz^{4κ+ϑ/2} E[∥θt∥²] )   (31)
  + 2αt+1( E[|δt⊤QtDtθt|] + E[|δt⊤Qt(γt − γ∗)⊤ΣZY|] ).

Now, the following bounds are true:


1. We have that

αt+1² E[∥Dt∥²∥θt∥²] ≤ αt+1² √( E[∥Dt∥⁴] E[∥θt∥⁴] ) ≲ dz^{1+2κ} αt+1² βt,   (32)

where the first inequality follows by the Cauchy–Schwarz inequality, and the second follows by (42) and Lemma 5.
2. Using ∥ΣZY∥₂ = O(1) and Lemma 3, we get

αt+1² E[∥(γt − γ∗)⊤ΣZY∥²] ≲ dzβtαt+1².   (33)

3. We have that

αt+1 E[|δt⊤QtDtθt|] ≤ αt+1 E[ ∥δt∥₂∥Qt∥₂∥Dt∥₂∥θt∥₂ ]
  ≤ (αt+1µ/16) E[∥δt∥²] + (4αt+1/µ) √( E[∥Dt∥⁴] E[∥θt∥⁴] )
  ≲ (αt+1µ/16) E[∥δt∥²] + 4dz^{1+2κ}αt+1βt/µ,   (34)

where the first inequality follows by Hölder's inequality, the second follows by Young's inequality, the Cauchy–Schwarz inequality, and ∥Qt∥₂ < 1, and the third follows by (42) and Lemma 5.
4. Using ∥Qt∥₂ < 1, ∥ΣZY∥₂ = O(1), the Cauchy–Schwarz inequality, and Lemma 3, we get

αt+1 E[|δt⊤Qt(γt − γ∗)⊤ΣZY|] ≲ αt+1 E[ ∥δt∥₂∥γt − γ∗∥₂ ]
  ≤ αt+1 √( E[∥δt∥²] E[∥γt − γ∗∥²] )
  ≤ √(dzβt)αt+1/2 + √(dzβt)αt+1 E[∥δt∥²]/2.   (35)
Combining (31), (32), (33), (34), (35), and Lemma 5, we have

E[∥δt+1∥²] ≲ (1 − αt+1µ)E[∥δt∥²] + 4αt+1²βtdz^{1+2κ} + 4Cαt+1²dz^{4κ+ϑ/2}
  + 2αt+1( µE[∥δt∥²]/16 + 4dz^{1+2κ}βt/µ + √(dzβt)/2 + √(dzβt)E[∥δt∥²]/2 )   (36)
≲ ( 1 − 7µαt+1/8 + αt+1√(dzβt) )E[∥δt∥²] + ( 8αt+1²βtdz^{1+2κ}/µ + 4Cαt+1²dz^{4κ+ϑ/2} ) + αt+1√(dzβt)
≲ ( 1 − 3µαt+1/4 )E[∥δt∥²] + ( 8αt+1²βtdz^{1+2κ}/µ + 4Cαt+1²dz^{4κ+ϑ/2} ) + αt+1√(dzβt).   (37)


In the above, the third inequality follows by choosing βt ≤ µ²/(64dz) and αt+1√(dzβt) < 1. Then, from (37), we have

E[∥δt∥²] = O( βtdz^{1+2κ} + αt+1dz^{4κ+ϑ/2} + √(dzβt) ).

Coming back to the proof of Theorem 2, observe that we can sharpen the bound in (35) using Lemma 6, which allows us to avoid the use of Young's inequality. This leads to the following improved version of the recursion in (37), with which we can improve the term √(dzβt) in (29) as follows:

E[∥δt+1∥²] ≲ (1 − 7µαt+1/8)E[∥δt∥²]
  + αt+1 · O( βtdz^{1+2κ} + αt+1dz^{4κ+ϑ/2} + √(αt+1βt)dz^{1/2+2κ+ϑ/4} + (βtdz)^{3/4} )
= O( βtdz^{1+2κ} + αt+1dz^{4κ+ϑ/2} + √(αt+1βt)dz^{1/2+2κ+ϑ/4} + (βtdz)^{3/4} ).

In fact, this trick can be used repeatedly to sharpen the bound even further, as shown in Lemma 7.

Lemma 7 (Final improved bound on E[∥δt∥²]). Let the conditions in Theorem 2 be true. Then, using Lemma 6, we have

E[∥δt+1∥²] ≲ O( (dzβt)^{1−2^{−r−1}} + Σ_{i=0}^r ( αt+1^{2^{−i}} βt^{1−2^{−i}} dz^{1+(4κ+ϑ/2−1)2^{−i}} + βt(1 + αt+1^{2^{−i}}) dz^{1+2^{1−i}κ} ) ),

where r is any non-negative integer.

Proof of Lemma 7. If we have

E[∥δt∥²] = O( αt+1dz^{4κ+ϑ/2} + βtdz^{1+2κ} + √(dzβt) ),

then from (35), we have

E[|δt⊤Qt(γt − γ∗)⊤ΣZY|] ≲ √( E[∥δt∥²] E[∥γt − γ∗∥²] )
  = O( √(αt+1βt)dz^{1/2+2κ+ϑ/4} + βtdz^{1+κ} + (dzβt)^{3/4} ).   (38)

Then, similar to (36), we have

E[∥δt+1∥²] ≲ (1 − αt+1µ)E[∥δt∥²] + 4αt+1²βtdz^{1+2κ} + 4Cαt+1²dz^{4κ+ϑ/2}
  + 2αt+1( µE[∥δt∥²]/16 + 4dz^{1+2κ}βt/µ + √(αt+1βt)dz^{1/2+2κ+ϑ/4} + βtdz^{1+κ} + (dzβt)^{3/4} )
≲ (1 − 7µαt+1/8)E[∥δt∥²]
  + αt+1 · O( (dzβt)^{3/4} + Σ_{i=0}^1 ( αt+1^{2^{−i}} βt^{1−2^{−i}} dz^{1+(4κ+ϑ/2−1)2^{−i}} + βt(1 + αt+1^{2^{−i}}) dz^{1+2^{1−i}κ} ) )
= O( (dzβt)^{3/4} + Σ_{i=0}^1 ( αt+1^{2^{−i}} βt^{1−2^{−i}} dz^{1+(4κ+ϑ/2−1)2^{−i}} + βt(1 + αt+1^{2^{−i}}) dz^{1+2^{1−i}κ} ) ).

Now if we repeat this step r times (where r is to be set later), by progressive sharpening we get the following bound:

E[∥δt+1∥²] ≲ O( (dzβt)^{1−2^{−r−1}} + Σ_{i=0}^r ( αt+1^{2^{−i}} βt^{1−2^{−i}} dz^{1+(4κ+ϑ/2−1)2^{−i}} + βt(1 + αt+1^{2^{−i}}) dz^{1+2^{1−i}κ} ) ).

Coming back to the proof of Theorem 2, by combining Lemma 4 and Lemma 7 we have

E[∥θt − θ∗∥²] ≤ 2E[∥δt∥²] + 2E[∥θ̃t − θ∗∥²]
  = O( (dzβt)^{1−2^{−r−1}} + Σ_{i=0}^r ( αt+1^{2^{−i}} βt^{1−2^{−i}} dz^{1+(4κ+ϑ/2−1)2^{−i}} + βt(1 + αt+1^{2^{−i}}) dz^{1+2^{1−i}κ} ) ).   (39)

Now, in (39), for some arbitrarily small number ι > 0, choosing

αt = min( 0.5dz^{−4κ−ϑ/2}λZ^{−1}Cγ^{−2}, 0.5(∥γ∗∥₂λZ)^{−2} ) t^{−1+ι/2},  βt = µ²dz^{−1−2κ}t^{−1+ι/2}/128,

and setting r = ⌈log₂((ι/2)^{−1} − 1) − 1⌉, we get

E[∥θt − θ∗∥²] = O( max( t^{−1+ι}, t^{−1+ι/2} log((ι/2)^{−1} − 1) ) ).
 

E.2 Proof of Lemma 5


Proof. Using the form of θ∗, from (13) we get

θt+1 − θ∗ = Q̂t(θt − θ∗) + αt+1(γt − γ∗)⊤ΣZY + αt+1Dtθ∗ + αt+1γt⊤ξZtγt(θt − θ∗)
  + αt+1γt⊤ξZtγtθ∗ + αt+1γt⊤ξZtYt,   (40)

where Q̂t := I − αt+1γt⊤ΣZγt = Qt + αt+1Dt. Recall that Dt = γ∗⊤ΣZγ∗ − γt⊤ΣZγt. By Assumption 3.3, we have the following bound on ∥Dt∥₂:

∥Dt∥₂ = O( λZCγ²dz^{2κ} ).   (41)

We have the following bound on E[∥Dt∥⁴] by Lemma 3:

E[∥Dt∥⁴] = E[ ∥(γ∗ − γt)⊤ΣZγ∗ + γt⊤ΣZ(γ∗ − γt)∥⁴ ] = O( dz^{2+4κ}βt² ).   (42)

From (40), we have

∥θt+1 − θ∗∥² ≤ (θt − θ∗)⊤Q̂t²(θt − θ∗) + 3αt+1²∥γt⊤ξZtγt(θt − θ∗)∥²
  + 2αt+1(θt − θ∗)⊤Q̂t(γt − γ∗)⊤ΣZY + 2αt+1(θt − θ∗)⊤Q̂tDtθ∗ + A1,t + A2,t,   (43)

where

A1,t = αt+1²( ∥(γt − γ∗)⊤ΣZY∥² + ∥Dtθ∗∥² + 2ΣZY⊤(γt − γ∗)Dtθ∗ + 3∥γt⊤ξZtγtθ∗∥² + 3∥γt⊤ξZtYt∥² ),   (44)

and

A2,t = 2αt+1( Q̂t(θt − θ∗) + αt+1(γt − γ∗)⊤ΣZY + αt+1Dtθ∗ )⊤( γt⊤ξZtγt(θt − θ∗) + γt⊤ξZtγtθ∗ + γt⊤ξZtYt ).

Define

A3,t := 3αt+1²∥γt⊤ξZtγt(θt − θ∗)∥² + 2αt+1(θt − θ∗)⊤Q̂t(γt − γ∗)⊤ΣZY + 2αt+1(θt − θ∗)⊤Q̂tDtθ∗ + A1,t + A2,t.   (45)

Then, choosing Cγ²dz^{2κ}λZαt+1 < 1, which ensures ∥Q̂t∥ ≤ 1, we have

∥θt+1 − θ∗∥⁴ ≤ ∥θt − θ∗∥⁴ + 2(θt − θ∗)⊤Q̂t²(θt − θ∗)A3,t + A3,t².   (46)

We now have the following bounds:

1. Using Assumption 3.1 and Assumption 3.3,

αt+1⁴ E[∥γt⊤ξZtγt(θt − θ∗)∥⁴] ≲ dz^{8κ+ϑ}αt+1⁴ E[∥θt − θ∗∥⁴].   (47)

2. We have that

E[ ((θt − θ∗)⊤Q̂t(γt − γ∗)⊤ΣZY)² ] ≲ E[ ∥θt − θ∗∥²∥γt − γ∗∥² ]
  ≤ √( E[∥θt − θ∗∥⁴] E[∥γt − γ∗∥⁴] )
  ≤ dzβt( 1 + E[∥θt − θ∗∥⁴] )/2,   (48)

where the first inequality follows by ∥Q̂t∥₂ = O(1) and ∥ΣZY∥₂ = O(1), the second follows by the Cauchy–Schwarz inequality, and the last follows by √(ab) ≤ (a + b)/2 and Lemma 3.

3. We have that

E[ ((θt − θ∗)⊤Q̂tDtθ∗)² ] ≲ E[ ∥θt − θ∗∥²∥Dt∥² ]
  ≤ √( E[∥θt − θ∗∥⁴] E[∥Dt∥⁴] )
  ≲ dz^{1+2κ}βt( 1 + E[∥θt − θ∗∥⁴] )/2,   (49)

where the first inequality follows by ∥Q̂t∥₂ = O(1) and ∥θ∗∥₂ = O(1), the second follows by the Cauchy–Schwarz inequality, and the last follows by √(ab) ≤ (a + b)/2 and (42).

4. Using Assumption 3.1, Assumption 3.3, (42), and Lemma 3, we have

E[A1,t²] = O( dz^{8κ+ϑ}αt+1⁴ ).   (50)

5. Using Young’s inequality, Assumption 3.1, Assumption 3.3, Lemma 3, ∥ΣZY ∥2 = O(1), ∥θ∗ ∥2 = O(1),
and (42), we have
h i
E A22,t ≤2αt+12 b t (θt − θ∗ ) + αt+1 (γt − γ∗ )⊤ ΣZY + αt+1 Dt θ∗ ∥42
 
E ∥Q
h i
2
+ 2αt+1 E ∥γt ⊤ ξZt γt (θt − θ∗ ) + γt ⊤ ξZt γt θ∗ + γt ⊤ ξZt Yt ∥42
2
d8κ+ϑ (1 + E ∥θt − θ∗ ∥42 ).
 
≲αt+1 z (51)

6. Using ∥Q̂t∥₂ = O(1), Assumption 3.1, and Assumption 3.3,

αt+1² E[ (θt − θ∗)⊤Q̂t²(θt − θ∗)∥γt⊤ξZtγt(θt − θ∗)∥² ] ≲ αt+1²dz^{4κ+ϑ/2} E[∥θt − θ∗∥⁴].   (52)

7. We have that

αt+1 E[ |(θt − θ∗)⊤Q̂t²(θt − θ∗)(θt − θ∗)⊤Q̂t(γt − γ∗)⊤ΣZY| ]
  ≲ αt+1 E[ ∥θt − θ∗∥³∥γt − γ∗∥₂ ]
  ≤ αt+1 ( E[∥θt − θ∗∥⁴] )^{3/4} ( E[∥γt − γ∗∥⁴] )^{1/4}
  ≤ αt+1√(dzβt) ( E[∥θt − θ∗∥⁴] )^{3/4}
  ≤ (3αt+1√(dzβt)/4) E[∥θt − θ∗∥⁴] + αt+1√(dzβt)/4,   (53)

where the first inequality follows by ∥Q̂t∥₂ = O(1) and ∥ΣZY∥₂ = O(1), the second follows by Hölder's inequality, the third follows by Lemma 3, and the fourth follows by Young's inequality.

8. Similar to (53), we have

αt+1 E[ |(θt − θ∗)⊤Q̂t²(θt − θ∗)(θt − θ∗)⊤Q̂tDtθ∗| ]
  ≤ (3dz^{1/2+κ}αt+1√βt/4) E[∥θt − θ∗∥⁴] + dz^{1/2+κ}αt+1√βt/4.   (54)

9. Using ∥Q̂t∥₂ = O(1), the Cauchy–Schwarz inequality, (50), and Young's inequality,

E[ (θt − θ∗)⊤Q̂t²(θt − θ∗)A1,t ] ≤ E[ ∥θt − θ∗∥²A1,t ]
  ≤ √( E[∥θt − θ∗∥⁴] E[A1,t²] )
  ≲ dz^{4κ+ϑ/2}αt+1²( 1 + E[∥θt − θ∗∥⁴] ).   (55)

10. By Assumption 2.4, we have

Et[ (θt − θ∗)⊤Q̂t²(θt − θ∗)A2,t ] = 0.   (56)

Now using Jensen’s inequality, and combining (47), (48), (49), (50), and (51), we have,
h i h i
E A23,t ≤45αt+14
E ∥γt ⊤ ξZt γt (θt − θ∗ )∥42 + 20αt+1 2 b t (γt − γ∗ )⊤ ΣZY )2
E ((θt − θ∗ )⊤ Q
 
h i
2
E ((θt − θ∗ )⊤ Qb t Dt θ∗ )2 + 5E A2 + 5E A2
   
+ 20αt+1 1,t 2,t
4
dϑz 7 +8κ E ∥θt − θ∗ ∥42 + dz αt+1 2
βt 1 + E ∥θt − θ∗ ∥42
   
≲αt+1
2
d1+2κ βt 1 + E ∥θt − θ∗ ∥42 + d8κ+ϑ 4 2
d8κ+ϑ (1 + E ∥θt − θ∗ ∥42 )
   
+ αt+1 z z αt+1 + αt+1 z
2
d8κ+ϑ 1 + E ∥θt − θ∗ ∥42 .
 
≲αt+1 z (57)

Combining (52), (53), (54), (55), and (56), we get

E[ (θt − θ∗)⊤Q̂t²(θt − θ∗)A3,t ]
≲ αt+1²dz^{4κ+ϑ/2} E[∥θt − θ∗∥⁴] + (3αt+1√(dzβt)/4) E[∥θt − θ∗∥⁴] + αt+1√(dzβt)/4
  + (3dz^{1/2+κ}αt+1√βt/4) E[∥θt − θ∗∥⁴] + dz^{1/2+κ}αt+1√βt/4 + dz^{4κ+ϑ/2}αt+1²( 1 + E[∥θt − θ∗∥⁴] )
≲ ( αt+1²dz^{4κ+ϑ/2} + αt+1√βt+1 dz^{1/2+κ} )( 1 + E[∥θt − θ∗∥⁴] ).   (58)

Combining (46), (57), and (58), we have

E[∥θt+1 − θ∗∥⁴] ≲ ( 1 + αt+1²dz^{8κ+ϑ} + αt+1√βt+1 dz^{1/2+κ} )( 1 + E[∥θt − θ∗∥⁴] ).   (59)
Now choosing αt, βt such that αt ≤ dz^{−4κ−ϑ/2} and Σ_{t=1}^∞ ( αt+1² + αt+1√βt+1 ) < ∞, we get

E[∥θt − θ∗∥⁴] ≤ M,   (60)

for some constant 0 ≤ M < ∞.

E.3 Comment on the convergence of (11)


We now discuss the convergence properties of the update sequence (11), which we refer to as the conditional
stochastic optimization (CSO) based updates and restate below:

θt+1 = θt − αt+1γt⊤Zt(Xt⊤θt − Yt),  γt+1 = γt − βt+1Zt(Zt⊤γt − Xt⊤).

Similar to (40), for the above updates, we have the following expansion:

θt+1 − θ∗ = Q̂t(θt − θ∗) + αt+1(γt − γ∗)⊤ΣZY + αt+1Dtθ∗ + αt+1γt⊤ξZtγ∗(θt − θ∗)
  + αt+1γt⊤ξZtγ∗θ∗ + αt+1γt⊤ξZtYt − αt+1γt⊤Ztϵ2,t⊤θt,

where ξZt = ΣZ − ZtZt⊤, ξZtYt = ΣZY − ZtYt, Q̂t := I − αt+1γt⊤ΣZγ∗ = Qt + αt+1Dt, and Dt = (γ∗ − γt)⊤ΣZγ∗.
Recall that the reason for the initial divergence of the updates in (11) is the potential negative eigenvalues
of γt⊤ΣZγ∗. Here we will show that if γt⊤ΣZγ∗ is positive semi-definite, or if γt is close enough to γ∗ so
that the negative eigenvalues (if any) are not too large in absolute value, then the updates in (11) indeed
exhibit the same convergence rate as Algorithm 2.
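As a concrete one-dimensional illustration (using the Figure 1 configuration from Section C.2, and assuming Z is standardized so that ΣZ = 1): with γ0 = 10 and γ∗ = −1 we have γ0ΣZγ∗ = −10 < 0, so in expectation the first θ-update multiplies θ0 − θ∗ by 1 − α1γ0ΣZγ∗ = 1 + 10α1 > 1, which is precisely the initial expansion visible in Figure 1.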

Assumption E.1. Let either of the following two conditions be true for all t ≥ t0:
1. γt⊤ΣZγ∗ is positive semidefinite.
2. ∥γt − γ∗∥₂ ≲ √(dzβt).
Note that Condition 1 of Assumption E.1 is an idealized condition that is difficult to ensure for all t in
practice. But of course, if it holds, then γt⊤ΣZγ∗ has no negative eigenvalue to cause divergence, and the
proof then follows exactly like that of Lemma 5.
Hence, we will focus on the more realistic Condition 2 of Assumption E.1, which holds true asymptotically
almost surely (Polyak and Juditsky, 1992). Since we are interested in the asymptotic rate of convergence of
the CSO updates (due to the requirement of Assumption E.1), we will only concentrate on the iterations
t ≥ t0. In this case, the proof steps are similar to Theorem 2 except for two major differences, which we
discuss below.

Difference 1: Potential negative definiteness of γt⊤ΣZγ∗.

Under Condition 2, γt⊤ΣZγ∗ can indeed be negative definite. In general, if γt⊤ΣZγ∗ is negative definite,
that is undesirable, as we explain in Section 3. In terms of the proof, we can no longer write
(θt − θ∗)⊤Q̂t⊤Q̂t(θt − θ∗) ≤ ∥θt − θ∗∥² (which was possible to do in (43) in the proof of Lemma 5).
Subsequently, (46) breaks down. But we will show that under Condition 2 the negative eigenvalues are not
too large in absolute value. Specifically, we can write
(θt − θ∗)⊤Q̂t⊤Q̂t(θt − θ∗)
  = (θt − θ∗)⊤( Qt² + αt+1Qt⊤Dt + αt+1Dt⊤Qt + αt+1²Dt⊤Dt )(θt − θ∗)   (61)
  ≤ ( 1 + 2αt+1∥Dt∥ )∥θt − θ∗∥² + αt+1²∥Dt∥²∥θt − θ∗∥²
  ≤ ( 1 + 2αt+1√(dzβt) )∥θt − θ∗∥² + αt+1²∥Dt∥²∥θt − θ∗∥².
The term αt+1²∥Dt∥²∥θt − θ∗∥² is of the order of A3,t defined in (45). Moreover, αt+1√(dzβt) is small enough
in the sense that we choose the stepsizes such that Σ_{t=1}^∞ ( αt+1² + αt+1√βt ) < ∞. Using this, one can now
show a bound similar to (59) and consequently show that E[∥θt − θ∗∥⁴] is bounded.
Now let us see what happens in the absence of Condition 2. Here one could use the fact that
(1 + 2αt+1∥Dt∥) ≲ (1 + 2Cγαt+1dz^κ), which is too big: recall that we want something at least of the order
of αt+1√βt to show that the θt sequence is bounded. One could also try to use the fact that E[∥Dt∥] is
small by Lemma 3. But since Dt and θt are interdependent, one needs to decouple them. One way to do
this would be to use the Cauchy–Schwarz inequality, as shown below:

E[ ∥Dt∥∥θt − θ∗∥² ] ≤ √( E[∥Dt∥²] E[∥θt − θ∗∥⁴] ) ≲ √(dzβt) √( E[∥θt − θ∗∥⁴] ).

But that leads to the presence of E[∥θt − θ∗∥⁴] in (43), which is potentially problematic because on the
left-hand side we have E[∥θt+1 − θ∗∥²].

Difference 2: Presence of the additional error term αt+1γt⊤Ztϵ2,t⊤θt.

When comparing (12) with (40), yet another crucial difference is the presence of the term αt+1γt⊤Ztϵ2,t⊤θt.
We will show by the following observations that this error term gets absorbed by other terms already present
in (40) without affecting the convergence rate. Specifically, the following holds.
1. Using the independence between Zt and ϵ2,t, and by Assumption 2.4, we have

Et[ ( Q̂t(θt − θ∗) + αt+1(γt − γ∗)⊤ΣZY + αt+1Dtθ∗ + αt+1γt⊤ξZtγ∗(θt − θ∗) + αt+1γt⊤ξZtγ∗θ∗ )⊤ γt⊤Ztϵ2,t⊤θt ] = 0.

2. We also have that

αt+1² Et[ (γt⊤ξZtYt)⊤γt⊤Ztϵ2,t⊤θt ]
  = αt+1²( γt⊤ΣZγt∥θ∗∥² + γt⊤ΣZγtθ∗⊤(θt − θ∗) )
  ≤ αt+1²( γt⊤ΣZγt∥θ∗∥² + ∥γt⊤ΣZγt(θt − θ∗)∥² + ∥θ∗∥² ).

This shows that the above term is of the same order as A1,t and A3,t defined in (44) and (45).

3. Finally, we have

αt+1² Et[ ∥γt⊤Ztϵ2,t⊤θt∥² ] ≲ αt+1²( ∥γt∥²∥θt − θ∗∥² + ∥γt∥²∥θ∗∥² ).

So this term is of the order of A3,t as well.

Combining the above facts and following a procedure similar to the proof of Theorem 2, one can show
that the CSO updates achieve a similar rate under the additional Assumption E.1.

