
Proceedings of the 2022 Winter Simulation Conference

B. Feng, G. Pedrielli, Y. Peng, S. Shashaani, E. Song, C.G. Corlu, L.H. Lee, E.P. Chew, T. Roeder, and
P. Lendermann, eds.

SAMPLE AVERAGE APPROXIMATION OVER FUNCTION SPACES:


STATISTICAL CONSISTENCY AND RATE OF CONVERGENCE

Zihe Zhou
Harsha Honnappa
School of Industrial Engineering
Purdue University
150 N. University St
West Lafayette, IN 47907, USA

Raghu Pasupathy
Department of Statistics
Purdue University
150 N. University St
West Lafayette, IN 47907
and
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Chennai, 600036, INDIA

ABSTRACT
This paper considers sample average approximation (SAA) of a general class of stochastic optimization
problems over a function space constraint set and driven by “regulated” Gaussian processes. We establish statistical consistency by proving equiconvergence of the SAA estimator via a sophisticated sample
complexity result. Next, recognizing that implementation over such infinite-dimensional spaces is possible
only if numerical optimization is performed over a finite-dimensional subspace of the constraint set, and if
sample paths of the driving process can be generated over a finite grid, we identify the decay rate of the
SAA estimator’s expected optimality gap as a function of the optimization error, Monte Carlo sampling
error, path generation approximation error, and subspace projection error.

1 INTRODUCTION
We consider infinite-dimensional stochastic optimization problems of the form
\[
  \min\; J(F) := \int_C \tilde{J}(x)\, d\pi_F^\Gamma(x) = \int_C \tilde{J}\circ\Gamma(F + z)\, d\pi_0(z)
  \qquad \text{s.t. } F \in \mathcal{F}, \tag{OPT}
\]

where $\tilde{J} : C \to \mathbb{R}$ is some “cost” functional, $C$ is the space of $\mathbb{R}$-valued continuous functions with domain $[0, T]$ and equipped with the supremum norm, $\mathcal{F} \subset C$ is a subspace of $C$, and $F \in \mathcal{F}$ is the “decision variable”. The functional $\tilde{J}$ takes as argument “paths” $X^F := \Gamma(F + Z)$, where $\Gamma : C \to C$ is a continuous “regulator” map that confines $Z + F$ to a subdomain of $C$, $Z$ is a $C$-valued Gaussian random variable that induces a measure $\pi_0$ on the Borel space $(C, \mathcal{C})$, and $X^F$ induces the ‘push-forward’ measure $\pi_F^\Gamma$ (see Definition 8 below).

1.1 Motivating Examples


Roughly speaking, the problem formulation in (OPT) asks for the extent to which Gaussian paths $Z$ should be (additively) shifted so that the resulting cost $\tilde{J}\circ\Gamma(F+Z)$ is minimized in expectation. And, as the following examples suggest, (OPT) subsumes a multitude of problems in operations research, optimal control, and machine learning, when formulated as stochastic optimization problems driven by Gaussian processes.
Example 1 Let $\mathcal{F} = W_0^{1,2}$, the Sobolev space consisting of $\mathbb{R}$-valued absolutely continuous functions with $L^2$-integrable derivatives and initial value 0. If $Z = \sigma B$, where $B$ is a Wiener process with measure $\pi_0$ and $\sigma > 0$, then $(C, \mathcal{F}, \pi_0)$ is the classic Cameron-Martin-Wiener space. Let $\Gamma$ be the so-called Skorokhod regulator map (Chen and Yao 2001, Ch. 5), which satisfies $\Gamma(x)(t) = x(t) + \sup_{0 \le s \le t} \max\{-x(s), 0\}$ for any function $x \in C$. Then, the random variable $X^F$ is a so-called reflected Brownian motion (RBM) with drift $F$. Consider a cost functional over $x \in C$, $x \mapsto \tilde{J}(x) := a_1 \int_0^T g(x(s))\,ds + a_2 G(x(T))$, where $(a_1, a_2) \in \mathbb{R}^2$, and $g : \mathbb{R} \to \mathbb{R}$ and $G : \mathbb{R} \to \mathbb{R}$ are well-defined functions. The corresponding optimization problem represents a class of ‘open-loop’ optimal control problems over $W_0^{1,2}$, driven by an RBM. This class of problems arises in nonstationary queueing network control, scheduling and inventory control.
Example 2 Suppose that $X^F = F + Z$, $Z = \sigma B$ and $\mathcal{F} := \{F \in C : \mathcal{L}_F\, p = 0,\ p(0) = \delta_{x_0}\}$, where $\mathcal{L}_F \equiv \partial_t + F'(t)\partial_x - \frac{\sigma^2}{2}\partial_{xx}$ is the Fokker-Planck partial differential operator corresponding to $X^F$, $F'(t) = dF(t)/dt$ and $p(0) = \delta_{x_0}$ is the initial condition. The solution of this equation is the marginal density $p_F(t,\cdot)$ of $X^F(t)$. Consider the cost functional $\tilde{J}(x) := \log\left(p_F(T, x(T))/p_0(x(T))\right)$, where $p_0(\cdot)$ is a reference density function, and the optimization problem $\min_{F\in\mathcal{F}} \mathbb{E}[\tilde{J}(X^F)] = \int_{\mathbb{R}^d} \log\left(\frac{p_F(T,z)}{p_0(z)}\right) p_F(T,z)\,dz$. Roughly speaking, this problem computes arbitrary Gaussian approximations $p_F(T,\cdot)$ to $p_0(\cdot)$ by minimizing the Kullback-Leibler divergence between these densities. This formulation underlies the use of so-called stochastic normalizing flows for variational inference (VI) in probabilistic machine learning.
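The following one-dimensional sketch illustrates the objective of Example 2 under assumed choices ($\sigma = 1$, $T = 1$, the identity regulator, and a standard-normal reference $p_0$); the marginal of $X^F(T)$ is then Gaussian, and the objective reduces to a KL divergence that can be estimated by Monte Carlo.

import numpy as np
from scipy.stats import norm

T, sigma = 1.0, 1.0
p0 = norm(loc=0.0, scale=1.0)                      # reference density p_0

def kl_objective(F_T, n_samples=100_000, seed=1):
    """Monte Carlo estimate of E[ log p_F(T, X^F(T)) - log p0(X^F(T)) ]."""
    rng = np.random.default_rng(seed)
    pF = norm(loc=F_T, scale=sigma * np.sqrt(T))   # marginal density of X^F(T)
    x = rng.normal(F_T, sigma * np.sqrt(T), n_samples)
    return np.mean(pF.logpdf(x) - p0.logpdf(x))

for F_T in [0.0, 0.5, 1.0]:
    print(F_T, kl_objective(F_T))                  # KL grows as F(T) moves away from 0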

1.2 Method and Overview of Results


Analytical solutions to (OPT) are accessible only in a few special cases, and simulation optimization is
natural and almost imperative. This paper explores the use of sample average approximation (SAA) toward
solving (OPT). Specifically, recall that the SAA approximation of (OPT) is defined as
\[
  \min\; J_N(F) := \frac{1}{N}\sum_{j=1}^N (\tilde{J}\circ\Gamma)(Z_j + F), \qquad Z_j \overset{\text{iid}}{\sim} \pi_0,
  \qquad \text{s.t. } F \in \mathcal{F}, \tag{MC-OPT}
\]

where the random variables Z := (Z1 , · · · , ZN ) are independent and identically distributed (iid). The main
idea in SAA is the recognition that since (MC-OPT) is a deterministic optimization problem that in a sense
approximates (OPT), a solution to (MC-OPT) might reasonably be expected to approximate a solution
to (OPT). While this idea is sound in principle, the context raises a number of statistical questions that
need resolution. Accordingly, this paper establishes the following two “first order” results.
1. Asymptotic Consistency. We first demonstrate that the optimal value and optimizers of (MC-OPT) are
asymptotically consistent (in the number of samples N from π0 ) by proving convergence in probability. Our
approach to this first establishes a novel uniform equiconvergence result over function spaces by showing
that the Gaussian complexity of the SAA estimator of the objective is inversely proportional to $\sqrt{N}$ (for
every N), assuming the diameter of the constraint set $\mathcal{F}$ is bounded.
2. Rate of Convergence. The Gaussian paths {Z j , j ≥ 1} from π0 in (MC-OPT) cannot in general be
sampled directly. For instance, if Z1 is a Brownian motion, then sample paths may (only) be approximated
using Euler-Maruyama or Euler-Milstein schemes (Asmussen and Glynn 2007). In other words, the
problem in (MC-OPT) is “fictitious” from the standpoint of computation and a further approximation
to (MC-OPT) is necessary for implementation. Furthermore, since F might be infinite dimensional, the
solving of (MC-OPT) must be performed (only) over a finite-dimensional subspace of the constraint set F
to allow computation using a method such as gradient descent. Our second main result is a convergence
rate result that accounts for the above two sources of error, and quantifies the expected decay rate of the true optimality gap of a solution obtained by executing mirror descent on a finite-dimensional approximation of (MC-OPT) generated using approximations to $\{Z_j, j \ge 1\}$. The convergence rate result clarifies the
relationship between four sources of error: (i) numerical optimization error due to the use of an iterative
scheme such as mirror descent; (ii) Monte Carlo sampling error; (iii) path approximation error due to
“time” discretization; and (iv) projection error due to the use of a finite-dimensional subspace in lieu of F .
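As a concrete, purely illustrative instance of the SAA objective in (MC-OPT), the following sketch averages an assumed cost functional over simulated Brownian paths on a uniform grid; the identity regulator, the grid, and the cost $\tilde{J}(x) = \int_0^T x(s)^2\,ds$ are assumptions made only for illustration.

import numpy as np

rng = np.random.default_rng(0)
T, m, N = 1.0, 500, 200
t = np.linspace(0.0, T, m + 1)
dt = T / m

def J_tilde(path):
    return np.sum(path[:-1] ** 2) * dt                     # illustrative cost functional

def sample_brownian():
    return np.concatenate(([0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), m))))

Z = [sample_brownian() for _ in range(N)]                  # iid draws approximating pi_0

def J_N(F):
    """SAA objective J_N(F) = (1/N) * sum_j J_tilde(Z_j + F)."""
    return np.mean([J_tilde(Zj + F) for Zj in Z])

print(J_N(np.zeros_like(t)), J_N(-0.5 * t))                # compare two candidate shifts F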

1.3 Literature
The use of SAA methodology toward stochastic optimization in $\mathbb{R}^d$ has an extensive literature, as comprehensively surveyed in (Shapiro 2003; Shapiro et al. 2009; Kim et al. 2014; Pasupathy 2010; Banholzer et al. 2019). Corresponding results in the non-Euclidean setting appear in (Dupačová and Wets 1988; Robinson 1996), where the feasible set $\mathcal{F}$ is assumed to be finite dimensional. Especially relevant to what we
present here is the extensive treatment of consistency and uniform rate properties of M-estimators (van de
Geer 2000; Bose 1998) in normed spaces. (A solution to (OPT) is indeed an M-estimator.) However, we
are not aware of infinite-dimensional SAA rate results in the typical context where the SAA estimator is
not available in “closed form” but is computed using an iterative optimization technique. Nonetheless, as
pointed out in the introduction, there are a number of problems that require optimization over function
spaces, wherein SAA is a natural approximation to such problems.

2 PRELIMINARIES
In this section, we discuss mathematical preliminaries including key definitions, assumptions, and notation.

2.1 Key Definitions


In the definitions that follow the space F is a subspace of a normed space X over R. Recall that a Banach
space is a complete normed space.
Definition 1 (Linear Functionals) $x : \mathcal{F} \to \mathbb{R}$ is called a linear functional on the (real) normed space $\mathcal{F}$ if
\[
  x(\alpha F) = \alpha\, x(F), \quad \alpha \in \mathbb{R}; \qquad x(F_1 + F_2) = x(F_1) + x(F_2), \quad F_1, F_2 \in \mathcal{F}.
\]
A linear functional $x : \mathcal{F} \to \mathbb{R}$ is said to be a bounded linear functional if
\[
  \|x\| := \sup\{|x(F)| : \|F\| = 1,\ F \in \mathcal{F}\} < \infty.
\]
It can be shown that $x : \mathcal{F} \to \mathbb{R}$ is a bounded linear functional if and only if $x$ is continuous on $\mathcal{F}$, and that continuity of $x$ at any point $F_0 \in \mathcal{F}$ implies boundedness of $x$. (It is important that $x : \mathcal{F} \to \mathbb{R}$ being bounded does not mean $\sup_{F\in\mathcal{F}} |x(F)| < \infty$; indeed, it is routinely the case that $\|x\| < \infty$ but $\sup_{F\in\mathcal{F}} |x(F)| = \infty$.)
Definition 2 (Dual Space, Adjoint Space, Conjugate Space) The space $\mathcal{F}^*$ of linear functionals on $\mathcal{F}$ is called the algebraic dual space of $\mathcal{F}$. $\mathcal{F}^*$ should be distinguished from the dual space $\mathcal{F}'$, which is the space of bounded linear functionals on $\mathcal{F}$. $\mathcal{F}^*$ is sometimes also called the adjoint space or the conjugate space of $\mathcal{F}$.
Definition 3 (Dual Norm) The (operator) norm of the functional $T \in \mathcal{F}^*$ is called the dual norm or conjugate norm of $T$:
\[
  \|T\|_* := \sup\left\{\frac{|T x|}{\|x\|} : x \in \mathcal{F},\ x \neq 0\right\} = \sup\left\{|T x| : x \in \mathcal{F},\ \|x\| = 1\right\}. \tag{1}
\]

Definition 4 (Right and Left Directional Derivatives) The right directional derivative $J'_+(F, v)$ and the left directional derivative $J'_-(F, v)$ of the functional $J : \mathcal{F} \to \mathbb{R}$ at the point $F \in \mathcal{F}$ are defined as
\[
  J'_+(F, v) := \lim_{t \to 0^+} \frac{1}{t}\left(J(F + tv) - J(F)\right); \qquad
  J'_-(F, v) := \lim_{t \to 0^-} \frac{1}{t}\left(J(F + tv) - J(F)\right).
\]

Definition 5 (Gâteaux and Fréchet Differentiability) The functional $J : \mathcal{F} \to \mathbb{R}$ is Gâteaux differentiable at $F \in \mathcal{F}$ if the limit
\[
  S_J(F)(v) := \lim_{t \to 0} \frac{1}{t}\left(J(F + tv) - J(F)\right) \tag{2}
\]
exists for each $v \in \mathcal{F}$, and $S_J(F) \in \mathcal{F}'$, that is, $S_J(F) : \mathcal{F} \to \mathbb{R}$ is a bounded linear functional. The functional $J : \mathcal{F} \to \mathbb{R}$ is Fréchet differentiable if the limit in (2) is uniform in $v$, that is, $|J(F + v) - (J(F) + S_J(F)(v))| = o(\|v\|)$, $v \in \mathcal{F}$.
From Definition 4 and Definition 5, we see that Fréchet differentiability $\Rightarrow$ Gâteaux differentiability $\Rightarrow$ existence of directional derivatives. Also, if $\mathcal{F}$ is finite-dimensional and $J$ is Lipschitz in some neighborhood of $F \in \mathcal{F}$, then $J$ is Fréchet differentiable at $F$ if and only if it is Gâteaux differentiable at $F$.
Definition 6 (Subgradient and Subdifferentials of a Convex Functional) The functional $J : \mathcal{F} \to \mathbb{R}$ is convex if for any $\alpha \in [0, 1]$, $J(\alpha F_1 + (1-\alpha) F_2) \le \alpha J(F_1) + (1-\alpha) J(F_2)$, $\forall F_1, F_2 \in \mathcal{F}$. $S_J(F_0) \in \mathcal{F}'$ is called a subgradient to $J$ at $F_0 \in \mathcal{F}$ if
\[
  J(F) \ge J(F_0) + S_J(F_0)(F - F_0). \tag{3}
\]
The set $\partial J(F_0)$ of subgradients to $J$ at $F_0$ is called the subdifferential to $J$ at $F_0$. Convex functionals have a subdifferential structure in the sense that if $J : \mathcal{F} \to \mathbb{R}$ is convex, then $\partial J(F_0) \neq \emptyset$ for each $F_0 \in \mathcal{F}^\circ$; conversely, if $\partial J(F) \neq \emptyset$ for each $F \in \mathcal{F}^\circ$, then $J$ is necessarily a convex functional.
Definition 7 (Mirror Map) Suppose $\bar{\mathcal{D}} \supset \mathcal{F}$ and $\mathcal{D} \cap \mathcal{F} \neq \emptyset$. A map $\psi : \mathcal{D} \to \mathbb{R}$ is called a mirror map if it satisfies the following three conditions:

1. $\psi$ is Fréchet differentiable and strongly convex in $\mathcal{D}$;
2. for each $y \in \mathcal{F}^*$, there exists $F \in \mathcal{F}$ such that $\nabla\psi(F) = y$; and
3. $\|\nabla\psi(F)\|_* \to +\infty$ as $F \to \partial\mathcal{D}$.
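For intuition, a standard finite-dimensional example of a mirror map (adapted from Bubeck 2015; it is not part of this paper's setup) is the negative entropy on the positive orthant:
\[
  \psi(x) = \sum_{i=1}^{n} x_i \log x_i, \qquad \mathcal{D} = \mathbb{R}^{n}_{++}, \qquad
  \nabla\psi(x) = \bigl(1+\log x_1, \ldots, 1+\log x_n\bigr).
\]
Here (i) $\psi$ is differentiable and 1-strongly convex on the probability simplex with respect to $\|\cdot\|_1$ (by Pinsker's inequality), (ii) $\nabla\psi$ is onto, since $\nabla\psi(x) = y$ is solved by $x_i = e^{y_i - 1}$, and (iii) $\|\nabla\psi(x)\| \to +\infty$ as $x \to \partial\mathcal{D}$, i.e., as some $x_i \to 0$.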

Definition 8 (Push-Forward Measure) This paper focuses on stochastic optimization problems defined with respect to regulated Gaussian processes, $X^F = \Gamma(Z + F)$, whose paths are confined to a subdomain of $C$. To define the measure corresponding to such a regulated process, define the shift operator $T_g : C \to C$ for $g \in \mathcal{F}$ as $T_g(x) = g + x$, and the push-forward measure corresponding to the shift operator $\pi_g(A) := (T_g)_*(\pi_0)(A) = \pi_0(T_g^{-1}(A))$ for any $A \in \mathcal{C}$. Then, the push-forward measure corresponding to $X^F$ is defined as $\pi_F^\Gamma(A) = \pi_F(\Gamma^{-1}(A))$, for any $A \in \mathcal{C}$.

2.2 Key Assumptions


We now list key assumptions on the cost functional $\tilde{J} : C \to \mathbb{R}$ to be invoked in the results that follow.
Assumption 1 The cost functional $\tilde{J} : C \to \mathbb{R}$ is Fréchet differentiable.
Assumption 2 The cost functional $\tilde{J} : C \to \mathbb{R}$ satisfies
\[
  |\tilde{J}(x + F_1) - \tilde{J}(x + F_2)| \le K_x \|F_1 - F_2\|_\infty, \tag{4}
\]
where $K_x > 0$ for every sample path $x \in C$, $F_1, F_2 \in \mathcal{F}$, and $\mathbb{E}[K_Z^p] < +\infty$ for some $2 \le p < +\infty$.

Assumption 3 The cost functional $\tilde{J}$ is $\kappa$-Lipschitz in $z \in C$, i.e., for any $F \in \mathcal{F}$ and $z, z' \in C$, we have
\[
  |\tilde{J}(z + F) - \tilde{J}(z' + F)| \le \kappa \|z - z'\|_\infty. \tag{5}
\]
Assumption 4 The cost functional $\tilde{J} : C \to \mathbb{R}$ is sufficiently regular such that the composite functional $\tilde{J}\circ\Gamma : C \to \mathbb{R}$ is integrable, that is, $\int_C (\tilde{J}\circ\Gamma)(x + F)\, d\pi_0(x) < +\infty$.
Assumption 5 The composition $\tilde{J}\circ\Gamma : C \times \mathcal{F} \to \mathbb{R}$ is $L_{\Gamma,Z}$-Lipschitz in $F$, that is, for any $Z \in C$, $|\tilde{J}\circ\Gamma(Z + F_1) - \tilde{J}\circ\Gamma(Z + F_2)| \le L_{\Gamma,Z} \|F_1 - F_2\|$, where $\mathbb{E}[L_{\Gamma,Z}^2] < \infty$.

This assumption is easily satisfied by the Skorokhod regulator, which is 2-Lipschitz continuous in the space $C$.
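A short check of the 2-Lipschitz claim for the Skorokhod regulator of Example 1 (a sketch, not reproduced from the paper): for any $x, y \in C$ and $t \in [0, T]$,
\[
  \bigl|\Gamma(x)(t)-\Gamma(y)(t)\bigr|
  \le |x(t)-y(t)| + \Bigl|\sup_{0\le s\le t}\max\{-x(s),0\}-\sup_{0\le s\le t}\max\{-y(s),0\}\Bigr|
  \le 2\,\|x-y\|_\infty ,
\]
since $u \mapsto \max\{-u, 0\}$ is 1-Lipschitz and $|\sup_s f(s) - \sup_s g(s)| \le \sup_s |f(s) - g(s)|$.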
We assume that the subspace $\mathcal{F} \subseteq C$ satisfies
Assumption 6 $\mathcal{F}$ has a finite diameter. That is, $\mathrm{diam}(\mathcal{F}) := \sup_{F_1, F_2 \in \mathcal{F}} \|F_1 - F_2\|_\infty < +\infty$.
This is a reasonably strong assumption that is nonetheless satisfied in many problem settings; for instance, if the function class $\mathcal{F}$ is parameterized by a compact set. We also believe that it should be possible to relax this condition, at the expense of more complicated computations.

3 EQUICONVERGENCE AND CONSISTENCY


Our approach to proving consistency is to first establish equiconvergence of the SAA functional over the
subspace F . For simplicity, at the outset let us assume that Γ is the identity map. We will subsequently
observe that the forthcoming results extend to the reflected case. We prove equiconvergence by bounding
the Gaussian complexity of the SAA, defined as
\[
  R_N(\mathcal{F}) := \mathbb{E}_{g,\pi_0}\left[\sup_{F\in\mathcal{F}} \left\{\frac{1}{N}\sum_{i=1}^N g_i \tilde{J}(Z_i + F)\right\}\right], \tag{6}
\]
where the expectation is taken with respect to the Gaussian random vector $g \sim N(0, I_{N\times N})$, which is independent of the iid samples $Z := (Z_1, \cdots, Z_N)$.
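For illustration, the Gaussian complexity in (6) can be estimated by nested Monte Carlo when the supremum is restricted to a small candidate family; the following sketch does so under assumed choices of the cost functional, the identity regulator, and the candidate shifts (the restriction to a finite family is an illustration device, not part of the analysis).

import numpy as np

rng = np.random.default_rng(0)
T, m, N, n_rep = 1.0, 200, 50, 100
t = np.linspace(0.0, T, m + 1)
dt = T / m

def J_tilde(path):
    return np.sum(path[:-1] ** 2) * dt                     # illustrative cost functional

def sample_brownian():
    return np.concatenate(([0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), m))))

candidates = [a * t for a in np.linspace(-1.0, 1.0, 11)]   # finite stand-in for F

est = 0.0
for _ in range(n_rep):
    Z = [sample_brownian() for _ in range(N)]
    g = rng.standard_normal(N)                             # Gaussian multipliers g ~ N(0, I)
    est += max(np.mean(g * np.array([J_tilde(Zi + F) for Zi in Z]))
               for F in candidates) / n_rep
print("estimated Gaussian complexity over the candidate family:", est)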
Next, define the $\mathbb{R}^N$-valued random field $G_\cdot(\cdot)$ as $F \mapsto G_F(Z) := \left(\tilde{J}(Z_1 + F), \cdots, \tilde{J}(Z_N + F)\right)$. For each $z \in C^N$ define the set $B \equiv B(z) := \{G_F(z) : F \in \mathcal{F}\} \subseteq \mathbb{R}^N$, and the pseudometric $d : B \times B \to [0, \infty)$, given by
\[
  d(x, y) = \frac{1}{\sqrt{N}} \|K(z)\|_p \|F_x - F_y\|_\infty, \tag{7}
\]
where $F_x, F_y \in \mathcal{F}$ correspond to $x, y$ (respectively) through the map $G_\cdot$, $\|x\|_p := \left(\sum_{i=1}^N |x_i|^p\right)^{1/p}$ for any $x \in \mathbb{R}^N$, and the $K_{z_i}$ are the Lipschitz variables defined in Assumption 2 (with $K(z) := (K_{z_1}, \cdots, K_{z_N})$).
Let $\{Y_F(Z) : F \in \mathcal{F}\}$ be the real-valued random field defined as $Y_F(Z) := \frac{1}{\sqrt{N}}\sum_{i=1}^N g_i \tilde{J}(Z_i + F)$, where $g$ is an $N$-dimensional standard Gaussian random vector, as before. The next lemma shows that $\{Y_F(Z) : F \in \mathcal{F}\}$ satisfies a sub-Gaussian concentration inequality, conditioned on $Z$.
Lemma 1 For any $F, G \in \mathcal{F}$ such that $\|F - G\|_\infty \neq 0$ we have
\[
  P\left(|Y_F(z) - Y_G(z)| > u \mid Z = z\right) \le 2\exp\left(-\frac{u^2}{2 L^2\, d(G_F(z), G_G(z))^2}\right),
\]
where $d(\cdot,\cdot)$ is defined in (7) and $L := \sup_{y \in \mathbb{R}^N : \|y\|_2 = 1} \|y\|_q = 1$ for $q \ge 2$.

Proof. Fix $F, G \in \mathcal{F}$ such that $F \neq G$. By Hölder's inequality we have $|Y_F(Z) - Y_G(Z)| \le \frac{1}{\sqrt{N}} \|g\|_q \|G_F(Z) - G_G(Z)\|_p$, where $\frac{1}{p} + \frac{1}{q} = 1$ and $q \ge 2$. Next, following Assumption 2, we have
\[
  \|G_F - G_G\|_p = \left(\sum_{i=1}^N \left|\tilde{J}(Z_i + F) - \tilde{J}(Z_i + G)\right|^p\right)^{1/p}
  \le \left(\sum_{i=1}^N |K_{Z_i}|^p \|F - G\|_\infty^p\right)^{1/p} = \|F - G\|_\infty \|K(Z)\|_p.
\]


It follows that
\[
  P\left(|Y_F(z) - Y_G(z)| > u \mid Z = z\right) \le P\left(\|g\|_q > \frac{u\sqrt{N}}{\|K(z)\|_p \|F - G\|_\infty} \;\Big|\; Z = z\right). \tag{8}
\]
It is straightforward to see that $x \mapsto \|x\|_q$ is a Lipschitz function from $\mathbb{R}^N$ to $\mathbb{R}$. Then, by (Boucheron, Lugosi, and Massart 2013, Theorem 5.6), $\|g\|_q$ satisfies the sub-Gaussian concentration inequality $P(\|g\|_q > \varepsilon) \le 2\exp\left(-\frac{\varepsilon^2}{2L^2}\right)$, where $L$ is defined above. Applying this to (8) completes the proof.

Next, we show that an ε-cover under the pseudometric can be “translated” into a corresponding ε-cover
under the supremum-norm.
Lemma 2 Fix $\varepsilon > 0$. Let $z = (z_1, \cdots, z_N) \in C^N$ and suppose $B_1, \cdots, B_l \subset \mathbb{R}^N$ is an $\varepsilon$-cover of $B = \{G_F(z) : F \in \mathcal{F}\}$ under the pseudometric (7). Then, there exist subsets $B'_1, \cdots, B'_l$ that form an $\varepsilon'$-cover of $\mathcal{F}$ under the supremum norm $\|\cdot\|_\infty$, with $\varepsilon' = \frac{\varepsilon\sqrt{N}}{\|K(z)\|_p}$.

Proof. By definition, $B_i = \{y \in B : d(y_i, y) \le \varepsilon\}$ for some $y_i \in B$. Consider the set $\tilde{B}_i := \{F \in \mathcal{F} : G_F(z) \in B_i\}$. For any $F \in \tilde{B}_i$, we have $d(y_i, G_F(z)) = \frac{1}{\sqrt{N}}\|K(z)\|_p \|F_{y_i} - F\|_\infty \le \varepsilon$. It follows that $\|F_{y_i} - F\|_\infty \le \frac{\varepsilon\sqrt{N}}{\|K(z)\|_p} = \varepsilon'$.
Now, let $F' \in \mathcal{F} \setminus \cup_{i=1}^l \tilde{B}_i$. It follows that $\min_{1\le i\le l} \|F_{y_i} - F'\|_\infty > \varepsilon'$, implying that $d(y_i, G_{F'}(z)) > \varepsilon$ for every $i$. Therefore, $G_{F'}(z) \notin \cup_{i=1}^l B_i$. But this is a contradiction since $B_1, \cdots, B_l$ is an $\varepsilon$-cover of $B$, implying that $\mathcal{F} \setminus \cup_{i=1}^l \tilde{B}_i = \emptyset$.
The proof of equiconvergence in Theorem 2 below follows as a consequence of Proposition 1 and
Proposition 2 below.
Proposition 1 Suppose Assumption 1 and Assumption 6 hold. Furthermore, suppose that $\log N(\varepsilon, \mathcal{F}, \|\cdot\|_\infty) \le \varepsilon^{-1/\alpha}$ for some $\alpha > 1$ and $\varepsilon > 0$. Then, for any $F_0 \in \mathcal{F}$, there exists a constant $0 < C < +\infty$ such that
\[
  \mathbb{E}_g\left[\sup_{F\in\mathcal{F}} |Y_F - Y_{F_0}| \;\Big|\; Z = z\right] \le \frac{C\|K(z)\|_p}{\sqrt{N}}\left(\frac{\mathrm{diam}(\mathcal{F})}{2}\right)^{\frac{\alpha-1}{\alpha}}. \tag{9}
\]

Proof. It is straightforward to see that $\{Y_F(Z) : F \in \mathcal{F}\}$ is a separable random field. Further, by Lemma 1, $\{Y_F(Z) : F \in \mathcal{F}\}$ is sub-Gaussian. By Assumption 6 and the definition of the pseudometric $d$, we have $D := \sup_{z_1, z_2 \in B} d(z_1, z_2) < +\infty$. By Dudley's Theorem for separable random fields it follows that there exists a constant $0 < C_0 < +\infty$ such that
\[
  \mathbb{E}_g\left[\sup_{F\in\mathcal{F}} |Y_F - Y_{F_0}| \;\Big|\; Z = z\right] \le C_0 \int_0^{D/2} \sqrt{\log N(\varepsilon, B, d)}\, d\varepsilon.
\]
By Lemma 2 it follows that $N(\varepsilon, B, d) = N(\varepsilon', \mathcal{F}, \|\cdot\|_\infty)$, where $\varepsilon' = \frac{\varepsilon\sqrt{N}}{\|K(z)\|_p}$. Then, changing variables in the integral above to $\varepsilon'$, we have $D' = \frac{D\sqrt{N}}{\|K(z)\|_p} = \mathrm{diam}(\mathcal{F})$ and
\[
  \int_0^{D/2} \sqrt{\log N(\varepsilon, B, d)}\, d\varepsilon
  = \frac{\|K(z)\|_p}{\sqrt{N}} \int_0^{D'/2} \sqrt{\log N(\varepsilon, \mathcal{F}, \|\cdot\|_\infty)}\, d\varepsilon
  \le \frac{\|K(z)\|_p}{\sqrt{N}}\, \frac{\alpha}{\alpha-1} \left(\frac{\mathrm{diam}(\mathcal{F})}{2}\right)^{\frac{\alpha-1}{\alpha}}.
\]
Note that by Assumption 6 it follows that the right-hand side above is finite. Setting $C = C_0 \frac{\alpha}{\alpha-1}$ completes the proof.


Recall that the sub-Gaussian diameter for a metric probability space $(\mathcal{X}, d, \pi)$ with metric $d$ and measure $\pi$ is defined as $\Delta^2_{SG}(\mathcal{X}) := \sigma^{*2}(Y)$, where $\sigma^*(Y)$ is the smallest $\sigma$ that satisfies $\mathbb{E}\left[e^{\lambda Y}\right] \le e^{\sigma^2\lambda^2/2}$ for all $\lambda \in \mathbb{R}$, $Y := \epsilon\, d(X, X')$ is the symmetrized distance on the metric space $\mathcal{X}$, $\epsilon = \pm 1$ with probability $1/2$, and $X, X'$ are $\mathcal{X}$-valued random variables with measure $\pi$. Consider the following generalization of McDiarmid's inequality.
Theorem 1 (Theorem 1 (Kontorovich 2014)) Let $(\mathcal{X}, d, \pi)$ be a metric space that satisfies $\Delta_{SG}(\mathcal{X}) < +\infty$, and let $\varphi : \mathcal{X}^N \to \mathbb{R}$ be 1-Lipschitz. Then $\mathbb{E}_\pi[\varphi(Z)] < +\infty$, and
\[
  \pi\left(|\varphi(Z) - \mathbb{E}_\pi[\varphi(Z)]| > t\right) \le 2\exp\left(-\frac{t^2}{2 N \Delta^2_{SG}(\mathcal{X})}\right),
\]
where $Z = (Z_1, \cdots, Z_N)$ is an independent sample drawn from $\pi$.
Observe that this result significantly loosens the requirements in McDiarmid’s inequality from bound-
edness to Lipschitz continuity.
Proposition 2 Let $Z = (Z_1, \cdots, Z_N)$ be $N$ i.i.d. random variables with measure $\pi_0$. Suppose the cost functional satisfies Assumption 3. Suppose that the metric probability space $(C, \|\cdot\|_\infty, \pi_0)$ satisfies $\Delta_{SG}(C) < +\infty$. Then for any $F \in \mathcal{F}$ and $\delta > 0$, with probability at least $1 - \delta$, we have
\[
  J(F) \le \frac{1}{N}\sum_{i=1}^N \tilde{J}(Z_i + F) + \mathbb{E}\left[\sup_{G\in\mathcal{F}}\left\{J(G) - \frac{1}{N}\sum_{i=1}^N \tilde{J}(Z_i + G)\right\}\right] + \left(\frac{2\kappa^2\Delta^2_{SG}(C)\log(1/\delta)}{N}\right)^{1/2}. \tag{10}
\]

Remark: We note that the assumption that ∆SG (C) < +∞ is reasonable – for instance, it is satisfied in the
case where π0 is the Wiener measure.
Proof. We start by considering the functional $\varphi : C^N \to \mathbb{R}$ defined as $\varphi(z) := \sup_{F\in\mathcal{F}}\left\{J(F) - \frac{1}{N}\sum_{i=1}^N \tilde{J}(z_i + F)\right\}$, for any $z \in C^N$. Let $z = (z_1, \cdots, z_N) \in C^N$ and $z' = (z'_1, \cdots, z'_N) \in C^N$; the metric distance between these vectors of functions is given by $\|z - z'\| = \sum_{i=1}^N \|z_i - z'_i\|_\infty$. Also, define the sequence of vectors $z^1 = (z'_1, z_2, \cdots, z_N)$, $z^2 = (z'_1, z'_2, z_3, \cdots, z_N)$, $\ldots$, $z^N = (z'_1, z'_2, \ldots, z'_N) \equiv z'$. Using the triangle inequality, it is straightforward to see that
\[
  |\varphi(z) - \varphi(z')| = |\varphi(z) - \varphi(z^1) + \varphi(z^1) - \varphi(z^2) + \cdots + \varphi(z^{N-1}) - \varphi(z')|
  \le |\varphi(z) - \varphi(z^1)| + |\varphi(z^1) - \varphi(z^2)| + \cdots + |\varphi(z^{N-1}) - \varphi(z')|, \tag{11}
\]
where each pair $z^{k-1}$ and $z^k$ differs only by the $k$th element. Let $z^k(i)$ represent the $i$th element of the $k$th vector and $F^* \in \mathcal{F}$ be the function that achieves the supremum in $\varphi(z)$. For any such pair of vectors, we have
\[
  \varphi(z^{k-1}) - \varphi(z^k)
  = \sup_{F\in\mathcal{F}}\left\{J(F) - \frac{1}{N}\sum_{i=1}^N \tilde{J}(z^{k-1}(i) + F)\right\}
  - \sup_{F\in\mathcal{F}}\left\{J(F) - \frac{1}{N}\sum_{i=1}^N \tilde{J}(z^{k-1}(i) + F) + \frac{1}{N}\left(\tilde{J}(z_k + F) - \tilde{J}(z'_k + F)\right)\right\}
\]
\[
  \le \frac{1}{N}\left(\tilde{J}(z_k + F^*) - \tilde{J}(z'_k + F^*)\right) \le \frac{\kappa}{N}\|z_k - z'_k\|_\infty,
\]
where the last inequality follows from Assumption 3. Consequently, substituting this into (11) we have
\[
  |\varphi(z) - \varphi(z')| \le |\varphi(z) - \varphi(z^1)| + |\varphi(z^1) - \varphi(z^2)| + \cdots + |\varphi(z^{N-1}) - \varphi(z')|
  \le \frac{\kappa}{N}\left(\|z_1 - z'_1\|_\infty + \|z_2 - z'_2\|_\infty + \cdots + \|z_N - z'_N\|_\infty\right) = \frac{\kappa}{N}\|z - z'\|. \tag{12}
\]
In other words, the functional $\varphi$ is $\frac{\kappa}{N}$-Lipschitz continuous.


Now, by hypothesis we have $\Delta^2_{SG}(C) < +\infty$, and therefore applying Theorem 1 we have
\[
  P\left(\varphi - \mathbb{E}[\varphi] > t\right) = P\left(\tfrac{N}{\kappa}\left(\varphi - \mathbb{E}[\varphi]\right) > \tfrac{N}{\kappa}\, t\right) \le \exp\left(-\frac{N t^2}{2\kappa^2\Delta^2_{SG}(C)}\right).
\]
Now, for any $\delta > 0$, $\exp\left(-\frac{N t^2}{2\kappa^2\Delta^2_{SG}(C)}\right) \le \delta$ implies that $t \ge \left(\frac{2\kappa^2\Delta^2_{SG}(C)\log(1/\delta)}{N}\right)^{1/2}$. Hence, with probability at least $1 - \delta$, we have $\varphi < \mathbb{E}[\varphi] + \left(\frac{2\kappa^2\Delta^2_{SG}(C)\log(1/\delta)}{N}\right)^{1/2}$, which yields the final expression in (10).

Now, our main sample complexity result follows by combining Proposition 1 and Proposition 2. By taking an expectation with respect to $\pi_0$ over (9), it follows that the Gaussian complexity of the function space $\mathcal{F}$ satisfies
\[
  R_N(\mathcal{F}) \le \frac{C\, \mathbb{E}[\|K(Z)\|_p]}{\sqrt{N}}\left(\frac{\mathrm{diam}(\mathcal{F})}{2}\right)^{\frac{\alpha-1}{\alpha}}, \tag{13}
\]
provided $\mathbb{E}_{\pi_0}[\|K(Z)\|_p] < +\infty$; this is a consequence of Assumption 2.
Theorem 2 Let $\mathcal{F} \subseteq C$ satisfy Assumption 6 and suppose that $\log N(\varepsilon, \mathcal{F}, \|\cdot\|_\infty) \le \varepsilon^{-1/\alpha}$ for $\alpha \ge 1$ and $\varepsilon > 0$. Suppose the cost functional $\tilde{J}$ satisfies Assumption 1, Assumption 2 (for some $1 \le p < +\infty$) and Assumption 3. Let $Z = (Z_1, \cdots, Z_N)$ be an i.i.d. sample drawn from $\pi_0$. Then, for any $\delta > 0$ and some $1 \le p < +\infty$, with probability at least $1 - \delta$, for any $F \in \mathcal{F}$ we have
\[
  J(F) \le \frac{1}{N}\sum_{i=1}^N \tilde{J}(Z_i + F) + 2 R_N(\mathcal{F}) + O\left(\sqrt{\frac{\log(1/\delta)}{N}}\right).
\]

Proof. We sketch the proof. By standard considerations (see (Bartlett and Mendelson 2002) for instance), it can be shown that $\mathbb{E}\left[\sup_{F\in\mathcal{F}}\left\{J(F) - \frac{1}{N}\sum_{i=1}^N \tilde{J}(Z_i + F)\right\}\right] \le 2 R_N(\mathcal{F})$. The theorem follows by using this to bound the right-hand side in (10).

Recall from Assumption 5 that the composed functional $\tilde{J}\circ\Gamma$ is $L_{\Gamma,Z}$-Lipschitz continuous. Consequently, the consistency result proved in Theorem 2 holds for the composed functional as well. Theorem 2 yields a uniform convergence (or ‘equiconvergence’) result for $\tilde{J}$ and, in particular, as an immediate consequence we have $|J^* - \tilde{J}^*_N| \xrightarrow{P} 0$ as $N \to \infty$, where $J^* := \inf_{F\in\mathcal{F}} J(F)$ and $\tilde{J}^*_N := \inf_{F\in\mathcal{F}} \frac{1}{N}\sum_{i=1}^N \tilde{J}(Z_i + F)$. Furthermore, let $\Pi^*_N := \arg\inf_{F\in\mathcal{F}} \frac{1}{N}\sum_{i=1}^N \tilde{J}(Z_i + F)$ and $\pi^* := \arg\inf_{F\in\mathcal{F}} J(F)$. Consider two scenarios, $J^* > \tilde{J}^*_N$ and $J^* < \tilde{J}^*_N$. In the former case, $|J^* - \tilde{J}^*_N| < |J(\Pi^*_N) - \tilde{J}^*_N|$. In the latter case, $|J^* - \tilde{J}^*_N| < \left|\frac{1}{N}\sum_{i=1}^N \tilde{J}(Z_i + \pi^*) - J^*\right|$. Therefore,
\[
  |J^* - \tilde{J}^*_N| < \max\left\{|J(\Pi^*_N) - \tilde{J}^*_N|,\ \left|\frac{1}{N}\sum_{i=1}^N \tilde{J}(Z_i + \pi^*) - J^*\right|\right\}
  < |J(\Pi^*_N) - \tilde{J}^*_N| + \left|\frac{1}{N}\sum_{i=1}^N \tilde{J}(Z_i + \pi^*) - J^*\right| \xrightarrow{P} 0
\]
as $N \to \infty$ by Theorem 2.

4 RATE OF CONVERGENCE
Let us introduce further notation to keep our exposition clear. Recall that $\mathcal{F}$ is a compact subspace of the space of continuous functions on $[0, T]$. Suppose $F \in \mathcal{F}$ and that we can generate $N$ independent realizations of the process $\{Z_h(t), t \in [0, T]\}$ with measure $\pi_{0,h}$, having continuous paths with possible non-differentiabilities at the partition points
\[
  0 = t_0 < t_1 < t_2 < \cdots < t_n = T,
\]
where
\[
  h = h(n) := \max\{t_1 - t_0,\, t_2 - t_1,\, \ldots,\, t_n - t_{n-1}\}.
\]
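A minimal sketch of generating approximate driving paths over such a partition is given below (here $\sigma B$ with $\sigma = 1$); the non-uniform grid and the piecewise-linear interpolation between grid points are assumptions made only for illustration.

import numpy as np

rng = np.random.default_rng(0)
T = 1.0
t = np.sort(np.concatenate(([0.0, T], rng.uniform(0.0, T, 30))))   # partition points
h = np.max(np.diff(t))                                              # mesh width h(n)

def sample_Z_h():
    """Brownian values at the partition points, joined by linear interpolation."""
    incs = rng.normal(0.0, np.sqrt(np.diff(t)))
    values = np.concatenate(([0.0], np.cumsum(incs)))
    return lambda s: np.interp(s, t, values)

Z_h = sample_Z_h()
print("mesh h =", h, " Z_h(T) =", Z_h(T))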


Let $\mathcal{F}_n$ denote an $n$-dimensional ($n < \infty$) closed subspace of $\mathcal{F}$ such that elements in $\mathcal{F}$ can be approached by a sequence of elements in $\mathcal{F}_n$, that is, for every $F \in \mathcal{F}$, there exists $\{F_n, n \ge 1\}$, $F_n \in \mathcal{F}_n$, such that $\|F_n - F\| \to 0$. An example of $\mathcal{F}_n$ is the span of the first $n$ Legendre polynomials (Kreyszig 1989, pp. 176) on the interval $[0, T]$. More generally, $\mathcal{F}_n$ can be chosen as the span of the first $n$ elements of any Schauder basis of $\mathcal{F}$. (Recall that a sequence $\{P_j, j \ge 1\}$ of vectors in a normed space $\mathcal{F}$ is called a Schauder basis of $\mathcal{F}$ if for every $F \in \mathcal{F}$ there is a unique sequence $\{a_j, j \ge 1\}$ of scalars such that $\|F - \sum_{j=1}^n a_j P_j\| \to 0$ as $n \to \infty$.) Consequently, we assume that
Assumption 7 The closed finite-dimensional function subspace $\mathcal{F}_n \subset \mathcal{F}$ is such that
\[
  \psi(n) := \sup_{F\in\mathcal{F}} \|F - \Pi_{\mathcal{F}_n}(F)\| = O(g(n)), \tag{14}
\]
where $g(n) \to 0$ as $n \to \infty$.
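As an illustration of the Legendre-basis choice mentioned above, the following sketch measures the sup-norm error of approximating one assumed target $F$ in the span of the first $n$ Legendre polynomials, with a least-squares fit standing in for the projection $\Pi_{\mathcal{F}_n}$ (the target and the fit are illustrative choices, not prescriptions from the paper).

import numpy as np
from numpy.polynomial import legendre

T = 1.0
t = np.linspace(0.0, T, 2001)
F = np.sin(3 * np.pi * t / T) * t              # an illustrative element of F

for n in [2, 4, 8, 16]:
    # least-squares fit in the span of the first n Legendre polynomials,
    # mapped from [0, T] to the natural domain [-1, 1]
    coeffs = legendre.legfit(2 * t / T - 1, F, deg=n - 1)
    Fn = legendre.legval(2 * t / T - 1, coeffs)
    print(n, np.max(np.abs(F - Fn)))           # sup-norm projection error decays in n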
With the above notation in place, the SAA problem (MC-n-OPT) approximating (OPT) is:
\[
  \min\; J_{N,h}(F) := \frac{1}{N}\sum_{j=1}^N \tilde{J}\circ\Gamma(Z_{h,j} + F), \qquad Z_{h,j} \overset{\text{iid}}{\sim} \pi_{0,h},
  \qquad \text{s.t. } F \in \mathcal{F}_n, \tag{MC-n-OPT}
\]

where the measure π0,h approximates the measure π0 . For brevity, we will write Zh, j as Z j in the remainder
of this section. To facilitate a basic result that quantifies the quality of the solution to (MC-n-OPT) as an
estimator to the solution to (OPT) we assume that
Assumption 8 The random functional $F \mapsto \tilde{J}\circ\Gamma(Z + F)$ is convex in $F$.
Observe that the problem in (MC-n-OPT) is obtained by replacing the expectation appearing in (OPT) by a Monte Carlo sum obtained by generating $N$ samples of a process $\{X_h^F(t), t \in [0, T]\}$ that approximates the process $\{X^F(t), t \in [0, T]\}$. We define the following optimal values and optimal solution (sets) corresponding to (OPT) and (MC-n-OPT), the existence of which will become evident:
\[
  J^* := \inf_{F\in\mathcal{F}}\{J(F)\}; \qquad F^* := \arg\inf_{F\in\mathcal{F}}\{J(F)\}; \tag{15}
\]
\[
  J^*_{N,n} := \inf_{F\in\mathcal{F}_n}\{J_{N,h}(F)\}; \qquad F^*_{N,n} := \arg\inf_{F\in\mathcal{F}_n}\{J_{N,h}(F)\}.
\]

It is important that the optimization in (MC-n-OPT) be performed over a finite-dimensional subspace $\mathcal{F}_n$ of $\mathcal{F}$ so as to allow computation using a method such as gradient descent (Nesterov 2004). Also, in (15), notice that we have suppressed the dependence of $J^*_{N,n}$ and $F^*_{N,n}$ on the partition width $h$ used to generate Monte Carlo samples from the measure $\pi_{0,h}$. A result we present shortly will imply that the subspace dimension $n$ and the partition width $h$ bear a certain relationship that can be exploited to maximize the decay rate of the expected optimality gap $\mathbb{E}\left[J^*_{N,n} - J^*\right]$.

4.1 Consistency and Rate of the SAA Estimator


We call any solution $F^*_{N,n}$ to (MC-n-OPT) an SAA estimator of the solution to (OPT). An SAA estimator cannot be obtained in “closed form” in general. However, given that (MC-n-OPT) is a deterministic convex optimization problem over a closed finite-dimensional subspace, one of various existing iterative techniques, e.g., mirror descent (Bubeck 2015), can be used to generate a sequence $\{F_{N,n,k}, k \ge 1\} \subset \mathcal{F}_n$ that converges to an optimal point, that is, $F_{N,n,k} \to F^*_{N,n}$ as $k \to \infty$ for fixed $N, n, h$; a minimal illustrative sketch of such an iterative scheme is given below. Before we present the main result that characterizes the accuracy of $F^*_{N,n,k}$, we state a lemma that will be invoked.
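The following sketch runs such an iterative scheme on the coefficient vector of $F$ in a finite basis; it assumes the Euclidean mirror map $\psi(x) = \|x\|_2^2/2$, for which the recursion instantiated later in (22) reduces to projected subgradient descent, and an illustrative quadratic SAA objective (all assumptions made only for illustration).

import numpy as np

rng = np.random.default_rng(0)
dim, N, k = 5, 200, 500
A = rng.normal(size=(N, dim))
b = rng.normal(size=N)

def saa_objective(x):
    return np.mean((A @ x - b) ** 2)            # illustrative stand-in for J_{N,h}

def subgradient(x):
    return 2.0 * A.T @ (A @ x - b) / N          # here, the gradient of the stand-in

def project(x, radius=2.0):
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else radius * x / nrm   # F_n intersected with D, as a ball

x = np.zeros(dim)
avg = np.zeros(dim)
for j in range(k):
    eta = 0.5 / np.sqrt(j + 1)                  # diminishing step size
    x = project(x - eta * subgradient(x))       # gradient step followed by projection
    avg += x / k                                # averaged iterate, as in (22)
print(saa_objective(avg))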


Lemma 3 Let Assumption 2 and Assumption 6 hold, and suppose there exists $F_0 \in \mathcal{F}$ such that
\[
  \sigma_0^2(h) := \mathrm{Var}(\tilde{J}(X_h^{F_0})) < \infty; \qquad X_h^{F_0} \sim \pi^\Gamma_{F_0,h}. \tag{16}
\]
Then,
\[
  \sup_{F\in\mathcal{F}} \mathrm{Var}(\tilde{J}(X_h^F)) \le \left(\sigma_0(h) + \mathrm{diam}(\mathcal{F})\sqrt{\mathbb{E}[L^2_{\Gamma,Z}]}\right)^2. \tag{17}
\]
Proof. We can write
\[
  \tilde{J}(X_h^F) = \tilde{J}(X_h^{F_0}) + \left(\tilde{J}(X_h^F) - \tilde{J}(X_h^{F_0})\right), \tag{18}
\]
and due to Assumption 5,
\[
  \left|\tilde{J}(X_h^F) - \tilde{J}(X_h^{F_0})\right| \le L_{\Gamma,Z}\, \mathrm{diam}(\mathcal{F}), \tag{19}
\]
where $\mathbb{E}[L^2_{\Gamma,Z}] < \infty$. From (19) we see that
\[
  \mathrm{Var}\left(\tilde{J}(X_h^F) - \tilde{J}(X_h^{F_0})\right) \le \mathbb{E}\left[L^2_{\Gamma,Z}\right] \mathrm{diam}^2(\mathcal{F}). \tag{20}
\]
Use (18) and (20) along with (16) to conclude that the assertion of the lemma holds.

We now present the main rate result governing the solution estimator $F^*_{N,n,k}$ of (MC-n-OPT).
Theorem 3 Let Assumptions 2, 6, 7, and 8 hold, and suppose that the method used to generate paths $X_h^F \sim \pi^\Gamma_{F,h}$ exhibits weak convergence order $\beta$, implying that there exists $\ell_1 < \infty$ such that
\[
  \sup_{F\in\mathcal{F}} \left|\mathbb{E}\left[\tilde{J}(X_h^F)\right] - J(F)\right| \le \ell_1 h^\beta. \tag{21}
\]
Furthermore, suppose mirror descent (Bubeck 2015, pp. 80) is executed for $k$ steps on (MC-n-OPT):
\[
  \nabla\psi(G_{N,n,j+1}) = \nabla\psi(F_{N,n,j}) - \eta\, S_{J_{N,h}}(F_{N,n,j}); \qquad
  F_{N,n,j+1} = \arg\min_{x\in\mathcal{F}_n\cap\mathcal{D}} D_\psi(x, G_{N,n,j+1}); \qquad j = 0, 1, \ldots, k-1;
\]
\[
  F^*_{N,n,k} := \frac{1}{k}\sum_{j=1}^k F_{N,n,j}, \tag{22}
\]
where $\psi : \mathcal{D} \subset \mathcal{F} \to \mathbb{R}$ is a chosen $\rho$-strongly convex mirror map (see Definition 7) with $\mathcal{F}_n \cap \mathcal{D} \neq \emptyset$, the Bregman divergence is
\[
  D_\psi(x, y) := \psi(x) - \left(\psi(y) + \langle\nabla\psi(y), x - y\rangle\right), \quad \forall x, y \in \mathcal{D},
\]
and the step size is
\[
  \eta = \eta_0\, \frac{R}{\bar{K}}\sqrt{\frac{2\rho}{k}}, \qquad \eta_0 \in (0, 1),
\]
where $R^2 := \sup_{x\in\mathcal{F}_n\cap\mathcal{D}} \psi(x) - \psi(F_{N,n,0})$, and $\bar{K} := N^{-1}\sum_{j=1}^N K_{Z_j}$ is the sample mean of the iid Lipschitz constants $K_{Z_j}$, $j = 1, 2, \ldots, N$, appearing in Assumption 2, satisfying
\[
  \sup_{F\in\mathcal{F}} \|S_{J_{N,h}}(F)\|_* \le \bar{K}; \qquad S_{J_{N,h}}(F) \in \partial J_{N,h}(F); \qquad \mathbb{E}[K^2_{Z_j}] < \infty, \tag{23}
\]
where $S_{J_{N,h}}(F)$ is a subgradient and $\partial J_{N,h}(F)$ the subdifferential of the convex functional $J_{N,h}$ at the point $F$. Then, for all $k \ge 1$,
\[
  0 \le \mathbb{E}\left[J(F^*_{N,n,k})\right] - J(F^*) \le \frac{c_1}{\sqrt{k}} + \frac{c_2}{\sqrt{N}} + c_3 h^\beta + c_4\, g(n), \tag{24}
\]
where
\[
  c_1 := \sqrt{\frac{2}{\rho}\,\mathbb{E}[R^2]}\left(\mathrm{Var}(K_Z) + \frac{1}{k}\mathbb{E}[K_Z^2]\right)^{1/2}; \qquad
  c_2 := 3\,\mathrm{diam}(\mathcal{F})\left(\sqrt{\mathbb{E}[L^2_{\Gamma,Z}]} + \sigma_0(h)\right); \qquad
  c_3 := \ell_1; \qquad c_4 := \mathbb{E}[K_Z]. \tag{25}
\]
Proof. Observe that
\[
  0 \le J(F^*_{N,n,k}) - J(F^*) = J(F^*_{N,n,k}) - J_{N,h}(F^*_{N,n,k}) + J_{N,h}(F^*_{N,n,k}) - J_{N,h}(F^*_{N,n}) + J_{N,h}(F^*_{N,n}) - J(F_n^*) + J(F_n^*) - J(F^*)
\]
\[
  \le J_{N,h}(F^*_{N,n,k}) - J_{N,h}(F^*_{N,n}) + \sum_{F\in\{F^*_{N,n,k},\, F^*_{N,n},\, F_n^*\}} |J_{N,h}(F) - J(F)| + J(F_n^*) - J(F^*)
\]
\[
  \le \underbrace{J_{N,h}(F^*_{N,n,k}) - J_{N,h}(F^*_{N,n})}_{\text{opt. error}}
  + \underbrace{\sum_{F\in\{F^*_{N,n,k},\, F^*_{N,n},\, F_n^*\}} \left|J_{N,h}(F) - \mathbb{E}[J_{N,h}(F)]\right|}_{\text{sampling error}}
  + \underbrace{\sum_{F\in\{F^*_{N,n,k},\, F^*_{N,n},\, F_n^*\}} \left|\mathbb{E}[J_{N,h}(F)] - J(F)\right|}_{\text{approx. error}}
  + \underbrace{\|S_J(F_n^*)\|_*\, \|F_n^* - F^*\|}_{\text{proj. error}}, \tag{26}
\]
where the penultimate inequality in (26) follows from rearrangement of terms and the last inequality follows upon using the sub-gradient inequality (3) for the convex functional $J(\cdot)$. Now we quantify (in expectation) each of the error terms appearing on the right-hand side of (26). Applying mirror descent's complexity bound (Bubeck 2015, pp. 80) to the convex functional $J_{N,h}(\cdot)$, whose subgradients are bounded in dual norm by $\bar{K}$, and taking expectations, we get
\[
  0 \le \mathbb{E}\left[J_{N,h}(F^*_{N,n,k}) - J_{N,h}(F^*_{N,n})\right]
  \le \frac{1}{\sqrt{k}}\sqrt{\frac{2}{\rho}\,\mathbb{E}[R^2]}\left(\mathrm{Var}(K_Z) + \frac{1}{k}\mathbb{E}[K_Z^2]\right)^{1/2}. \tag{27}
\]
Next, using Lemma 3 we get the bound on the sampling error in (26):
\[
  \mathbb{E}\left[\sum_{F\in\{F^*_{N,n,k},\, F^*_{N,n},\, F_n^*\}} \left|J_{N,h}(F) - \mathbb{E}[J_{N,h}(F)]\right|\right]
  \le \frac{3}{\sqrt{N}}\,\mathrm{diam}(\mathcal{F})\left(\sqrt{\mathbb{E}[L^2_{\Gamma,Z}]} + \sigma_0(h)\right). \tag{28}
\]
Due to the assumption in (21), we have
\[
  \sup_{F\in\mathcal{F}} \left|\mathbb{E}[J_{N,h}(F)] - J(F)\right| \le \ell_1 h^\beta. \tag{29}
\]
And since $J$ is convex, we see that
\[
  J(F_n^*) - J(F^*) \le \|S_J(F_n^*)\|_*\, \|F_n^* - F^*\| \le \sup_{F\in\mathcal{F}}\|S_J(F)\|_*\, \|F_n^* - F^*\| \le \mathbb{E}[K_Z]\, g(n), \tag{30}
\]
where the last inequality in (30) follows from Assumption 2 and Assumption 7. Now use (27), (28), (29), and (30) to conclude.
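An illustrative reading of (24) (not stated in the paper): if the number of mirror-descent steps satisfies $k \asymp N$, the mesh satisfies $h \asymp N^{-1/(2\beta)}$, and the subspace dimension $n$ is chosen so that $g(n) \asymp N^{-1/2}$, then every term on the right-hand side of (24) is of order $N^{-1/2}$, so that
\[
  \mathbb{E}\bigl[J(F^{*}_{N,n,k})\bigr] - J(F^{*}) = O\!\left(N^{-1/2}\right).
\]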


REFERENCES
Asmussen, S., and P. W. Glynn. 2007. Stochastic Simulation: Algorithms and Analysis. New York, NY:
Springer.
Banholzer, D., J. Fliege, and R. Werner. 2019. “On rates of convergence for sample average approximations
in the almost sure sense and in mean”. Mathematical Programming 191(1):307–345.
Bartlett, P. L., and S. Mendelson. 2002. “Rademacher and Gaussian complexities: Risk bounds and structural
results”. Journal of Machine Learning Research 3(Nov):463–482.
Bose, A. 1998. “Bahadur representation of Mm estimates”. The Annals of Statistics 26(2):771–777.
Boucheron, S., G. Lugosi, and P. Massart. 2013. Concentration inequalities: A nonasymptotic theory of
independence. Oxford University Press.
Bubeck, S. 2015. “Convex Optimization: Algorithms and Complexity”. Foundations and Trends in Machine
Learning 8(3–4):231–358.
Chen, H., and D. D. Yao. 2001. Fundamentals of queueing networks: Performance, asymptotics, and
optimization, Volume 4. Springer.
Dupačová, J., and R. J. B. Wets. 1988. “Asymptotic behavior of statistical estimators and of optimal solutions
of stochastic optimization problems”. The Annals of Statistics 16(4):1517–1549.
Kim, S., R. Pasupathy, and S. G. Henderson. 2014. “A Guide to SAA”. In Encyclopedia of Operations
Research and Management Science, edited by M. Fu, Hillier and Lieberman OR Series. Elsevier.
Kontorovich, A. 2014. “Concentration in unbounded metric spaces and algorithmic stability”. In Proceedings
of the 31st International Conference on Machine Learning, June 22nd–24th, Beijing, China, 28–36.
Kreyszig, E. 1989. Introductory functional analysis with applications. Wiley Classics Library ed. Wiley
classics library. New York: Wiley.
Nesterov, Y. 2004. Introductory Lectures on Convex Optimization: A Basic Course. New York, NY: Springer
Science + Business Media, LLC.
Pasupathy, R. 2010. “On Choosing Parameters in Retrospective-Approximation Algorithms for Stochastic
Root Finding and Simulation Optimization”. Operations Research 58(4):889–901.
Robinson, S. 1996. “Analysis of Sample-path Optimization”. Mathematics of Operations Research 21(3):513–528.
Shapiro, A. 2003. “Monte Carlo sampling methods”. Handbooks in operations research and management
science 10:353–425.
Shapiro, A., D. Dentcheva, and A. Ruszczynski. 2009. Lectures on Stochastic Programming: Modeling
and Theory. 2nd ed. Philadelphia, Pennsylvania: Society for Industrial and Applied Mathematics.
van de Geer, S. 2000. Empirical Processes in M-estimation. Cambridge, UK: Cambridge University Press.

AUTHOR BIOGRAPHY
ZIHE ZHOU is a graduate student in the School of Industrial Engineering at Purdue University. Her re-
search interests lie broadly in applied probability and simulation. Her email address is zhou408@purdue.edu.

HARSHA HONNAPPA is an Associate Professor in the School of Industrial Engineering at Purdue University. His research interests lie broadly in applied probability, stochastic optimization, simulation
methodology and machine learning. His email address is honnappa@purdue.edu and his web page is
http://engineering.purdue.edu/SSL.

RAGHU PASUPATHY is Professor of Statistics at Purdue University. His current research interests
lie broadly in general simulation methodology, stochastic optimization, statistical inference and sta-
tistical computation. Raghu Pasupathy’s email address is pasupath@purdue.edu, and his web page
https://web.ics.purdue.edu/∼pasupath contains links to papers, software codes, and other material.
