
Proceedings of the 2022 Winter Simulation Conference

B. Feng, G. Pedrielli, Y. Peng, S. Shashaani, E. Song, C.G. Corlu, L.H. Lee, E.P. Chew, T. Roeder, and
P. Lendermann, eds.

EMPIRICAL UNIFORM BOUNDS FOR HETEROSCEDASTIC METAMODELING

Yutong Zhang
Xi Chen

Grado Department of Industrial and Systems Engineering


Virginia Tech
1145 Perry Street
Blacksburg, VA 24061, USA

ABSTRACT
This paper proposes pointwise variance estimation-based and metamodel-based empirical uniform bounds
for heteroscedastic metamodeling. Both bounds build on the state-of-the-art nominal uniform bound
available from the literature by accounting for the impact of noise variance estimation. Numerical results
show that the existing nominal uniform bound requires a relatively large number of design points and a
high number of replications to achieve a prescribed target coverage level. The metamodel-based empirical
bound, on the other hand, outperforms the nominal bound and other competing bounds in terms of empirical
simultaneous coverage probability and bound width, especially when the simulation budget is small, whereas
the pointwise variance estimation-based empirical bound is relatively conservative due to its larger width.
When the budget is sufficiently large so that the impact of heteroscedasticity is low, both empirical bounds'
performance approaches that of the nominal bound.

1 INTRODUCTION
Simulation metamodeling has been widely used to model and analyze complex stochastic systems (Santner
et al. 2003). Heteroscedastic metamodeling has received much attention due to the increasing recognition of
the importance of tackling heteroscedastic noise prevalent in simulation models in various applications, e.g.,
inventory control (Binois et al. 2018) and online management of emerging epidemics (Hu and Ludkovski
2017). Various heteroscedastic metamodeling techniques have emerged over the years, and a considerable
number of them are Gaussian process (GP) related. These include, but are not limited to, the Markov
chain Monte Carlo-based fully Bayesian approach (Goldberg et al. 1997), maximum a posteriori-based
GP (Kersting et al. 2007), stochastic kriging (SK, Ankenman et al. 2010), practical heteroscedastic GP
modeling (Binois et al. 2018), and more recently, variational inference-based heteroscedastic GP modeling
(Wang et al. 2019).
A GP-based modeling approach can provide predictive mean and variance values at each given input
point, based on which one can construct a pointwise confidence bound that covers the true function value
at the given input point with a prescribed probability. However, when one is interested in learning the
true mean values at more than one input point, a simultaneous confidence region covering the true mean
values at an arbitrary set of input points with a prescribed high probability is arguably more desirable.
Existing approaches for constructing a simultaneous confidence region typically rely on bootstrapping and
the Bonferroni (Kleijnen and Van Beers 2022) and Šidák corrections (De Brabanter et al. 2010); moreover,
these methods only apply appropriately when there are at most countably many input points. Recently,
methods dedicated to building (nominal) uniform bounds for heteroscedastic metamodeling have emerged,
e.g., from the functional analysis (Kirschner and Krause 2018) and the GP modeling perspectives (Xie
and Chen 2020). These uniform bounds are intended to contain the true mean values at an arbitrary
set of input points with a prescribed high probability. However, such nominal uniform bounds assume




the knowledge of the underlying heteroscedastic noise variances and ignore the impact of using variance
estimates, potentially resulting in undercoverage in practice.
In this paper, we propose two approaches for constructing empirical uniform bounds for heteroscedastic
metamodeling by incorporating heteroscedastic noise variance estimation. When the simulation budget is
relatively small, the empirical bounds outperform the nominal bound in terms of simultaneous coverage
probability and bound width. Their performance approaches that of the nominal bound when the budget
becomes large. Section 2 reviews least squares estimation in reproducing kernel Hilbert space, establishes
its connection to SK, and provides its corresponding nominal uniform bound. Section 3 elaborates on
the proposed empirical uniform bounds for heteroscedastic metamodeling. Section 4 provides numerical
evaluations and some conclusions.

2 REVIEW OF LEAST SQUARES ESTIMATION IN REPRODUCING KERNEL HILBERT SPACE


Given a separable Hilbert space H, let V_0 : H → H denote a positive definite operator, and define the inner
product ⟨·, ·⟩_{V_0} := ⟨·, V_0 ·⟩ with the corresponding norm ‖·‖_{V_0}. To estimate an unknown function f ∈ H of
interest, we generate data {v_i, {y_{ij}}_{j=1}^{n_i}}_{i=1}^{k}, where n_i is the replication number allocated to v_i, such that
y_{ij} = ⟨v_i, f⟩_H + ε_{ij}, j = 1, 2, …, n_i, i = 1, 2, …, k, with ε_{ij} following a sub-Gaussian distribution with variance
proxy ρ_i². Define the operator M : H → R^k such that for all v ∈ H and i = 1, 2, …, k, (Mv)_i = ⟨v_i, v⟩,
and denote its adjoint by M* : R^k → H. Defining the k × k diagonal matrix Σ_ε := diag(ρ_1²/n_1, …, ρ_k²/n_k),
one can obtain the regularized least squares estimator μ := arg min_{f∈H} ‖M f − ȳ‖²_{Σ_ε^{-1}} + ‖f‖²_{V_0}. A closed-form
expression follows as μ = V^{-1} M* Σ_ε^{-1} ȳ, where ȳ = (ȳ_1, ȳ_2, …, ȳ_k)^⊤, ȳ_i = n_i^{-1} Σ_{j=1}^{n_i} y_{ij}, and
V = M* Σ_ε^{-1} M + V_0.
If H is a reproducing kernel Hilbert space (RKHS), a special case of the separable Hilbert space
over R^d with kernel function K : R^d × R^d → R, we have the canonical embedding K_x = K(x, ·) for any
x ∈ X ⊆ R^d, where X is the input space. Let v_i = K_{x_i} ∈ H be the embedding of x_i ∈ X such that
⟨v_i, f⟩_H = ⟨K_{x_i}, f⟩_H = f(x_i). Hence, y_{ij} = f(x_i) + ε_j(x_i), where ε_j(x_i) satisfies Assumption 1 to be stated
later. Denote the design-point set as D = {x_1, x_2, …, x_k} with D ⊆ X. Notably, for V_0 = κI, where
κ > 0 and I is the identity operator, the representer theorem yields the following tractable form of μ:

    μ(x) = ⟨μ, K_x⟩_H = K(x, X)^⊤ (K(X, X) + κΣ_ε)^{-1} ȳ,    (1)

where X = (x_1^⊤, x_2^⊤, …, x_k^⊤)^⊤ denotes the k × d design matrix. By abuse of notation, we use K(X, X) to
denote the k × k kernel matrix across the design points with the element in the ith row and the jth column given
by K(X, X)_{i,j} = K(x_i, x_j); similarly, K(x, X) denotes the k × 1 vector (K(x, x_1), K(x, x_2), …, K(x, x_k))^⊤. In
(1), ȳ = (ȳ(x_1), ȳ(x_2), …, ȳ(x_k))^⊤ denotes the k × 1 vector of the sample averages of simulation outputs,
with ȳ(x_i) = f(x_i) + ε̄^{n_i}(x_i) and ε̄^{n_i}(x_i) = n_i^{-1} Σ_{j=1}^{n_i} ε_j(x_i) denoting the average random noise incurred
at x_i for i = 1, 2, …, k, which is abbreviated to ε̄(x_i) when there is no risk of confusion. The noise
term ε_j(x_i) follows a sub-Gaussian distribution with variance proxy V(x_i) as stated in Assumption 1.
In this case, the noise variance-covariance matrix follows as Σ_ε = diag(V(x_1)/n_1, …, V(x_k)/n_k). By
replacing the operators in the separable Hilbert space with their counterparts in the RKHS, we have
‖K_x‖²_{V^{-1}} = κ^{-1}(K(x, x) − K(x, X)^⊤ (K(X, X) + κΣ_ε)^{-1} K(x, X)).
Assumption 1 For any x ∈ X ⊆ R^d, ε_1(x), ε_2(x), … are assumed to be independent and identically
distributed (i.i.d.) sub-Gaussian random variables with variance proxy V(x), written as ε_j(x) ∼ subG(V(x)),
where sup_{x∈X} V(x) < ∞.
We can connect the aforementioned least squares estimation in RKHS to heteroscedastic GP modeling,
in particular, stochastic kriging. Let the unknown function f ∼ GP(0, τ²K(·, ·)) be a sample from a GP,
where τ² = κ^{-1} denotes the process variance parameter. Assume that the observation noise terms ε_j(x_i),
j = 1, 2, …, n_i, are i.i.d. normally distributed with mean zero and variance V(x_i), for i = 1, 2, …, k. Then
the posterior distribution of f(x) at any input point x ∈ X given the data set is normal with the predictive
mean as in (1) and the predictive variance given by

    σ²(x) = τ²( K(x, x) − K(x, X)^⊤ (K(X, X) + τ^{-2} Σ_ε)^{-1} K(x, X) ).    (2)

Given this connection, the uniform error bound derived for least squares estimation in RKHS can
be expressed in the SK notation, as stated in Lemma 1 below. Notice that the noise distribution
assumption stipulated by SK is more restrictive than Assumption 1. Nevertheless, the uniform bound
derived in the RKHS setting under Assumption 1 can serve as a uniform bound for SK as well.
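To make the plug-in computation concrete, the following minimal Python sketch evaluates the predictive mean (1) and variance (2) with a set of noise variances plugged in. The squared-exponential kernel, its lengthscale, and all function and variable names are illustrative assumptions rather than choices prescribed by the paper.

```python
import numpy as np

def kern(A, B, lengthscale=0.1):
    # Squared-exponential kernel matrix; the kernel family and lengthscale
    # are illustrative assumptions, not prescribed by the paper.
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def sk_predict(x, X, y_bar, V, n, tau2):
    # Predictive mean (1) and variance (2) with noise variances V plugged in.
    K_XX = kern(X, X)                    # k x k kernel matrix K(X, X)
    k_x = kern(X, x[None, :]).ravel()    # k-vector K(x, X)
    A = K_XX + np.diag(V / n) / tau2     # K(X, X) + tau^{-2} Sigma_eps
    w = np.linalg.solve(A, k_x)
    mu = w @ y_bar                       # predictive mean, cf. (1)
    sigma2 = tau2 * (1.0 - k_x @ w)      # predictive variance, cf. (2); K(x, x) = 1 here
    return mu, sigma2
```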
Lemma 1 (Lemma 7 in Kirschner and Krause 2018) Under Assumption 1, the following uniform error
bound for the least squares estimator μ(·) given in (1) holds with probability at least 1 − δ for all k ≥ 1
and ∀x ∈ X:

    |μ(x) − f(x)| ≤ ( √( ln|I_k + τ² Σ_ε^{-1} K(X, X)| − 2 ln δ ) + τ^{-1} ‖f‖_K ) σ(x),

where the factor in parentheses is denoted by β_f(δ), I_k is the k × k identity matrix, and ‖f‖_K is the RKHS
norm associated with the kernel function K.
Remark 1 A uniform bound for the function f(·) follows from Lemma 1 immediately. It holds with
probability at least 1 − δ for all k ≥ 1 and ∀x ∈ X:

    μ(x) − β_f(δ)σ(x) ≤ f(x) ≤ μ(x) + β_f(δ)σ(x).

This uniform bound can only be regarded as nominal, however, because it assumes knowledge of the
noise variances at the design points in μ(·), σ(·), and β_f(δ). In practice, one may replace the true noise
variances with sample variances (Ankenman et al. 2010) or metamodel-based estimates (Wang and Chen
2016; Wang and Chen 2018). Nevertheless, the resulting bounds with variance estimates directly plugged
in may fail to achieve a prescribed coverage probability because they ignore the impact of noise variance
estimation.
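As a companion sketch, the half-width multiplier β_f(δ) of Lemma 1 can be computed as below. The user must supply the true noise variances and (an upper bound on) ‖f‖_K, which is precisely why the resulting band is only nominal; all names here are illustrative assumptions.

```python
import numpy as np

def beta_f(delta, K_XX, Sigma_eps, tau2, f_norm_K):
    # Multiplier of Lemma 1: sqrt(ln|I + tau^2 Sigma^{-1} K| - 2 ln delta) + tau^{-1} ||f||_K.
    k = K_XX.shape[0]
    M = np.eye(k) + tau2 * np.linalg.solve(Sigma_eps, K_XX)
    _, logdet = np.linalg.slogdet(M)          # numerically stable log-determinant
    return np.sqrt(logdet - 2.0 * np.log(delta)) + f_norm_K / np.sqrt(tau2)

# Nominal band at x (Remark 1): [mu - beta_f * sigma, mu + beta_f * sigma],
# with mu and sigma from sk_predict evaluated at the true noise variances.
```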

3 MAIN RESULTS
In this section, we propose two approaches to build empirical uniform bounds for heteroscedastic meta-
modeling that take into account the impact of noise variance estimation at all design points. Section 3.1
introduces the pointwise variance estimation-based approach that incorporates time-uniform upper and lower
bounds for the noise variances at the design points. Section 3.2 adopts the metamodel-based approach that
estimates the noise variance function via GP modeling and leverages the corresponding uniform bound for
the noise variance function.

3.1 Pointwise Variance Estimation-Based Empirical Uniform Bound


This section starts with some necessary definitions for introducing time-uniform bounds for pointwise
noise variance estimation in Lemmas 2 and 3. Theorem 1 provides the proposed pointwise variance
estimation-based empirical uniform bound.
Definition 1 (Page 25 in Wainwright 2019) A random variable X ∈ R is said to be sub-exponential with
two given parameters v, c > 0, denoted as subE(v, c), if the following condition is satisfied:

    E e^{λX} ≤ exp(λ²v²/2), ∀λ : |λ| < 1/c,  or  ln E e^{λX} ≤ (−ln(1 − cλ) − cλ)v/c², ∀λ ∈ [0, 1/c).    (3)
Definition 2 (Definition 1, sub-ψ condition in Howard et al. 2021) Let (S_j)_{j=0}^∞ and (V_j)_{j=0}^∞ be real-valued scalar
processes adapted to an underlying filtration (F_j)_{j=0}^∞ with S_0 = V_0 = 0 and V_j ≥ 0 for all j. For a function
ψ : [0, λ_max) → R, we say (S_j) is sub-ψ with variance process (V_j) if, for each λ ∈ [0, λ_max), there exists a
supermartingale (L_j(λ))_{j=0}^∞ with respect to (F_j) such that EL_0(λ) ≤ 1 and exp(λS_j − ψ(λ)V_j) ≤ L_j(λ)
almost surely for all j.
The ψ functions can be specified for different distributions such as the sub-exponential, sub-Gaussian, and
sub-Gamma, which are interchangeable under some conditions (Howard et al. 2021). The right-hand side
of the second inequality in (3) is an example of ψ_E, the ψ function for the sub-exponential distribution with
parameters c and v. Such functions are used in computing pointwise uniform bounds for the noise variance V(x) at
each design point x in Lemmas 2 and 3.
Lemma 2 (Theorem 1, the polynomial stitched bound in Howard et al. 2021) Under Assumption 1, the
following time-uniform upper bound for the noise variance V(x) at x ∈ D holds:

    P( ∀n ≥ max{2, 1 + S_δ(v)} : V(x) ≤ Σ_{i=1}^{n} (y_i(x) − ȳ_n(x))² / (n − 1 − S_δ(v)) ) ≥ 1 − δ,    (4)

where δ ∈ (0, 1), S_δ(v) = k_1 √(v ℓ_δ(v)) + c k_2 ℓ_δ(v), ℓ_δ(v) = s ln ln(ηv/m) + ln(ζ(s)/(δ ln^s η)), k_1 = (η^{1/4} +
η^{-1/4})/√2, k_2 = (√η + 1)/2, ζ(s) is the Riemann zeta function, η > 1, m > 0, and s > 1. Similarly, the
following time-uniform lower bound for the noise variance V(x) at any x ∈ D holds:

    P( ∀n ≥ 2 : V(x) ≥ Σ_{i=1}^{n} (y_i(x) − ȳ_n(x))² / (n − 1 + S_δ(v)) ) ≥ 1 − δ.    (5)
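To illustrate, a minimal sketch of the stitched boundary and the resulting lower variance bound (5) follows; v = 32(n − 1) comes from Lemma 5 in Appendix A, scipy's zeta is used for ζ(s), and all function names are illustrative assumptions.

```python
import numpy as np
from scipy.special import zeta

def stitched_S(delta, v, c=4.0, eta=2.0, s=1.4, m=1.0):
    # Polynomial stitched boundary S_delta(v) of Lemma 2 (Howard et al. 2021).
    k1 = (eta**0.25 + eta**-0.25) / np.sqrt(2.0)
    k2 = (np.sqrt(eta) + 1.0) / 2.0
    ell = s * np.log(np.log(eta * v / m)) + np.log(zeta(s) / (delta * np.log(eta)**s))
    return k1 * np.sqrt(v * ell) + c * k2 * ell

def stitched_lower_bound(y, delta):
    # Time-uniform lower bound (5) for V(x) from the n >= 2 replications in y.
    n = len(y)
    v = 32.0 * (n - 1)                  # sub-exponential scale from Lemma 5
    ss = np.sum((y - y.mean())**2)
    return ss / (n - 1 + stitched_S(delta, v))
```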

Lemma 3 (Propositions 6 and 9, the conjugate mixture bound in Howard et al. 2021) Under Assumption 1,
the following time-uniform upper bound for the noise variance V(x) at x ∈ D holds:

    P( ∀n ≥ max{2, 1 + NM_δ(v)} : V(x) ≤ Σ_{i=1}^{n} (y_i(x) − ȳ_n(x))² / (n − 1 − NM_δ(v)) ) ≥ 1 − δ,    (6)

where δ ∈ (0, 1), NM_δ(v) = √( 2(v + ρ) ln( (2δ)^{-1} √((v + ρ)ρ^{-1}) + 1 ) ), ρ > 0, and v = 32(n − 1). On the
other hand, the following time-uniform lower bound for the noise variance V(x) at x ∈ D holds:

    P( ∀n ≥ 2 : V(x) ≥ Σ_{i=1}^{n} (y_i(x) − ȳ_n(x))² / (n − 1 + GE_δ(v)) ) ≥ 1 − δ,

where GE_δ(v) = sup{s ≥ 0 : m(s, v) < δ^{-1}},

    m(s, v) = [ (ρ/c²)^{ρ/c²} Γ((v + ρ)/c²) γ((v + ρ)/c², (cs + v + ρ)/c²) ] / [ Γ(ρ/c²) γ(ρ/c², ρ/c²) ((cs + v + ρ)/c²)^{(v+ρ)/c²} ] × exp((cs + v)/c²),

Γ(·) and γ(·, ·) denote the gamma and lower incomplete gamma functions, c = 4, v = 32(n − 1), ρ > 0,
and s ≥ 0.


The proofs of Lemmas 2 and 3 rely on Lemma 5 (in Appendix A) and follow from Example 1 and
the proofs of Theorem 1, Proposition 6 (upper bound), and Proposition 9 (lower bound) given in Howard
et al. (2021). For the sake of brevity, we omit them here.
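Analogously, a sketch of the normal (conjugate) mixture upper bound (6) is given below; the replication check reflects the requirement that n − 1 − NM_δ(v) be positive (see Remark 2 below), and the names are again illustrative.

```python
import numpy as np

def normal_mixture_NM(delta, v, rho=5.0):
    # Conjugate (normal) mixture boundary NM_delta(v) of Lemma 3.
    return np.sqrt(2.0 * (v + rho)
                   * np.log(np.sqrt((v + rho) / rho) / (2.0 * delta) + 1.0))

def mixture_upper_bound(y, delta, rho=5.0):
    # Time-uniform upper bound (6) for V(x); needs n - 1 > NM_delta(v).
    n = len(y)
    v = 32.0 * (n - 1)
    nm = normal_mixture_NM(delta, v, rho)
    if n - 1 <= nm:
        raise ValueError("too few replications for the upper bound in (6)")
    ss = np.sum((y - y.mean())**2)
    return ss / (n - 1 - nm)
```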
Remark 2 As stated in Lemmas 2 and 3, special restrictions on the number of replications at each design
point are needed when constructing the pointwise noise variance upper bounds according to (4) and (6)
because n − 1 − S_δ(v) and n − 1 − NM_δ(v) must be positive to ensure the correct inequality directions. Since
S_δ(v) and NM_δ(v) depend on c and v, which are determined by the stipulated noise distribution assumption,
there is a minimum number of replications at each design point implicitly required for constructing these
pointwise time-uniform noise variance upper bounds.


Remark 3 A common assumption stipulated on the noise in the literature is that the ε_j(x)'s are i.i.d.
normal (Howard et al. 2021), which is more restrictive than Assumption 1. The normality of the ε_j(x)'s leads
to lower values of c and v compared to the sub-Gaussianity stipulated in Assumption 1; consequently,
fewer replications at each design point are required for constructing appropriate pointwise time-uniform
noise variance upper bounds under normality.
In light of Lemmas 2 and 3 and Lemma 6 in Appendix A, we are now in a position to construct
empirical uniform bounds for the function of interest.
Theorem 1 Denote μ̂(·) as the empirical predicted mean function obtained by using Σ̂_ε = diag(V̂(x_1)/n_1,
V̂(x_2)/n_2, …, V̂(x_k)/n_k) in lieu of Σ_ε in (1). Under Assumption 1, the following bound holds with probability
at least 1 − δ_1 − δ_2 − δ_3 for all k ≥ 1 and ∀x ∈ X:

    |μ̂(x) − f(x)| ≤ τ^{-1} ‖f‖_K σ̂(x) + β̂_k σ̄(x),

where δ_1 is the error probability level for the nominal uniform bound of f(·), δ_2 and δ_3 are respectively the
error probability levels allocated for building the lower and upper time-uniform bounds of the noise variances
at all design points,

    σ̂(x) = τ √( K(x, x) − K(x, X)^⊤ (K(X, X) + τ^{-2} Σ̂_ε)^{-1} K(x, X) ),  Σ̂_ε = diag(V̂(x_1)/n_1, …, V̂(x_k)/n_k),

    β̂_k = √( ln det(I_k + τ² Σ̲_ε^{-1} K(X, X)) − 2 ln δ_1 ),  Σ̲_ε = diag(V̲(x_1)/n_1, …, V̲(x_k)/n_k),    (7)

    σ̄(x) = τ √( K(x, x) − K(x, X)^⊤ (K(X, X) + τ^{-2} Σ̄_ε)^{-1} K(x, X) ),  Σ̄_ε = diag(V̄(x_1)/n_1, …, V̄(x_k)/n_k),    (8)

V̂(x_i) is the sample variance at x_i, i = 1, 2, …, k, and V̲(x_i) and V̄(x_i) are respectively the time-uniform
lower and upper bounds for V(x_i) given in (5) and (6).

Proof. Denote A = K(X, X) + τ^{-2} Σ_ε and Â = K(X, X) + τ^{-2} Σ̂_ε, so that we can write μ̂(x) = K(x, X)^⊤ Â^{-1} ȳ =
K(x, X)^⊤ Â^{-1}(f + ε̄), where f = (f(x_1), f(x_2), …, f(x_k))^⊤ and ε̄ = (ε̄(x_1), ε̄(x_2), …, ε̄(x_k))^⊤ denote the
k × 1 vectors of true function values and of average noise terms at the k design points. Hence, we have

    |μ̂(x) − f(x)| = |K(x, X)^⊤ Â^{-1}(f + ε̄) − f(x)| ≤ |K(x, X)^⊤ Â^{-1} f − f(x)| + |K(x, X)^⊤ Â^{-1} ε̄| =: U_a + U_b.    (9)

Following Appendix C of Chowdhury and Gopalan (2017), the term U_a in (9) can be upper bounded
as follows:

    U_a = |φ(x)^⊤ f − φ(x)^⊤ Φ_k^⊤ (Φ_k Φ_k^⊤ + τ^{-2} Σ̂_ε)^{-1} Φ_k f|
        = |φ(x)^⊤ f − φ(x)^⊤ (Φ_k^⊤ Φ_k + τ^{-2} Σ̂_ε)^{-1} Φ_k^⊤ Φ_k f|
        = |φ(x)^⊤ (Φ_k^⊤ Φ_k + τ^{-2} Σ̂_ε)^{-1} τ^{-2} Σ̂_ε f|
        ≤ ‖φ(x)^⊤ (Φ_k^⊤ Φ_k + τ^{-2} Σ̂_ε)^{-1} Σ̂_ε‖_K τ^{-2} ‖f‖_K
        = ‖f‖_K √( τ^{-2} φ(x)^⊤ (Φ_k^⊤ Φ_k + τ^{-2} Σ̂_ε)^{-1} Σ̂_ε τ^{-2} Σ̂_ε (Φ_k^⊤ Φ_k + τ^{-2} Σ̂_ε)^{-1} φ(x) )
        ≤ ‖f‖_K √( τ^{-2} φ(x)^⊤ (Φ_k^⊤ Φ_k + τ^{-2} Σ̂_ε)^{-1} Σ̂_ε (Φ_k^⊤ Φ_k + τ^{-2} Σ̂_ε)(Φ_k^⊤ Φ_k + τ^{-2} Σ̂_ε)^{-1} φ(x) )
        = ‖f‖_K √( τ^{-2} φ(x)^⊤ (Φ_k^⊤ Φ_k + τ^{-2} Σ̂_ε)^{-1} Σ̂_ε φ(x) )
        = τ^{-1} ‖f‖_K σ̂(x),


where Φ_k and φ(x) are defined in Lemma 6 in Appendix A. The term U_b in (9) can be upper bounded as

    U_b = |K(x, X)^⊤ ((Â^{-1}A − I_k)A^{-1} + A^{-1}) ε̄| ≤ |K(x, X)^⊤ (Â^{-1}A − I_k)A^{-1} ε̄| + |K(x, X)^⊤ A^{-1} ε̄|,    (10)

where the first term on the right-hand side is denoted by U_c. Regarding U_c, Â → A almost surely as
n_i → ∞, ∀i = 1, 2, …, k. By Weyl's eigenvalue perturbation theorem, the eigenvalues of Â^{-1}A are consistent
estimators of those of I_k. Hence, for ∀ε′ > 0, ∃N_0 > 0, such that for all n_i > N_0, i = 1, 2, …, k,
P(A) ≥ 1 − ε′, where A := {|λ_max(Â^{-1}A − I_k)| ≤ ε′}. Hence, we have

    P( U_c ≤ ε′ |K(x, X)^⊤ A^{-1} ε̄| ) ≥ 1 − ε′,    (11)

when n_i is sufficiently large for i = 1, 2, …, k. Plugging (11) back into (10) yields U_b ≤ (1 + ε′)|K(x, X)^⊤ A^{-1} ε̄|.
By the proof of Lemma 1, the following bound holds with probability at least 1 − δ_1 for all k ≥ 1 and
∀x ∈ X:

    |K(x, X)^⊤ A^{-1} ε̄| ≤ β_k(δ_1) σ(x),  β_k(δ_1) := √( ln|I_k + τ² Σ_ε^{-1} K(X, X)| − 2 ln δ_1 ).    (12)

By Lemma 6, β_k(δ_1) can be upper bounded by β̂_k(δ_1) := √( ln|I_k + τ² Σ̲_ε^{-1} K(X, X)| − 2 ln δ_1 ), which is abbreviated
to β̂_k for ease of notation; this upper bound is obtained by replacing Σ_ε with Σ̲_ε := diag(V̲(x_1)/n_1, …,
V̲(x_k)/n_k), where the V̲(x_i)'s are the pointwise time-uniform lower bounds of the noise variances at the k
design points, each obtained with error probability δ_2/k; hence, the confidence level is (1 − δ_2/k)^k ≥ 1 − δ_2.
Similarly, σ(x) can be upper bounded by σ̄(x) := τ √( K(x, x) − K(x, X)^⊤ (K(X, X) + τ^{-2} Σ̄_ε)^{-1} K(x, X) ),
which is obtained by replacing Σ_ε with Σ̄_ε := diag(V̄(x_1)/n_1, …, V̄(x_k)/n_k). The error probability
allocated for constructing the pointwise time-uniform upper bound of the noise variance at each design point is
δ_3/k; hence, the confidence level is (1 − δ_3/k)^k ≥ 1 − δ_3. Finally, by combining the bounds for U_a and U_b with
(9) and taking ε′ → 0, we have |μ̂(x) − f(x)| ≤ τ^{-1} ‖f‖_K σ̂(x) + β̂_k σ̄(x) with probability at least
(1 − δ_1)(1 − δ_2)(1 − δ_3), which is further bounded below by 1 − δ_1 − δ_2 − δ_3.
Remark 4 Recall that Lemmas 2 and 3 provide two types of methods for constructing the pointwise
time-uniform lower and upper bounds of the noise variance at a given design point. Theorem 1 selects the
stitched lower bound (5) because it has a closed form while the conjugate lower bound does not. Meanwhile,
Theorem 1 adopts the conjugate upper bound (6) because it requires fewer replications at each design
point for the time-uniform upper bound construction. A sketch assembling these pieces is given below.
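The following sketch assembles the Theorem 1 half-width from the pieces above (sk_predict, beta_f, stitched_lower_bound, and mixture_upper_bound); the error probability splits δ_2/k and δ_3/k mirror the proof, while the kernel and all hyperparameter names remain illustrative assumptions.

```python
import numpy as np

def empirical_bound_T1(x, X, reps, delta1, delta2, delta3, tau2, f_norm_K):
    # Pointwise variance estimation-based empirical bound of Theorem 1 at x.
    # reps[i] is a numpy array of the n_i simulation outputs observed at x_i.
    k = len(reps)
    n = np.array([len(r) for r in reps], dtype=float)
    y_bar = np.array([r.mean() for r in reps])
    V_hat = np.array([r.var(ddof=1) for r in reps])                       # sample variances
    V_lo = np.array([stitched_lower_bound(r, delta2 / k) for r in reps])  # (5), level delta2/k
    V_up = np.array([mixture_upper_bound(r, delta3 / k) for r in reps])   # (6), level delta3/k

    mu_hat, s2_hat = sk_predict(x, X, y_bar, V_hat, n, tau2)   # mu-hat and sigma-hat^2
    _, s2_bar = sk_predict(x, X, y_bar, V_up, n, tau2)         # sigma-bar^2 uses the upper bounds
    # beta-hat_k uses the variance lower bounds (Lemma 6: the log-det is
    # nonincreasing in the noise variances); drop the ||f||_K term of beta_f.
    beta_hat = beta_f(delta1, kern(X, X), np.diag(V_lo / n), tau2, 0.0)
    half_width = f_norm_K / np.sqrt(tau2) * np.sqrt(s2_hat) + beta_hat * np.sqrt(s2_bar)
    return mu_hat - half_width, mu_hat + half_width
```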

3.2 Metamodel-Based Empirical Uniform Bound


This section proposes a metamodel-based empirical uniform bound. Specifically, we adopt a GP for
modeling the logarithm of the noise variance function and construct the corresponding uniform bound
for the noise variance function. Let r(x) := ln(V̂(x)), which is modeled as r(x) = g(x) + ε_g(x), where
g(x) := ln(V(x)) ∼ GP(0, τ_g² K_g(·, ·)) and ε_g(x) is sub-Gaussian with variance proxy R².

Lemma 4 (Theorem 2 in Chowdhury and Gopalan 2017) Let g : X → R be a member of the RKHS
of real-valued functions on X with kernel K_g. The following uniform error bound for g(·) holds with
probability at least 1 − δ for all k ≥ 1 and ∀x ∈ X:

    |ĝ(x) − g(x)| ≤ ( (R/τ_g) √( ln|I_k + τ_g² R^{-2} K_g(X, X)| + 2 + 2 ln(1/δ) ) + τ_g^{-1} ‖g‖_{K_g} ) σ_g(x),

where the factor in parentheses is denoted by β_g(δ), and ĝ(x) and σ_g²(x) are the predictive mean and the
predictive variance of g(x) at x, respectively given by

    ĝ(x) = K_g(x, X)^⊤ (K_g(X, X) + λ I_k)^{-1} r,

    σ_g²(x) = τ_g²( K_g(x, x) − K_g(x, X)^⊤ (K_g(X, X) + τ_g^{-2} R² I_k)^{-1} K_g(x, X) ),

where r = (r(x_1), r(x_2), …, r(x_k))^⊤ = (ln V̂(x_1), ln V̂(x_2), …, ln V̂(x_k))^⊤.

Remark 5 In light of Lemma 4, the following uniform bound for the noise variance function V(·) holds
with probability at least 1 − δ for all k ≥ 1 and ∀x ∈ X:

    V_l := Ṽ(x)/exp(β_g(δ)σ_g(x)) ≤ V(x) ≤ Ṽ(x) exp(β_g(δ)σ_g(x)) =: V_u,    (13)

where Ṽ(x) := exp(ĝ(x)) is the predictive noise variance at any x ∈ X.
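A sketch of the resulting variance band (13) follows, reusing the kern helper from Section 2. Here τ_g², R², ‖g‖_{K_g}, and the kernel lengthscale are user-supplied modeling assumptions, and the regularizer λ is taken as τ_g^{-2}R² for consistency with σ_g²(x); this choice is an assumption, not prescribed by the paper.

```python
import numpy as np

def variance_band(x, X, V_hat, delta, tau_g2, R2, g_norm, ls_g=0.1):
    # Metamodel-based bounds (V_l, V_u) of (13) for the noise variance at x.
    r = np.log(V_hat)                          # r(x_i) = ln V-hat(x_i)
    Kg = kern(X, X, ls_g)
    kx = kern(X, x[None, :], ls_g).ravel()
    k = len(V_hat)
    lam = R2 / tau_g2                          # lambda = tau_g^{-2} R^2 (assumed)
    g_hat = kx @ np.linalg.solve(Kg + lam * np.eye(k), r)                  # g-hat(x)
    sg2 = tau_g2 * (1.0 - kx @ np.linalg.solve(Kg + lam * np.eye(k), kx))  # sigma_g^2(x)
    _, logdet = np.linalg.slogdet(np.eye(k) + (tau_g2 / R2) * Kg)
    beta_g = (np.sqrt(R2 / tau_g2) * np.sqrt(logdet + 2.0 + 2.0 * np.log(1.0 / delta))
              + g_norm / np.sqrt(tau_g2))      # multiplier of Lemma 4
    V_tilde = np.exp(g_hat)                    # predictive noise variance
    band = np.exp(beta_g * np.sqrt(max(sg2, 0.0)))
    return V_tilde / band, V_tilde * band      # (V_l, V_u) in (13)
```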


Theorem 2 Denote μ̂(·) as the empirical predicted mean function obtained by using Σ̂_ε = diag(Ṽ(x_1)/n_1,
Ṽ(x_2)/n_2, …, Ṽ(x_k)/n_k) in lieu of Σ_ε in (1). Under Assumption 1, the following bound holds with probability
at least 1 − δ_1 − δ_2 for all k ≥ 1 and ∀x ∈ X:

    |μ̂(x) − f(x)| ≤ τ^{-1} ‖f‖_K σ̂(x) + β̂_k σ̄(x),    (14)

where δ_1 denotes the error probability level for the nominal uniform bound of f(·) and δ_2 is the error
probability level for the uniform bound of V(·). The terms σ̂(x), β̂_k, and σ̄(x) are defined as in
Theorem 1 but with V̲(x) and V̄(x) at each design point x ∈ X respectively replaced by V_l and V_u defined
in (13).
Proof. The proof is identical to that of Theorem 1 up to the point of upper bounding the term U_c. According to
Lemma 4, we have for ∀δ_2 > 0, P(B) ≥ 1 − δ_2, where B := {|Ṽ(x_i) − V(x_i)| ≤ Ṽ(x_i)β_{V_i}, i = 1, 2, …, k}
with β_{V_i} = max{exp(β_g(δ_2)σ_g(x_i)) − 1, 1 − exp(−β_g(δ_2)σ_g(x_i))}. For ∀n_i ≥ 2, i = 1, 2, …, k, we have that
B′ := {|Ṽ(x_i) − V(x_i)|/n_i ≤ Ṽ(x_i)β_{V_i}/n_i, i = 1, 2, …, k} holds almost surely on B. Then, for ∀ε′ > 0,
∃N_0 > 0, such that for ∀n_i ≥ N_0, i = 1, 2, …, k, the event A := {|λ_max(Â^{-1}A − I_k)| ≤ ε′} holds almost surely
on B′. Hence, we have P(A) ≥ P(A|B′)P(B′|B)P(B) ≥ 1 × 1 × (1 − δ_2) = 1 − δ_2. The remaining
proof of Theorem 2 proceeds as that of Theorem 1; the only difference is that we adopt the metamodel-based
uniform upper and lower bounds for the noise variance function estimation. By Lemma 6 in Appendix A,
β_k(δ_1) and σ(x) can be simultaneously upper bounded by β̂_k and σ̄(x) as given in (7) and (8) with V̲(x)
and V̄(x) replaced by V_l and V_u defined in (13), with probability at least 1 − δ_2. Combining the error
probabilities allocated to (12) and (13), we have that (14) holds with probability at least (1 − δ_1)(1 − δ_2),
which is further bounded below by 1 − δ_1 − δ_2.

4 NUMERICAL EVALUATIONS
In this section, we adopt the following M/M/1 queueing example to demonstrate the performance of
the empirical uniform bounds proposed in Sections 3.1 and 3.2. The input variable is the arrival rate,
x ∈ X ⊆ (0, 1), and the mean response surface of interest is the steady-state mean number of customers
in the queue, f(x) = x/(1 − x). The noise variance function is V(x) ≈ 2x(1 + x)/(T(1 − x)⁴) for large
simulation run length T, and we set T = 1000 in this example; a sketch of this setup follows.
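For concreteness, a minimal sketch of this test problem is given below. The Gaussian surrogate that stands in for an actual M/M/1 run, the seed, and the function names are illustrative assumptions made purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000.0  # run length in the variance approximation

def f_true(x):
    return x / (1.0 - x)                              # steady-state mean queue length

def V_true(x):
    return 2.0 * x * (1.0 + x) / (T * (1.0 - x)**4)   # noise variance function

def simulate_outputs(x, n):
    # Stand-in for n independent simulation replications at arrival rate x:
    # a Gaussian surrogate with the known mean and variance replaces an
    # actual M/M/1 run here, purely for illustration.
    return f_true(x) + np.sqrt(V_true(x)) * rng.standard_normal(n)
```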
Experimental setup. The input space X = [0.3, 0.9]. The design-point set comprises k equispaced points
in X with the value of k varying in K = {8, 16, 32, 64, 128, 256}. An equal number of replications
ni is used at all k design points and we consider two replication allocation sets: the low-replication set

where ni ∈ {5, 10, 15} and the high-replication set where ni ∈ {45, 55, 65}. Notice that, to be constructed
appropriately, the pointwise uniform bounds for noise variances require that ni be sufficiently large.
Methods in comparison. Two benchmark methods are compared with the proposed pointwise variance
estimation-based empirical uniform bound (denoted by CIp ) and the metamodel-based empirical uniform
bound (denoted by CIm ). The first applies the Bonferroni correction to construct a bound (denoted by
CIb ) that holds simultaneously at all prediction points of interest (Kleijnen and Van Beers 2022). The
other corresponds to the nominal uniform bound for heteroscedastic metamodeling (denoted by CIu ) as
in Lemma 1 but with the unknown noise variances at the design points directly replaced by the corresponding
sample variances.
Implementation configurations. The prediction-point set comprises a grid of N = 1000 equispaced points
in X . For constructing the pointwise variance estimation-based empirical bound, we adopt η = 2, s = 1.4,
and m = 1 in building the lower stitched bounds given in (5), as suggested by Howard et al. (2021). In
building the upper conjugate bounds given in (6), we use ρ = 5, which requires the smallest number of
replications at each design point. The overall error probability level for a given uniform bound of f(·) is
set to 0.05. Regarding the two replication allocation sets, the low-replication set is used for constructing
CIb , CIu , and CIm , and the high-replication set is used for constructing all four bounds.
Evaluation metrics. To assess the four bounds’ performance under each experimental setting, we adopt two
metrics: the empirical simultaneous coverage probability (SCP, Xie and Chen 2020) and the average interval
width (AIW, Lam and Zhang 2021) of a uniform bound, obtained over a total of M = 100 macro-replications.
The SCP is defined as follows:
    SCP = M^{-1} Σ_{m=1}^{M} 1{ f(x_{0,i}) ∈ CI(x_{0,i}) for i = 1, 2, …, N on the mth macro-replication },

where CI(x0,i ) denotes a given bound at prediction point x0,i . The AIW obtained on each macro-replication
is given by
    AIW_m = (2N)^{-1} Σ_{i=1}^{N} ( U_m(x_{0,i}) − L_m(x_{0,i}) ),  m = 1, 2, …, M,
where Um (x0,i ) and Lm (x0,i ) are the upper and lower limits of a given bound at x0,i on the mth macro-replication.
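Both metrics reduce to a few array operations, as the following sketch (with illustrative names) shows; lowers and uppers collect the bound limits over the M macro-replications evaluated on the N-point prediction grid.

```python
import numpy as np

def scp_and_aiw(f_grid, lowers, uppers):
    # f_grid: (N,) true values on the prediction grid;
    # lowers, uppers: (M, N) bound limits over M macro-replications.
    covered = np.all((lowers <= f_grid) & (f_grid <= uppers), axis=1)  # simultaneous coverage
    scp = covered.mean()                                               # empirical SCP
    aiw = 0.5 * (uppers - lowers).mean(axis=1)                         # AIW_m = (2N)^{-1} sum(U - L)
    return scp, aiw
```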
Summary of results. The numerical results of the M/M/1 queueing example are shown in Tables 1 to 4.
Tables 1 and 2 give the SCP results obtained using the low- and the high-replication sets, and Tables 3 and
4 provide a summary of the AIWs.
Table 1 shows the SCPs achieved by CIb , CIu , and CIm using the low-replication set. The following
observations are made. First, the Bonferroni bounds have the lowest SCPs, followed by the nominal uniform
bounds, and the proposed metamodel-based empirical bounds have the best SCPs. Second, given a fixed
number of design points k, the SCPs of all bounds tend to increase with ni ; similarly, given a fixed number
of replications ni , the SCPs tend to increase with k. Lastly, the Bonferroni bounds never achieve the target
level of 0.95, while the nominal uniform bounds can achieve the target level when k and ni are relatively
large. In contrast, the metamodel-based empirical bounds can almost always achieve the target level under
the settings considered in Table 1.
Table 2 shows the SCPs achieved by all four bounds using the high-replication set. We have the
following observations. First, as seen in Table 1, the Bonferroni bounds yield the lowest SCPs, followed
by the nominal uniform bounds and the metamodel-based uniform bounds, and the pointwise variance
estimation-based empirical bounds deliver the highest SCPs. Second, owing to the relatively large number of
replications allocated to each design point, the SCPs of all bounds except the Bonferroni bounds are at or
above the target level. Third, the SCPs for the pointwise variance estimation-based
empirical bounds are always 1; we conjecture that such conservativeness is caused by their large widths,
which will be examined next.


Table 1: The SCPs achieved by the Bonferroni bounds (CIb ), the nominal uniform bounds (CIu ), and the
metamodel-based empirical uniform bounds (CIm ), obtained using the low-replication set.
ni 5 10 15
k CIb CIu CIm CIb CIu CIm CIb CIu CIm
8 0.15 0.46 0.95 0.23 0.67 0.91 0.31 0.74 0.95
16 0.26 0.72 1 0.28 0.8 0.99 0.25 0.85 0.95
32 0.19 0.78 0.98 0.4 0.94 0.99 0.36 0.89 1
64 0.24 0.87 0.99 0.26 0.96 1 0.44 0.98 1
128 0.17 0.89 1 0.31 0.98 1 0.33 0.98 1
256 0.15 0.88 1 0.32 0.96 1 0.36 0.99 1

Table 2: The SCPs achieved by CIb , CIu , CIm , and the pointwise variance estimation-based empirical bound
(CIp ), obtained using the high-replication set.
ni 45 55 65
k CIb CIu CIm CIp CIb CIu CIm CIp CIb CIu CIm CIp
8 0.48 0.96 0.96 1 0.65 1 0.98 1 0.57 0.98 0.99 1
16 0.38 0.95 1 1 0.45 1 1 1 0.41 0.98 0.99 1
32 0.53 0.99 1 1 0.44 1 0.99 1 0.5 1 1 1
64 0.46 0.98 1 1 0.42 0.99 1 1 0.35 1 0.99 1
128 0.36 1 1 1 0.41 1 1 1 0.37 1 1 1
256 0.43 1 1 1 0.46 1 1 1 0.45 1 1 1
Table 3 summarizes the AIWs obtained under the same setting as considered in Table 1. We have the
following observations. First, the Bonferroni bounds have the smallest AIWs, followed by the nominal
uniform bounds and the metamodel-based empirical bounds when using the low-replication set. Second,
as k or ni increases, the AIWs of all bounds show a decreasing trend. Third, the differences in the AIWs
of different bounds diminish as k or ni increases.
Table 4 summarizes the AIWs obtained under the same settings as considered in Table 2. We have
the following observations. First, the Bonferroni bounds have the smallest AIWs, followed by the nominal
uniform bounds and the metamodel-based empirical bounds. The pointwise variance estimation-based
empirical bounds have the largest AIWs, which verifies our conjecture made from Table 2. Second, as k
or ni increases, the AIWs of all four bounds tend to decrease.
In brief, we conclude with the following comments. The Bonferroni bounds yield the worst performance—
their SCPs are typically too low to be satisfactory. The nominal uniform bounds require large k and ni
to achieve the target coverage level. The metamodel-based empirical bounds can almost always achieve
the target level, and as k and ni become large, the difference in their widths from those of the nominal
bounds diminishes. The pointwise variance estimation-based empirical bounds are conservative in giving
high SCPs due to their large widths, especially when k and ni are small.

ACKNOWLEDGMENTS
This paper is based upon work supported by the National Science Foundation [IIS-1849300] and NSF
CAREER [CMMI-1846663].


Table 3: Summary of the mean and standard error (in parentheses) of the AIWs of CIb , CIu , and CIm
obtained using the low-replication set; the symbol “*” indicates that the value in the cell is less than 0.01.
ni          5                                       10                                      15
k     CIb          CIu          CIm           CIb          CIu          CIm           CIb          CIu          CIm
8     0.57 (0.01)  1.08 (0.03)  4.8 (0.40)    0.63 (0.14)  1.35 (0.39)  2.72 (0.48)   0.93 (0.24)  2.2 (0.63)   2.06 (0.41)
16    0.45 (0.01)  0.94 (0.03)  3.85 (0.51)   0.37 (*)     0.8 (0.01)   1.56 (0.04)   0.34 (*)     0.73 (0.01)  1.26 (0.03)
32    0.36 (0.01)  0.81 (0.02)  3.45 (0.59)   0.3 (*)      0.71 (0.01)  1.36 (0.04)   0.26 (*)     0.6 (0.01)   0.98 (0.02)
64    0.27 (*)     0.65 (0.02)  2.67 (0.11)   0.22 (*)     0.53 (0.01)  1.10 (0.03)   0.2 (*)      0.51 (0.01)  0.85 (0.02)
128   0.21 (*)     0.55 (0.02)  2.18 (0.08)   0.17 (*)     0.44 (0.01)  0.89 (0.02)   0.15 (*)     0.39 (*)     0.64 (0.01)
256   0.16 (*)     0.43 (0.01)  1.69 (0.08)   0.13 (*)     0.34 (*)     0.73 (0.02)   0.11 (*)     0.31 (*)     0.50 (0.01)

Table 4: Summary of the mean and standard error (in parentheses) of the AIWs of all four bounds obtained
using the high-replication set; the symbol “*” indicates that the value in the cell is less than 0.01.
ni          45                                                    55                                                    65
k     CIb         CIu          CIm          CIp           CIb         CIu          CIm          CIp           CIb          CIu          CIm          CIp
8     1 (0.3)     2.67 (0.85)  1.67 (0.49)  3.37 (0.89)   1.68 (0.4)  4.65 (1.13)  2.91 (0.80)  5.29 (1.17)   1.13 (0.31)  3.11 (0.9)   3.37 (0.86)  3.57 (0.93)
16    0.23 (*)    0.56 (0.01)  0.68 (0.01)  1.14 (0.02)   0.22 (*)    0.53 (0.01)  0.62 (0.01)  0.94 (0.01)   0.2 (*)      0.51 (0.01)  0.58 (0.01)  0.82 (0.01)
32    0.18 (*)    0.46 (0.01)  0.55 (0.01)  1.06 (0.01)   0.16 (*)    0.42 (*)     0.48 (0.01)  0.79 (0.01)   0.15 (*)     0.4 (*)      0.44 (*)     0.68 (0.01)
64    0.13 (*)    0.36 (*)     0.42 (*)     0.94 (0.01)   0.12 (*)    0.33 (*)     0.37 (*)     0.68 (0.01)   0.11 (*)     0.31 (*)     0.35 (*)     0.55 (*)
128   0.1 (*)     0.27 (*)     0.33 (*)     0.91 (0.01)   0.09 (*)    0.26 (*)     0.29 (*)     0.58 (0.01)   0.09 (*)     0.25 (*)     0.27 (*)     0.46 (*)
256   0.07 (*)    0.22 (*)     0.26 (*)     1.12 (0.01)   0.07 (*)    0.2 (*)      0.23 (*)     0.5 (*)       0.06 (*)     0.19 (*)     0.22 (*)     0.39 (*)


A APPENDIX
Lemma 5 Under Assumption 1, given x ∈ D, for ∀n ≥ 2, define S_n = V(x)^{-1} Σ_{i=1}^{n} (y_i(x) − ȳ_n(x))² − (n − 1).
Then S_n is sub-exponential with its cumulant generating function (CGF) upper bounded as follows:

    ln E e^{λ S_n} ≤ ψ_E := (−ln(1 − cλ) − cλ) v / c²,  ∀λ ∈ [0, 1/c),

where c = 4 and v = 32(n − 1).

Proof. Rewrite ε_i(x) = √(V(x)) Z_i, where the Z_i's are i.i.d. subG(1). For n ≥ 2, S_n = V(x)^{-1} Σ_{i=1}^{n} (y_i(x) −
ȳ_n(x))² − (n − 1) = V(x)^{-1} Σ_{i=1}^{n} (ε_i(x) − ε̄_n(x))² − (n − 1) = Σ_{i=1}^{n} (Z_i − Z̄_n)² − (n − 1), where ȳ_n(x) = n^{-1} Σ_{i=1}^{n} y_i(x),
ε̄_n(x) = n^{-1} Σ_{i=1}^{n} ε_i(x), and Z̄_n = n^{-1} Σ_{i=1}^{n} Z_i. Direct calculations yield ΔS_n = S_n − S_{n−1} = ((n − 1)/n)(Z_n − Z̄_{n−1})² −
1 =: Y_n² − 1, where Y_n := √((n − 1)/n) (Z_n − Z̄_{n−1}) for n ≥ 2. Notice that Z_n ∼ subG(1) is independent of
Z̄_{n−1} ∼ subG((n − 1)^{-1}). Hence, for all 2 ≤ j ≤ n, Y_j ∼ subG(1), which leads to ΔS_j ∼ subE(32, 4). It
follows that

    E(exp(λ S_n)) = E(exp(λ(ΔS_2 + ΔS_3 + ··· + ΔS_n)))
    ≤ [E(exp(λ(n − 1) ΔS_2))]^{1/(n−1)} ··· [E(exp(λ(n − 1) ΔS_n))]^{1/(n−1)}
    ≤ [exp((1/2) λ² (32(n − 1))²)]^{1/(n−1)} ··· [exp((1/2) λ² (32(n − 1))²)]^{1/(n−1)}
    = exp((1/2) λ² (32(n − 1))²),  ∀λ ∈ [0, 1/4),

where the first inequality follows from the generalized Hölder's inequality. Hence, S_n is subE(32(n − 1), 4).
According to (3), we see that the CGF of S_n is upper bounded by ψ_E with c = 4 and v = 32(n − 1).
Lemma 6 Denote h(V(x_1), …, V(x_k)) = ln(det(I_k + τ² Σ_ε^{-1} K(X, X))), where recall that Σ_ε = diag(V(x_1)/n_1,
…, V(x_k)/n_k). Then h is nonincreasing in V(x_i), i = 1, 2, …, k.

Proof. The proof is in the same vein as that in Appendix B of Chowdhury and Gopalan (2017). Denote
B_k = I + τ² Φ_k^⊤ Σ_ε^{-1} Φ_k, where I is the identity operator, Φ_k = (φ(x_1), …, φ(x_k))^⊤ is a k × ∞ matrix which
satisfies K(X, X) = Φ_k Φ_k^⊤, and φ(x) equals K(x, ·), an operator mapping any input point x in X to H
associated with the kernel function K. It follows that

    B_k Φ_k^⊤ = Φ_k^⊤ + τ² Φ_k^⊤ Σ_ε^{-1} Φ_k Φ_k^⊤ = Φ_k^⊤ (I_k + τ² Σ_ε^{-1} Φ_k Φ_k^⊤) = Φ_k^⊤ (I_k + τ² Σ_ε^{-1} K(X, X)).

By det(AB) = det(A) det(B) for two matrices A and B with compatible dimensions, we have det(B_k) det(Φ_k^⊤) =
det(Φ_k^⊤) · det(I_k + τ² Σ_ε^{-1} K(X, X)). Hence, it follows that

    det(B_k) = det(I_k + τ² Σ_ε^{-1} K(X, X)).    (15)

Plugging (15) into h(V(x_1), …, V(x_k)) yields

    h(V(x_1), …, V(x_k)) = ln(det(B_k)) = ln(det(B_{k−1} + τ² (n_k/V(x_k)) φ(x_k) φ(x_k)^⊤))
    = ln( det(B_{k−1}) (1 + τ² (n_k/V(x_k)) ‖φ(x_k)‖²_{B_{k−1}^{-1}}) )
    = ln( det(B_0) Π_{i=1}^{k} (1 + τ² (n_i/V(x_i)) ‖φ(x_i)‖²_{B_{i−1}^{-1}}) ) = Σ_{i=1}^{k} ln(1 + τ² (n_i/V(x_i)) K_{i−1}(x_i, x_i)),

where B_0 = I and K_{i−1}(x_i, x_i) is the predictive variance at x_i obtained using all observations up to the
(i − 1)th design point. Hence, h(V(x_1), …, V(x_k)) is nonincreasing in the V(x_i)'s.

REFERENCES
Ankenman, B., B. L. Nelson, and J. Staum. 2010. “Stochastic kriging for simulation metamodeling”. Operations Research 58:371–
382.
Binois, M., R. B. Gramacy, and M. Ludkovski. 2018. “Practical heteroscedastic Gaussian process modeling for large simulation
experiments”. Journal of Computational and Graphical Statistics 27:808–821.
Chowdhury, S. R., and A. Gopalan. 2017. “On kernelized multi-armed bandits”. In International Conference on Machine
Learning, 844–853. PMLR.
De Brabanter, K., J. De Brabanter, J. A. Suykens, and B. De Moor. 2010. “Approximate confidence and prediction intervals
for least squares support vector regression”. IEEE Transactions on Neural Networks 22:110–120.
Goldberg, P., C. Williams, and C. Bishop. 1997. “Regression with input-dependent noise: A Gaussian process treatment”.
Advances in Neural Information Processing Systems 10:1–7.
Howard, S. R., A. Ramdas, J. McAuliffe, and J. Sekhon. 2021. “Time-uniform, nonparametric, nonasymptotic confidence
sequences”. The Annals of Statistics 49:1055–1080.
Hu, R., and M. Ludkovski. 2017. “Sequential design for ranking response surfaces”. SIAM/ASA Journal on Uncertainty
Quantification 5:212–239.
Kersting, K., C. Plagemann, P. Pfaff, and W. Burgard. 2007. “Most likely heteroscedastic Gaussian process regression”. In
Proceedings of the 24th International Conference on Machine Learning, 393–400. Corvallis, OR: International Machine
Learning Society.
Kirschner, J., and A. Krause. 2018. “Information directed sampling and bandits with heteroscedastic noise”. In Proceedings of
Machine Learning Research, 358–384. PMLR.
Kleijnen, J. P., and W. C. Van Beers. 2022. “Statistical tests for cross-validation of kriging models”. INFORMS Journal on
Computing 34:607–621.
Lam, H., and H. Zhang. 2021. “Neural predictive intervals for simulation metamodeling”. In Proceedings of the 2021 Winter
Simulation Conference, edited by S. Kim, B. Feng, K. Smith, S. Masoud, Z. Zheng, C. Szabo, and M. Loper. Phoenix, AZ:
Institute of Electrical and Electronics Engineers, Inc.
Santner, T. J., B. J. Williams, and W. I. Notz. 2003. The design and analysis of computer experiments. New York: Springer.
Wainwright, M. J. 2019. High-dimensional statistics: A non-asymptotic viewpoint, Volume 48. Cambridge University Press.
Wang, W., N. Chen, X. Chen, and L. Yang. 2019. “A variational inference-based heteroscedastic Gaussian process approach
for simulation metamodeling”. ACM Transactions on Modeling and Computer Simulation 29:1–22.
Wang, W., and X. Chen. 2016. “The effects of estimation of heteroscedasticity on stochastic kriging”. In Proceedings of the
2016 Winter Simulation Conference, edited by T. M. Roeder, P. I. Frazier, R. Szechtman, and E. Zhou, 326–337. Institute
of Electrical and Electronics Engineers, Inc.
Wang, W., and X. Chen. 2018. “An adaptive two-stage dual metamodeling approach for stochastic simulation experiments”.
IISE Transactions 50:820–836.
Xie, G., and X. Chen. 2020. “Uniform error bounds for stochastic kriging”. In Proceedings of the 2020 Winter Simulation
Conference, edited by K.-H. Bae, B. Feng, S. Kim, S. Lazarova-Molnar, Z. Zheng, T. Roeder, and R. Thiesing, 361–372.
Institute of Electrical and Electronics Engineers, Inc.

AUTHOR BIOGRAPHIES
YUTONG ZHANG is a Ph.D. candidate in the Grado Department of Industrial and Systems Engineering at Virginia Tech. Her
research interest lies in machine learning, stochastic modeling, and simulation methodology. Her email address is yutongz@vt.edu
and her webpage is https://yutong2018.github.io/.

XI CHEN is an associate professor in the Grado Department of Industrial and Systems Engineering at Virginia Tech. Her research
interests include simulation modeling and analysis, applied probability and statistics, computer experiment design and analysis, and
simulation optimization. Her email address is xchen6@vt.edu and her web page is https://sites.google.com/vt.edu/xi-chen-ise/home.
