Optimal nonparametric change point analysis
O. H. Madrid Padilla
Yi Yu
Daren Wang
Alessandro Rinaldo
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1155
1.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . 1156
1.2 Summary of results . . . . . . . . . . . . . . . . . . . . . . . . . 1158
2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1159
3 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1160
3.1 The consistency of the Kolmogorov–Smirnov detector algorithm 1160
3.2 Phase transition and minimax optimality . . . . . . . . . . . . 1163
3.3 Choice of tuning parameters . . . . . . . . . . . . . . . . . . . . 1164
4 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 1167
4.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1167
4.2 Real data analysis . . . . . . . . . . . . . . . . . . . . . . . . . 1171
A Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1173
B Proof of Theorem 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 1175
C Proofs of Section 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 1193
D Proof of Theorem 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 1194
E Sensitivity simulations . . . . . . . . . . . . . . . . . . . . . . . . . . 1196
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1197
1. Introduction
In this section we describe the change point model that we will consider. Our
notation and settings are fairly standard, with one crucial difference from most
of the contributions in the field: the changes in the underlying distribution at
the change points are not parametrically specified, but are instead quantified
through a nonparametric measure of distance between distributions. This fea-
ture renders our methods and analysis applicable to a wide range of change
point problems.
Condition 1.1 (Model). Let $\{Y_{t,i} : t = 1, \dots, T,\ i = 1, \dots, n_t\} \subset \mathbb{R}$ be a collection of independent random variables such that $Y_{t,i} \sim F_t$, where $F_1, \dots, F_T$ are cumulative distribution functions.
Let $\{\eta_k\}_{k=0}^{K+1} \subset \{0, 1, \dots, T\}$ be a collection of unknown change points with $1 = \eta_0 < \eta_1 < \dots < \eta_K \le T < \eta_{K+1} = T + 1$ such that $F_t \neq F_{t-1}$ if and only if $t \in \{\eta_1, \dots, \eta_K\}$.
The minimal spacing $\delta$ and the jump size $\kappa$ are defined respectively as
$$\delta = \min_{k=1,\dots,K+1} \{\eta_k - \eta_{k-1}\} > 0$$
and
$$\kappa = \min_{k=1,\dots,K} \kappa_k = \min_{k=1,\dots,K}\ \sup_{z \in \mathbb{R}} \bigl| F_{\eta_k}(z) - F_{\eta_k - 1}(z) \bigr| > 0. \tag{1.1}$$
We will show that, under Condition 1.1, the hardness of the change point localization task is fully captured by a signal-to-noise ratio combining the jump size $\kappa$, the minimal spacing $\delta$ and the sample sizes, formalized in Condition 3.1 below.
We point out that, although in deriving the theoretical guarantees for our
methodologies we follow techniques proposed in existing work, namely Venkatraman (1992) and Fryzlewicz (2014), our results deliver improvements in two
aspects. Firstly, the extension to nonparametric settings, in which the mag-
nitude of the distributional changes is measured by the Kolmogorov–Smirnov
distance, requires novel and nontrivial arguments, especially to quantify the or-
der of the stochastic fluctuations of the associated CUSUM statistics. Secondly,
the arguments used in Fryzlewicz (2014) for the theoretical analysis of the per-
formance of the WBS algorithm have to be sharpened in order to allow for all
the model parameters to vary as the sample size diverges and in order to yield
optimal localization rates.
2. Methodology
For any integer pair $(s, e)$ with $1 \le s < e \le T$ and any $z \in \mathbb{R}$, write
$$F_{s:e}(z) = \frac{1}{n_{s:e}} \sum_{t=s}^{e} \sum_{i=1}^{n_t} \mathbf{1}\{Y_{t,i} \le z\} \qquad \text{and} \qquad n_{s:e} = \sum_{t=s}^{e} n_t.$$
The CUSUM Kolmogorov–Smirnov statistic is then defined as
$$D^{t}_{s,e}(z) = \Bigl( \frac{n_{s:t}\, n_{(t+1):e}}{n_{s:e}} \Bigr)^{1/2} \bigl| F_{s:t}(z) - F_{(t+1):e}(z) \bigr|, \qquad D^{t}_{s,e} = \sup_{z \in \mathbb{R}} D^{t}_{s,e}(z).$$
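The statistic $D^{t}_{s,e}$ can be computed directly from its definition. The sketch below (an illustration for the special case of one observation per time point, $n_t \equiv 1$; function and variable names are ours) exploits the fact that the empirical distribution functions are step functions, so the supremum over $z$ can be evaluated at the sample points only:

```python
import math

def cusum_ks(y, s, e, t):
    """
    CUSUM Kolmogorov-Smirnov statistic D_{s,e}^t = sup_z D_{s,e}^t(z) for a
    univariate series y with one observation per time point (n_t = 1).
    Time indices are 1-based, as in the paper, and mapped to 0-based Python
    slices: left block Y_s..Y_t, right block Y_{t+1}..Y_e.
    """
    left = y[s - 1:t]
    right = y[t:e]
    n_l, n_r = len(left), len(right)
    weight = math.sqrt(n_l * n_r / (n_l + n_r))
    best = 0.0
    # Empirical CDFs are step functions, so the supremum over z is attained
    # at one of the sample points.
    for z in y[s - 1:e]:
        f_left = sum(v <= z for v in left) / n_l
        f_right = sum(v <= z for v in right) / n_r
        best = max(best, weight * abs(f_left - f_right))
    return best

# A series with a clear distributional change at t = 5:
y = [0.1, 0.2, 0.15, 0.05, 0.12, 5.1, 5.3, 5.2, 5.4, 5.25]
print(cusum_ks(y, 1, 10, 5) > cusum_ks(y, 1, 10, 2))  # True: peak at true split
```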
The recursion terminates when the largest CUSUM Kolmogorov–Smirnov statistic over the candidate intervals falls below the threshold, or when the resulting time interval is too narrow. See Algorithm 1 for details.
Algorithm 1 Kolmogorov–Smirnov detector. KSD((s, e), {(αm, βm)}M_{m=1}, τ)
INPUT: Sample {Y_{t,i}}, interval (s, e), random intervals {(αm, βm)}M_{m=1} and threshold parameter τ > 0.
for m = 1, . . . , M do
    (sm, em) ← (s, e) ∩ (αm, βm)
    if em − sm > 2 then
        am ← max_{t=sm+1,...,em−1} D^t_{sm,em}
        bm ← arg max_{t=sm+1,...,em−1} D^t_{sm,em}
    else
        am ← −1
    end if
end for
m∗ ← arg max_{m=1,...,M} am
if am∗ > τ then
    add bm∗ to the set of estimated change points
    KSD((s, bm∗), {(αm, βm)}M_{m=1}, τ)        ▷ recursively call KSD
    KSD((bm∗ + 1, e), {(αm, βm)}M_{m=1}, τ)
end if
OUTPUT: The set of estimated change points.
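A minimal Python sketch of Algorithm 1 for the case $n_t \equiv 1$ is given below (a simplified illustration, not the authors' implementation; the interval generation, the threshold value and the toy data are our own assumptions):

```python
import math, random

def ks_stat(y, s, e, t):
    """D_{s,e}^t with one observation per time point (n_t = 1); 1-based indices."""
    left, right = y[s - 1:t], y[t:e]
    w = math.sqrt(len(left) * len(right) / (len(left) + len(right)))
    # Supremum over z attained at sample points (empirical CDFs are steps).
    return max(w * abs(sum(v <= z for v in left) / len(left)
                       - sum(v <= z for v in right) / len(right))
               for z in y[s - 1:e])

def ksd(y, s, e, intervals, tau, found):
    """Recursive sketch of Algorithm 1 (Kolmogorov-Smirnov detector), n_t = 1."""
    best_a, best_b = -1.0, None
    for alpha, beta in intervals:
        sm, em = max(s, alpha), min(e, beta)      # (s, e) ∩ (alpha_m, beta_m)
        if em - sm > 2:
            stats = {t: ks_stat(y, sm, em, t) for t in range(sm + 1, em)}
            t_best = max(stats, key=stats.get)
            if stats[t_best] > best_a:
                best_a, best_b = stats[t_best], t_best
    if best_a > tau:
        found.append(best_b)
        ksd(y, s, best_b, intervals, tau, found)      # recurse to the left
        ksd(y, best_b + 1, e, intervals, tau, found)  # and to the right
    return sorted(found)

random.seed(0)
T = 60
y = [random.gauss(0, 1) for _ in range(30)] + [random.gauss(4, 1) for _ in range(30)]
# One interval straddling the true change point plus random candidates.
intervals = [(10, 50)] + [tuple(sorted(random.sample(range(1, T + 1), 2)))
                          for _ in range(29)]
print(ksd(y, 1, T, intervals, tau=2.0, found=[]))
```

With a strong distributional shift at t = 30, the detector returns an estimate close to the true change point.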
3. Theory
We first state a condition involving the minimal spacing, the minimal jump size and the total number of time points T, which overall amounts to a minimal signal-to-noise ratio condition.
Condition 3.1. There exists an absolute constant $C_{SNR} > 0$ such that
$$\kappa\, \delta^{1/2}\, n^{1/2} \ge C_{SNR}\, \phi_T\, \log^{1/2}(n_{1:T}),$$
where $\{\phi_T\}$ is an arbitrary sequence diverging to infinity.
The scaling exhibited in Condition 3.1 covers nearly all combinations of model
parameters for which the localization task is feasible: in Section 3.2 we show that
no estimator of the change points is guaranteed to be consistent when the model
parameters violate Condition 3.1, up to a poly-logarithmic term in T .
We then show that, under Condition 3.1 and with appropriate choices of
the input parameters, the Kolmogorov–Smirnov detector procedure will yield,
with high probability, the correct number of change points and a vanishing
localization rate. In fact, as shown below in Section 3.2, the resulting localization rates are nearly minimax rate optimal (in the sense of the minimax risk; see, e.g., Section 2.1 in Tsybakov, 2009).
Theorem 3.1 (Guarantees for the Kolmogorov–Smirnov detector). Assume the inputs of Algorithm 1 are as follows. (i) The sequence $\{Y_{t,i}\}_{i=1,t=1}^{n_t,T}$ satisfies Conditions 1.1 and 3.1. (ii) The collection of intervals $\{(\alpha_m, \beta_m)\}_{m=1}^{M} \subset \{1, \dots, T\}$, with endpoints drawn independently and uniformly from $\{1, \dots, T\}$, satisfies
$$\max_{m=1,\dots,M} (\beta_m - \alpha_m) \le C_M\, \delta \tag{3.1}$$
almost surely, for an absolute constant $C_M > 0$. (iii) The tuning parameter τ satisfies
$$c_{\tau,1} \log^{1/2}(n_{1:T}) \le \tau \le c_{\tau,2}\, \kappa\, \delta^{1/2} n^{1/2}, \tag{3.2}$$
where $c_{\tau,1}, c_{\tau,2} > 0$ are absolute constants.
Let $\{\hat\eta_k\}_{k=1}^{\hat K}$ be the corresponding output of Algorithm 1. Then, there exists a constant $C > 0$ such that
$$P\Bigl( \hat K = K \ \text{and}\ \epsilon_k \le C\, \kappa_k^{-2} \log(n_{1:T})\, n^{-1},\ \forall k = 1, \dots, K \Bigr) \ge 1 - \frac{24 \log(n_{1:T})}{T^3\, n_{1:T}} - \frac{48\, T}{n_{1:T} \log(n_{1:T})\, \delta} - \exp\Bigl( \log\frac{T}{\delta} - \frac{M \delta^2}{16\, T^2} \Bigr), \tag{3.3}$$
where $\epsilon_k = |\hat\eta_k - \eta_k|$ for $k = 1, \dots, K$.
If $M \gtrsim \log(T/\delta)\, T^2 \delta^{-2}$, then the probability in (3.3) approaches 1 as $T \to \infty$, which shows that Algorithm 1 is consistent.
Based on Condition 3.1, the range of tuning parameters τ defined in (3.2) is not empty, and the upper bound on the localization error rate satisfies
$$\max_{k=1,\dots,K} \frac{\epsilon_k}{\delta} \lesssim \frac{\log(n_{1:T})}{\kappa^2\, n\, \delta} \le \phi_T^{-2} \to 0, \qquad T \to \infty,$$
where the inequality follows from Condition 3.1. It is important to point out that Theorem 3.1 yields a family of localization rates that depend on how n, κ and δ scale with T. The slow rate $\phi_T^{-2}$ exhibited in the last display represents the worst-case scenario, corresponding to the weakest possible signal-to-noise ratio afforded by Condition 3.1. In fact, for most combinations of the model parameters, the rate can be much faster. For instance, when κ is bounded away from zero, the resulting rate is of order $O(\log(n_{1:T})\, n^{-1}\, T^{-1}\, \phi_T)$, if $\delta \asymp T$. Still assuming a non-vanishing κ, and provided that n increases with T, our Kolmogorov–Smirnov
where the infimum is over all possible estimators of the change point locations.
In our next result, we derive a minimax lower bound on the localization
task, which applies to nearly all combinations of model parameters outside the
impossibility region found in Lemma 3.1.
Lemma 3.2. Let $\{Y_{t,i}\}_{i=1,t=1}^{n,T}$ be a time series satisfying Condition 1.1 with one and only one change point, and let $P^{T}_{\kappa,n,\delta}$ denote the corresponding joint distribution. Consider the class of distributions
$$Q = \Bigl\{ P^{T}_{\kappa,n,\delta} :\ \delta \le T/2,\ \kappa < 1/2,\ \kappa\,(\delta n)^{1/2} \ge \zeta_T \Bigr\},$$
where $\{\zeta_T\}$ is any sequence such that $\lim_{T \to \infty} \zeta_T = \infty$. Then, for all T large enough, it holds that
$$\inf_{\hat\eta}\ \sup_{P \in Q}\ E_P\bigl( |\hat\eta - \eta(P)| \bigr) \ge \max\Bigl\{ 1,\ \bigl( 2 e^2\, n\, \kappa^2 \bigr)^{-1} \Bigr\},$$
where the infimum is over all possible estimators of the change point locations.
Note that the condition δ ≤ T /2 automatically holds due to the definition
of δ. The above lower bound matches, saving for a poly-logarithmic factor,
the localization rate we have established in Theorem 3.1, thus showing that
Algorithm 1 is nearly minimax rate optimal.
3.3. Choice of tuning parameters
Algorithm 1 calls for three tuning parameters: the upper bound CM on the random interval widths, the number of random intervals M and the threshold τ. In this subsection, we discuss practical guidance on choosing these tuning parameters.
The constant CM appears in (3.1) and serves purely theoretical purposes: since we allow the number of change points to diverge, the constraint involving CM is needed in Theorem 3.1 to obtain the nearly minimax optimal rates. In practice, any given data set contains only finitely many change points, hence (3.1) holds automatically and CM is not a tuning parameter in actual use.
For the number of random intervals M, as stated in Theorem 3.1, we need to choose M to satisfy $M \gtrsim \log(T/\delta)\, T^2 \delta^{-2}$. In practice, δ is likely unknown, but when $\delta \asymp T$ this requirement reduces to $M \gtrsim 1$. In all the numerical experiments in Section 4, we let M = 120, which works well in practice.
Arguably, the most important tuning parameter in Algorithm 1 is τ , whose
value determines whether a candidate time point should be deemed a change
point. If we let τ decrease from ∞ to 0, then the procedure produces more and
more change points. In particular, if all the other inputs, namely {Yt,i } and
{(αm , βm )}, are kept fixed, then it holds that B(τ1 ) ⊆ B(τ2 ), for τ1 ≥ τ2 , where
B(τ ) is the collection of estimated change points returned by Algorithm 1 when
a value of τ for the threshold parameter is used. We take advantage of such
nesting in order to design a data-driven method for picking τ .
To proceed, we now introduce Algorithm 2. Algorithm 2 is a generic procedure
that can be used for merging two collections of estimated change points B1
and B2 . Algorithm 2 deletes from B1 ∪ B2 potential false positives by checking
their validity one by one based on the CUSUM Kolmogorov–Smirnov statistics.
However, Algorithm 2 does not scan for potential false positives in the set B1 ∩B2 .
The criterion deployed in Algorithm 2 is based on the following check:
$$\sum_{t=\hat\eta_k+1}^{\eta} \sum_{i=1}^{n_t} \bigl( \mathbf{1}\{Y_{t,i} \le \hat z\} - F_{(\hat\eta_k+1):\eta}(\hat z) \bigr)^2 + \sum_{t=\eta+1}^{\hat\eta_{k+1}} \sum_{i=1}^{n_t} \bigl( \mathbf{1}\{Y_{t,i} \le \hat z\} - F_{(\eta+1):\hat\eta_{k+1}}(\hat z) \bigr)^2 + \lambda < \sum_{t=\hat\eta_k+1}^{\hat\eta_{k+1}} \sum_{i=1}^{n_t} \bigl( \mathbf{1}\{Y_{t,i} \le \hat z\} - F_{(\hat\eta_k+1):\hat\eta_{k+1}}(\hat z) \bigr)^2. \tag{3.6}$$

Algorithm 2 Merging two collections of estimated change points B1 and B2.
C ← (B2 \ B1) ∪ (B1 \ B2)
nc ← |C|
B ← B1 ∩ B2
for i = 1, . . . , nc do
    η ← ηi ∈ C
    if η ∈ B2 \ B1 then
        set k to be the integer satisfying η ∈ (η̂k, η̂k+1), where {η̂k, η̂k+1} ⊂ B1
    else
        set k to be the integer satisfying η ∈ (η̂k, η̂k+1), where {η̂k, η̂k+1} ⊂ B2
    end if
    ẑ ← min arg max_{z ∈ {Y_{t,i}}_{t=1,i=1}^{T,n_t}} D^{η}_{η̂k, η̂k+1}(z)
    if (3.6) holds then
        add η to B
    end if
end for
OUTPUT: B
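The check (3.6) compares the squared-residual fit of the indicator series $\mathbf{1}\{Y_{t,i} \le \hat z\}$ with and without a split at the candidate η. A simplified sketch with one observation per time point (our own illustration; names and the toy data are assumptions):

```python
def split_gain_check(y, eta_k, eta, eta_k1, z_hat, lam):
    """
    Simplified version of check (3.6) for n_t = 1: accept the candidate change
    point eta in (eta_k, eta_k1) if splitting the segment at eta lowers the
    squared-residual fit of the indicators 1{Y_t <= z_hat} by more than lam.
    """
    def rss(lo, hi):
        # Residual sum of squares of 1{Y_t <= z_hat} around the segment mean,
        # i.e. around the empirical CDF of the segment; 1-based inclusive.
        ind = [1.0 if y[t - 1] <= z_hat else 0.0 for t in range(lo, hi + 1)]
        mean = sum(ind) / len(ind)
        return sum((v - mean) ** 2 for v in ind)

    split = rss(eta_k + 1, eta) + rss(eta + 1, eta_k1)
    unsplit = rss(eta_k + 1, eta_k1)
    return split + lam < unsplit

# Indicators flip from 1 to 0 at t = 5, so splitting there removes almost
# all residual variation and the candidate is accepted:
y = [0.0] * 5 + [1.0] * 5
print(split_gain_check(y, eta_k=0, eta=5, eta_k1=10, z_hat=0.5, lam=1.0))  # True
```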
Algorithm 3 Threshold selection.
for j = 1, . . . , J do
    Bj ← KSD((0, T), {(αm, βm)}M_{m=1}, τj, {W_{t,i}})        ▷ Algorithm 1
end for
O ← BJ
for j = 1, . . . , J − 1 do
    if Bj+1 ≠ O then
        η ← min{x : x ∈ O \ Bj+1}
        set k to be the integer satisfying η ∈ (η̂k, η̂k+1), where {η̂k, η̂k+1} ⊂ Bj+1
        ẑ ← min arg max_{z ∈ {Y_{t,i}}_{t=1,i=1}^{T,n_t}} D^{η}_{η̂k, η̂k+1}(z, {Y_{t,i}})
        if (3.6) holds then
            O ← Bj+1
        else
            terminate the algorithm
        end if
    else
        O ← Bj+1
    end if
end for
OUTPUT: O
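The mechanics behind Algorithm 3, walking along a nested solution path and stopping once a newly appearing candidate fails the validity check, can be sketched as follows (a schematic illustration only: the detector is replaced by fixed candidate scores, and `is_valid` stands in for criterion (3.6); all names and numbers are our own):

```python
def solution_path(scores, taus):
    """B(tau) = candidates whose detection score exceeds tau; thresholding the
    same scores gives nested sets: B(tau1) ⊇ B(tau2) whenever tau1 <= tau2."""
    return [sorted(t for t, a in scores.items() if a > tau) for tau in taus]

def select_by_path(scores, taus, is_valid):
    """
    Hedged sketch of the threshold-path idea: start from the sparsest set
    (largest tau) and move down the path, accepting each extra change point
    only while it passes the validity check.
    """
    path = solution_path(scores, sorted(taus, reverse=True))  # sparse -> dense
    accepted = path[0]
    for b in path[1:]:
        extra = [t for t in b if t not in accepted]
        if all(is_valid(t) for t in extra):
            accepted = b
        else:
            break  # first invalid candidate: stop and keep the current set
    return accepted

# Toy scores: 60 is a strong change point, 90 weaker, 17 spurious.
scores = {60: 5.0, 90: 2.2, 17: 1.1}
taus = [0.5, 1.5, 3.0]
result = select_by_path(scores, taus, is_valid=lambda t: t != 17)
print(result)  # [60, 90]: the spurious point 17 is rejected, so the path stops
```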
Theorem 3.2. Suppose that the following holds. (i) The sequences $\{Y_{t,i}\}_{i=1,t=1}^{n_t,T}$ and $\{W_{t,i}\}_{i=1,t=1}^{n_t,T}$ are independent and satisfy Conditions 1.1 and 3.1. (ii) The collection of intervals $\{(\alpha_m, \beta_m)\}_{m=1}^{M} \subset \{1, \dots, T\}$, whose endpoints are drawn independently and uniformly from $\{1, \dots, T\}$, satisfies $\max_{m=1,\dots,M} (\beta_m - \alpha_m) \le C_M \delta$ almost surely, for an absolute constant $C_M > 1$. (iii) The tuning parameters $\{\tau_j\}_{j=1}^{J}$ satisfy
$$\tau_J > \dots > c_{\tau,2}\, \kappa\, \delta^{1/2} n^{1/2} > \dots > \tau_{j^*} > \dots > c_{\tau,1} \log^{1/2}(n_{1:T}) > \dots > \tau_1, \tag{3.7}$$
where $c_{\tau,1}, c_{\tau,2} > 0$ are absolute constants, for some $j^* \in \{2, \dots, J-1\}$.
Let $B = \{\hat\eta_1, \dots, \hat\eta_{\hat K}\}$ be the output of Algorithm 3 with inputs satisfying the conditions above. If $\lambda = C \log(n_{1:T})$, with a large enough constant $C > 0$, then
$$P\Bigl( \hat K = K \ \text{and}\ \epsilon_k \le C\, \kappa_k^{-2} \log(n_{1:T})\, n^{-1},\ \forall k = 1, \dots, K \Bigr) \ge 1 - \frac{48 \log(n_{1:T})}{T^3\, n_{1:T}} - \frac{96\, T}{n_{1:T} \log(n_{1:T})\, \delta} - \exp\Bigl( \log\frac{T}{\delta} - \frac{M \delta^2}{16\, T^2} \Bigr).$$
The proof of Theorem 3.2 can be found in Appendix D. It implicitly assumes that the nested sets $\{B_j\}$ in Algorithm 3 satisfy $|B_j \setminus B_{j+1}| = 1$ for $j = 1, \dots, J-1$. If this condition is not met, then the conclusion of Theorem 3.2 still holds, provided that we replace the inequality condition in Algorithm 3 with the inequality
$$\lambda > \max_{m=1,\dots,M}\ \sup_{z \in \mathbb{R}} \bigl| D^{\eta}_{a_m, b_m}(z, \{Y_{t,i}\}) \bigr|^2, \qquad \text{where } (a_m, b_m) = (\hat\eta_k, \hat\eta_{k+1}) \cap (\alpha_m, \beta_m) \text{ for } m = 1, \dots, M.$$
A similar proposal for selecting the threshold tuning parameter for the wild binary segmentation algorithm can be found in Theorem 3.3 in Fryzlewicz (2014). The proof of our Theorem 3.2 delivers a more careful and precise analysis.
Finally, Theorem 3.2 suggests choosing the parameter λ in Algorithm 3 as λ = C log(n1:T). In practice, we set C = 2/3 and find that this choice performs reasonably well. We also include further simulations in Appendix E showing that Algorithm 3 is less sensitive to the parameter C than Algorithm 1 is to the parameter τ.
4. Numerical experiments
4.1. Simulations
Fig 1. A plot showing the densities considered in Scenario 1 (Padilla et al., 2018).
Fig 2. A plot showing realizations of different scenarios with T = 8000. From left to right
and from top to bottom, the panels are from Scenarios 2, 3, 4 and 5, respectively.
2014; Pein et al., 2019), heterogeneous simultaneous multiscale change point estimators (Pein, Sieling and Munk, 2017; Pein et al., 2019), the nonparametric multiple change point detection of Zou et al. (2014) and Haynes, Fearnhead and Eckley (2017), the robust functional pruning optimal partitioning method of Fearnhead and Rigaill (2018) and the kernel change point detection procedure of Celisse et al. (2018) and Arlot, Celisse and Harchaoui (2019). For all the competing methods, we set the respective tuning parameters to their default choices. For the kernel change point detection procedure, we use the function KernSeg MultiD in the R (R Core Team, 2019) package kernseg (Marot et al., 2018). The function produces a sequence of candidate models, each of which is associated with a measure of fit. We choose the best model from this sequence based on the elbow criterion applied to the measures of fit.
We apply Algorithm 3 with λ = (2/3) log(n1:T), a choice that is guided by Theorem 3.2 and that we find reasonable in practice (see Appendix E for sensitivity simulations comparing Algorithms 1 and 3). Moreover, we construct the samples {Yt,i} and {Wt,i} by splitting the data into two time series consisting of the odd and even time indices, respectively. We set the number of random intervals to M = 120.
We now describe the different generative models deployed in our simulations. For all scenarios, we consider T ∈ {1000, 4000, 8000}. Moreover, we
consider the partition P of {1, . . . , T } induced by the change points η0 = 1 <
η1 < . . . < ηK < ηK+1 = T + 1, which are evenly spaced in {1, . . . , T }. The ele-
ments of P are A1 , . . . , AK+1 , with Aj = [ηj−1 , ηj −1]. We consider the following
scenarios.
Scenario 1. Let K = 7 for each instance of T . Define Ft to have probability
density function as in the left panel of Figure 1 for t ∈ Aj with odd j and as in
the right panel of Figure 1 for t ∈ Aj with even j.
Scenario 2. Let $K = 2^{-1/2}\, T^{1/2} \log^{-1/2}(T)$ and define $\theta \in \mathbb{R}^T$ as $\theta_t = 1$ for $t \in A_j$ with odd j, and $\theta_t = 0$ otherwise. Let data be generated as $y_t =$
Table 1
KSD, Kolmogorov–Smirnov detector; WBS, wild binary segmentation; PELT, pruned exact
linear time algorithm; S3IB, pruned dynamic programming algorithm; NMCD,
nonparametric multiple change point detection; SMUCE, simultaneous multiscale change
point estimators; B&P, Bai and Perron’s method; HSMUCE, heterogeneous simultaneous
multiscale change point estimators; RFPOP, robust functional pruning optimal partitioning
methods; KCP, kernel change point detection.
Scenario 1.
T Metric KSD WBS PELT S3IB NMCD SMUCE B&P HSMUCE RFPOP KCP
1000 |K − K̂| 1.5 11.0 4.35 6.3 1.8 4.35 6.85 1.6 1.8 4.4
1000 d(Ĉ|C) 32.0 63.0 64.5 ∞ 23.0 43.5 ∞ 75.5 13.0 637.0
1000 d(C|Ĉ) 52.0 82.0 90.0 −∞ 36.0 93.5 −∞ 56.5 17.0 23.0
4000 |K − K̂| 0.2 0.5 21.4 6.9 2.2 52.0 4.9 8.3 0.24 1.4
4000 d(Ĉ|C) 31.0 43.5 46.5 ∞ 13.5 448.0 1386.0 70.0 10.0 108.0
4000 d(C|Ĉ) 31.0 364.5 448 −∞ 69.5 32.7 101.5 319.0 12.0 27.0
8000 |K − K̂| 0.0 0.9 41.2 5.35 3.25 62.7 3.7 17.6 0.24 0.9
8000 d(Ĉ|C) 38.0 29.5 55.0 ∞ 10.0 70.0 1831 84.5 11.0 24.0
8000 d(C|Ĉ) 38.0 818.5 955.0 −∞ 227.0 958.0 206.5 910.0 11.0 34.0
Scenario 2.
T Metric KSD WBS PELT S3IB NMCD SMUCE B&P HSMUCE RFPOP KCP
1000 |K − K̂| 1.3 9.9 2.7 1.85 1.7 4.6 7.05 0.45 0.1 1.76
1000 d(Ĉ|C) 11.0 8.5 9.0 15.5 7.0 10.5 738.5 23.0 6.0 12
1000 d(C|Ĉ) 13.0 54.5 45.5 29.5 28.0 53.5 16.0 22.0 6.0 5.0
4000 |K − K̂| 0.0 15.4 10.05 14.7 3.35 14.8 11.95 0.0 0.1 0.6
4000 d(Ĉ|C) 16.0 9.5 12.0 ∞ 9.0 35.0 1007 16.0 8.0 8.0
4000 d(C|Ĉ) 16.0 176 163.0 −∞ 107.5 164 20.0 16.0 8.0 8.0
8000 |K − K̂| 1.3 8.4 18.45 20.8 4.75 28.5 18.4 0.1 0.2 0.0
8000 d(Ĉ|C) 363.0 15.5 9.5 ∞ 13.5 40.0 2179 20.0 9.0 7.0
8000 d(C|Ĉ) 18.0 254.5 2179 −∞ 129.5 257 19.5 20.0 10.0 7.0
Scenario 3.
T Metric KSD WBS PELT S3IB NMCD SMUCE B&P HSMUCE RFPOP KCP
1000 |K − K̂| 0.8 0.0 0.0 0.1 0.8 0.0 0.0 0.2 0.0 2.3
1000 d(Ĉ|C) 16.0 9.0 8.5 8.5 9.5 8.5 8.5 9.5 9.0 140.0
1000 d(C|Ĉ) 19.0 9.5 8.5 0.9 10.5 8.5 8.5 9.5 9.0 7.0
4000 |K − K̂| 0.1 0.0 0.0 0.0 1.8 0.0 0.0 0.0 0.1 0.2
4000 d(Ĉ|C) 22.0 8.0 9.0 8.0 11.5 8.0 8.0 9.5 8.0 11.0
4000 d(C|Ĉ) 20.0 8.0 9.0 8.0 136.5 8.0 8.0 8.0 8.0 9.0
8000 |K − K̂| 0.2 0.0 0.0 0.1 1.7 0.0 0.0 0.2 0.0 0.2
8000 d(Ĉ|C) 11.5 6.0 6.0 6.0 8.5 6.0 6.0 6.0 6.0 6.0
8000 d(C|Ĉ) 11.5 6.0 6.0 6.0 353 6.0 6.0 6.5 6.0 6.0
Finally, Scenario 5 seems to be the most challenging one for all methods. In fact, the Kolmogorov–Smirnov detector and kernel change point detection seem to be the only methods capable of correctly estimating the number of change points, with the kernel change point detection procedure yielding smaller localization errors.
In our second set of simulations, we study the case where the number of data points collected at any time point can be more than one. We consider the same 5 scenarios, the same tuning parameter selection method and the same performance
Table 2
KSD, Kolmogorov–Smirnov detector; WBS, wild binary segmentation; PELT, pruned exact
linear time algorithm; S3IB, pruned dynamic programming algorithm; NMCD,
nonparametric multiple change point detection; SMUCE, simultaneous multiscale change
point estimators; B&P, Bai and Perron’s method; HSMUCE, heterogeneous simultaneous
multiscale change point estimators; RFPOP, robust functional pruning optimal partitioning
methods; KCP, kernel change point detection.
Scenario 4.
T Metric KSD WBS PELT S3IB NMCD SMUCE B&P HSMUCE RFPOP KCP
1000 |K − K̂| 0.9 4.0 9.8 4.9 2.45 27.75 5.0 4.7 13.8 0.4
1000 d(Ĉ|C) 36.0 ∞ 40.0 ∞ 4.0 24.5 ∞ ∞ 82.0 5.0
1000 d(C|Ĉ) 32.0 −∞ 153.5 −∞ 67.0 157 −∞ −∞ 157.0 5.0
4000 |K − K̂| 0.0 3.8 36.3 5.0 2.7 71.1 5.0 4.5 44.8 0.1
4000 d(Ĉ|C) 19.0 ∞ 106.5 ∞ 4.5 46.0 ∞ ∞ 125.0 5.0
4000 d(C|Ĉ) 19.0 −∞ 644.5 −∞ 66.0 651.5 −∞ −∞ 647.0 5.0
8000 |K − K̂| 0.1 3.5 60.3 5.0 4.0 109.3 5.0 4.5 71.4 0.0
8000 d(Ĉ|C) 23.0 6301.5 115.0 ∞ 2.5 47.5 ∞ ∞ 135 6.0
8000 d(C|Ĉ) 28.0 238 1300.5 −∞ 566.5 1316 −∞ −∞ 1293 6.0
Scenario 5.
T Metric KSD WBS PELT S3IB NMCD SMUCE B&P HSMUCE RFPOP KCP
1000 |K − K̂| 0.4 3.6 7.2 5.0 1.5 26.95 5.0 4.8 1.96 0.1
1000 d(Ĉ|C) 27.0 ∞ 70.0 ∞ 4.5 21.0 ∞ ∞ ∞ 9.0
1000 d(C|Ĉ) 29.0 −∞ 147.0 −∞ 32 159.0 −∞ −∞ −∞ 8.0
4000 |K − K̂| 0.1 3.5 38.2 5.0 3.35 72.05 5.0 4.5 2.0 0.1
4000 d(Ĉ|C) 24.0 ∞ 82.0 ∞ 3.0 39.5 ∞ ∞ ∞ 12.0
4000 d(C|Ĉ) 25.0 −∞ 629.5 −∞ 275 640.5 −∞ −∞ −∞ 12.0
8000 |K − K̂| 0.0 4.2 63.9 5.0 4.4 114.0 5.0 4.6 1.84 0.1
8000 d(Ĉ|C) 37.0 ∞ 107.5 ∞ 2.5 55.0 ∞ ∞ ∞ 12.0
8000 d(C|Ĉ) 37.0 −∞ 1309 −∞ 552 1310.5 −∞ −∞ −∞ 12.0
Table 3
Performance evaluations for the KSD method in settings where nt can be larger than 1.
Scenario 1.
Metric nt = 5 nt = 15 nt = 30 nt ∼ Pois(5) nt ∼ Pois(15) nt ∼ Pois(30)
|K − K̂| 0.1 0.2 0.0 0.6 0.3 0.0
d(Ĉ|C) 6.5 2.5 2.0 6.0 2.0 1.5
Scenario 2.
Metric nt = 5 nt = 15 nt = 30 nt ∼ Pois(5) nt ∼ Pois(15) nt ∼ Pois(30)
|K − K̂| 0.1 0.0 0.0 0.4 0.0 0.0
d(Ĉ|C) 3.0 1.0 0.0 3.0 1.0 0.0
Scenario 3.
Metric nt = 5 nt = 15 nt = 30 nt ∼ Pois(5) nt ∼ Pois(15) nt ∼ Pois(30)
|K − K̂| 0.3 0.3 0.0 0.4 0.0 0.0
d(Ĉ|C) 6.5 1.0 0.5 5.0 2.0 1.0
Scenario 4.
Metric nt = 5 nt = 15 nt = 30 nt ∼ Pois(5) nt ∼ Pois(15) nt ∼ Pois(30)
|K − K̂| 0.2 0.0 0.0 0.0 0.0 0.0
d(Ĉ|C) 6.0 2.0 0.0 5.0 1.0 1.0
Scenario 5.
Metric nt = 5 nt = 15 nt = 30 nt ∼ Pois(5) nt ∼ Pois(15) nt ∼ Pois(30)
|K − K̂| 0.1 0.0 0.0 0.0 0.0 0.0
d(Ĉ|C) 9.5 3.0 2.0 6.0 5.5 6.0
4.2. Real data analysis
Fig 3. A plot showing individual 1 in the array comparative genomic hybridization data set, with estimated change points indicated by vertical lines. From left to right, the panels are based on the estimators from the Kolmogorov–Smirnov detector, wild binary segmentation, nonparametric multiple change point detection and robust functional pruning optimal partitioning methods, respectively.
Figure 3 shows that all the methods seem to recover the important change points in the time series associated with individual 1 in the data set. A potential advantage of the Kolmogorov–Smirnov detector is that it appears less prone to detecting potentially spurious change points than wild binary segmentation, nonparametric multiple change point detection and robust functional pruning optimal partitioning.
Appendix A: Comparisons
We compare our rates with those in the univariate mean change point detection problem, which assumes sub-Gaussian data (e.g. Wang, Yu and Rinaldo, 2020). On the one hand, this comparison inherits the usual considerations that arise when comparing parametric and nonparametric modelling methods in general. In particular, under the general model assumptions we impose on the underlying distribution functions, our method is free from the risk of model mis-specification. On the other hand, perhaps surprisingly, we achieve the same rates as those in the univariate mean change point detection case, even though sub-Gaussianity is assumed there. In fact, this is to be expected. Our CUSUM Kolmogorov–Smirnov statistic is built from empirical distribution functions, which at every $z \in \mathbb{R}$ are weighted sums of Bernoulli random variables. Since Bernoulli random variables are sub-Gaussian, and since empirical distribution functions are step functions with knots only at the sample points, the same rates are indeed to be expected.
Furthermore, the heterogeneous simultaneous multiscale change point estimator from Pein, Sieling and Munk (2017) can also be compared to the Kolmogorov–Smirnov detector. Assuming Gaussian errors, $\delta \asymp T$ and $K = O(1)$, Theorems 3.7 and 3.9 in Pein, Sieling and Munk (2017) show that the heterogeneous simultaneous multiscale change point estimator can consistently estimate the number of change points, with localization error $\epsilon \lesssim r(T)$ for any sequence $r(T)$ such that $r(T)/\log(T) \to \infty$. This is weaker than our upper bound, which guarantees $\epsilon \lesssim \log(T)$. In addition, the Kolmogorov–Smirnov detector can handle changes in variance when the mean remains constant, a setting in which it is unknown whether the heterogeneous simultaneous multiscale change point estimator is consistent.
Another interesting contrast can be made between the Kolmogorov–Smirnov detector and the multiscale quantile segmentation method of Vanegas, Behr and Munk (2019). Both algorithms make no assumptions on the form of the underlying cumulative distribution functions. However, multiscale quantile segmentation is designed to identify changes in one known quantile. This is not a requirement for the Kolmogorov–Smirnov detector, which can detect any type of change without prior knowledge of its nature. As for statistical guarantees, translated to our notation, provided that $\delta \gtrsim \log(T)$, multiscale quantile segmentation can consistently estimate the number of change points with $\epsilon \lesssim \log(T)$. This matches our theoretical guarantees in Theorem 3.1.
We compare the theoretical properties of the Kolmogorov–Smirnov detector
with the ones of the nonparametric multiple change point detection procedure in
Zou et al. (2014). Both methods guarantee consistent change point localization
of univariate sequences in fully nonparametric settings.
• We measure the magnitude κ of the distribution changes at the change
points using the Kolmogorov–Smirnov distance, as in (1.1). In contrast,
Zou et al. (2014) deploy a weighted Kullback–Leibler divergence, see As-
sumption A4 in their paper, which is stronger than the Kolmogorov–
Smirnov distance, and therefore more discriminative. At the same time,
Finally, we compare with the KCP method, studied in Arlot, Celisse and
Harchaoui (2019) and Garreau and Arlot (2018). Specifically, the following dis-
cussions are based on translating Theorem 3.1 in Garreau and Arlot (2018) to
our notation, and assuming nt = 1, t ∈ {1, . . . , T } in our setting.
where
$$\Delta^{t}_{s,e}(z) = \Bigl( \frac{n_{s:t}\, n_{(t+1):e}}{n_{s:e}} \Bigr)^{1/2} \bigl| F_{s:t}(z) - F_{(t+1):e}(z) \bigr|, \tag{B.1}$$
and
$$F_{s:e}(z) = \frac{1}{n_{s:e}} \sum_{t=s}^{e} n_t\, F_t(z).$$
and
$$\max\Bigl\{ \min\{\eta_k - s,\ s - \eta_{k-1}\},\ \min\{\eta_{k+q+1} - e,\ e - \eta_{k+q}\} \Bigr\} \le \epsilon,$$
where q = −1 indicates that there is no change point contained in (s, e).
By Condition 3.1, there exists an absolute constant $c > 0$ such that
$$\epsilon \le c\, \frac{n_{\max}^{7}}{n_{\min}^{7}}\, \phi_T^{-2}\, \delta \le \delta/4.$$
It has to be the case that for any change point $\eta_k \in (0, T)$, either $|\eta_k - s| \le \epsilon$ or $|\eta_k - s| \ge \delta - \epsilon \ge 3\delta/4$. This means that $\min\{|\eta_k - s|,\ |\eta_k - e|\} \le \epsilon$ indicates that $\eta_k$ has been detected in a previous induction step, even if $\eta_k \in (s, e)$. We refer to $\eta_k \in (s, e)$ as an undetected change point if $\min\{|\eta_k - s|,\ |\eta_k - e|\} \ge 3\delta/4$.
In order to complete the induction step, it suffices to show that we (i) will not detect any new change point in (s, e) if all the change points in that interval have been previously detected, and (ii) will find a point $b \in (s, e)$ such that $|\eta_k - b| \le \epsilon$ if there exists at least one undetected change point in (s, e).
For j = 1, 2, define the events
$$\mathcal{A}_j(\gamma) = \Bigl\{ \max_{1 \le s < b < e \le T}\ \sup_{z \in \mathbb{R}} \Bigl| \sum_{k=s}^{e} \sum_{i=1}^{n_k} w_k^{(j)} \bigl( \mathbf{1}\{Y_{k,i} \le z\} - E\bigl(\mathbf{1}\{Y_{k,i} \le z\}\bigr) \bigr) \Bigr| \le \gamma \Bigr\},$$
where
$$w_k^{(1)} = \begin{cases} \Bigl( \dfrac{n_{(b+1):e}}{n_{s:b}\, n_{s:e}} \Bigr)^{1/2}, & k = s, \dots, b,\\[6pt] -\Bigl( \dfrac{n_{s:b}}{n_{(b+1):e}\, n_{s:e}} \Bigr)^{1/2}, & k = b+1, \dots, e, \end{cases} \qquad \text{and} \qquad w_k^{(2)} = n_{s:e}^{-1/2}.$$
Define
$$S = \bigcap_{k=1}^{K} \Bigl\{ \alpha_s \in [\eta_k - 3\delta/4,\ \eta_k - \delta/2] \text{ and } \beta_s \in [\eta_k + \delta/2,\ \eta_k + 3\delta/4],\ \text{for some } s = 1, \dots, M \Bigr\}.$$
Set $\gamma = C_\gamma \log^{1/2}(n_{1:T})$, with a sufficiently large constant $C_\gamma > 0$. The rest of the proof is carried out on the event $\mathcal{A}_1(\gamma) \cap \mathcal{A}_2(\gamma) \cap S$, whose probability can be lower bounded using Lemma B.3 together with Lemma 13 in Wang, Yu and Rinaldo (2020).
Step 1. In this step, we show that Algorithm 1 consistently detects or rejects the existence of undetected change points within (s, e). Let $a_m$, $b_m$ and $m^*$ be defined as in Algorithm 1. Suppose there exists a change point $\eta_k \in (s, e)$ such that $\min\{\eta_k - s,\ e - \eta_k\} \ge 3\delta/4$. On the event S, there exists a selected interval $(\alpha_m, \beta_m)$ such that $\alpha_m \in [\eta_k - 3\delta/4, \eta_k - \delta/2]$ and $\beta_m \in [\eta_k + \delta/2, \eta_k + 3\delta/4]$. Following Algorithm 1, $(s_m, e_m) = (\alpha_m, \beta_m) \cap (s, e)$. We have that $\min\{\eta_k - s_m,\ e_m - \eta_k\} \ge \delta/4$ and $(s_m, e_m)$ contains at most one true change point.
It follows from Lemma B.4, with $c_1$ there chosen to be 1/4, that
$$\max_{t = s_m + 1, \dots, e_m - 1} \Delta^{t}_{s_m, e_m} \ge \frac{\kappa\, \delta\, n_{\min}^{3/2}}{8\, (e_m - s_m)^{1/2}\, n_{\max}}.$$
Thus, for any undetected change point $\eta_k \in (s, e)$, it holds that
$$a_{m^*} = \max_{m=1,\dots,M} a_m \ge \frac{1}{8\, C_M^{1/2}}\, \frac{n_{\min}^{3/2}}{n_{\max}}\, \kappa\, \delta^{1/2} - \gamma \ge c_{\tau,2}\, \kappa\, \delta^{1/2} n^{1/2}, \tag{B.2}$$
where the last inequality follows from the choice of γ, the fact that $n_{\min} \asymp n_{\max}$, and that $c_{\tau,2} > 0$ is achievable with a sufficiently large $C_{SNR}$ in Condition 3.1. This means we accept the existence of undetected change points.
Recall the notation $\{\epsilon_k\}_{k=1}^{K}$ introduced in (3.3). Here, with some abuse of notation, we let
$$\epsilon_k = C\, \kappa_k^{-2} \log(n_{1:T})\, n^{-1}, \qquad k = 1, \dots, K.$$
Suppose that there are no undetected change points within (s, e). Then, for any $(s_m, e_m)$, one of the following situations must hold:
(a) there is no change point within $(s_m, e_m)$;
(b) there exists only one change point $\eta_k \in (s_m, e_m)$ and $\min\{\eta_k - s_m,\ e_m - \eta_k\} \le \epsilon_k$; or
(c) there exist two change points $\eta_k, \eta_{k+1} \in (s_m, e_m)$ and $\eta_k - s_m \le \epsilon_k$, $e_m - \eta_{k+1} \le \epsilon_{k+1}$.
Observe that if (a) holds, then we have
therefore
Under (3.2), we will always correctly reject the existence of undetected change
points.
Step 2. Assume that there exists a change point $\eta_k \in (s, e)$ such that $\min\{\eta_k - s,\ e - \eta_k\} \ge 3\delta/4$. Let $s_m$, $e_m$ and $m^*$ be defined as in Algorithm 1. To complete the proof, it suffices to show that there exists a change point $\eta_k \in (s_{m^*}, e_{m^*})$ such that $\min\{\eta_k - s_{m^*},\ e_{m^*} - \eta_k\} \ge \delta/4$ and $|b_{m^*} - \eta_k| \le \epsilon$.
To this end, we verify that the assumptions of Lemma B.9 hold. Note that (B.26) follows from (B.2), (B.27) and (B.28) follow from the definitions of the events $\mathcal{A}_1(\gamma)$ and $\mathcal{A}_2(\gamma)$, and (B.29) follows from Condition 3.1.
Thus, all the conditions of Lemma B.9 are met. Therefore, we conclude that there exists a change point $\eta_k$ satisfying
$$\min\{e_{m^*} - \eta_k,\ \eta_k - s_{m^*}\} > \delta/4 \tag{B.3}$$
and
$$|b_{m^*} - \eta_k| \le C\, \frac{n_{\max}^{9}}{n_{\min}^{10}}\, \kappa^{-2}\, \gamma^2 \le \epsilon,$$
where the last inequality holds by the choice of γ and Condition 3.1.
The proof is completed by noticing that (B.3) and $(s_{m^*}, e_{m^*}) \subset (s, e)$ imply that
$$\min\{e - \eta_k,\ \eta_k - s\} > \delta/4 > \epsilon.$$
As discussed in the argument before Step 1, this implies that $\eta_k$ must be an undetected change point.
Below are a number of auxiliary lemmas. Lemma B.1 plays the role of Lemma 2.2 in Venkatraman (1992). Lemma B.3 controls the deviation between the sample and population Kolmogorov–Smirnov statistics. Lemma B.4 is the density version of Lemma 2.4 in Venkatraman (1992). Lemma B.5 plays the role of Lemma 2.6 of Venkatraman (1992). Lemma B.6 is essentially Lemma 17 in Wang, Yu and Rinaldo (2020). Lemma B.8 is Lemma 19 in Wang, Yu and Rinaldo (2020).
Lemma B.1. Under Condition 1.1, the following hold for any pair $(s, e) \subset (1, T)$ containing at least one change point.
(i) Let $b_1 \in \arg\max_{t=s+1,\dots,e-1} \Delta^{t}_{s,e}$. Then $b_1 \in \{\eta_1, \dots, \eta_K\}$.
(ii) Let $z \in \arg\max_{x \in \mathbb{R}} |\Delta^{b_1}_{s,e}(x)|$. If $\Delta^{t}_{s,e}(z) \neq 0$ for some $t \in (s, e)$, then $|\Delta^{t}_{s,e}(z)|$ is either monotonic or decreases and then increases within each interval between consecutive change points.
Note that, since for any cumulative distribution function $F : \mathbb{R} \to [0, 1]$ it holds that $F(-\infty) = 1 - F(\infty) = 0$, such a point $z_0 \in \mathbb{R}$ exists.
Therefore,
$$b_1 \in \arg\max_{b = s+1, \dots, e-1} \bigl| \Delta^{b}_{s,e}(z_0) \bigr|.$$
The set of change points of the time series $\{r_l(z_0)\}_{l=1}^{n_{s:e}}$ is
$$\{n_{s:\eta_k}, \dots, n_{s:\eta_{k+q}}\}.$$
$$= \max_{j \in \{k, \dots, k+q\}} \Delta^{\eta_j}_{s,e}(z_0) \le \max_{j \in \{k, \dots, k+q\}} \Delta^{\eta_j}_{s,e},$$
which is a contradiction.
As for (ii), it follows from applying Lemma 2.2 of Venkatraman (1992) to $\{r_l(z_0)\}_{l=1}^{n_{s:e}}$.
Lemma B.2. Under Condition 1.1, let $t \in (s, e)$. It holds that
$$\Delta^{t}_{s,e} \le 2\, n_{\max}^{1/2} \min\{(t - s + 1)^{1/2},\ (e - t)^{1/2}\}. \tag{B.4}$$
If $(s, e) \subset (1, T)$ contains one and only one change point $\eta_k$, then
$$\Delta^{\eta_k}_{s,e} \le \kappa_k\, n_{\max}^{1/2} \min\{(\eta_k - s + 1)^{1/2},\ (e - \eta_k)^{1/2}\}. \tag{B.5}$$
If $(s, e) \subset (1, T)$ contains two and only two change points $\eta_k$ and $\eta_{k+1}$, then we have
$$\Delta^{\eta_k}_{s,e} \le \kappa_k\, n_{\max}^{1/2} \min\{(\eta_k - s + 1)^{1/2},\ (e - \eta_k)^{1/2}\}.$$
Lemma B.3. For $1 \le s < b < e \le T$ and $z \in \mathbb{R}$, let $\Lambda^{b}_{s,e}(z) = D^{b}_{s,e}(z) - \Delta^{b}_{s,e}(z)$, where $D^{b}_{s,e}(z)$ and $\Delta^{b}_{s,e}(z)$ are the sample and population versions of the Kolmogorov–Smirnov statistic defined in Definition 2.1 and (B.1), respectively. It holds that
$$\mathrm{pr}\Bigl( \max_{1 \le s < b < e \le T}\ \sup_{z \in \mathbb{R}} \bigl| \Lambda^{b}_{s,e}(z) \bigr| > \Bigl( \log\frac{T^4}{12\delta} + \log(n_{1:T}) \Bigr)^{1/2} + 6 \log^{1/2}(n_{1:T}) + \frac{48 \log(n_{1:T})}{n_{1:T}^{1/2}} \Bigr) \le \frac{12 \log(n_{1:T})}{T^3\, n_{1:T}} + \frac{24\, T}{n_{1:T} \log(n_{1:T})\, \delta}.$$
Moreover,
$$\mathrm{pr}\Bigl( \max_{1 \le s < e \le T}\ \sup_{z \in \mathbb{R}} \Bigl| n_{s:e}^{-1/2} \sum_{t=s}^{e} \sum_{i=1}^{n_t} \bigl( \mathbf{1}\{Y_{t,i} \le z\} - E(\mathbf{1}\{Y_{t,i} \le z\}) \bigr) \Bigr| > \Bigl( \log\frac{T^4}{12\delta} + \log(n_{1:T}) \Bigr)^{1/2} + 6 \log^{1/2}(n_{1:T}) + \frac{48 \log(n_{1:T})}{n_{1:T}^{1/2}} \Bigr) \le \frac{12 \log(n_{1:T})}{T^3\, n_{1:T}} + \frac{24\, T}{n_{1:T} \log(n_{1:T})\, \delta}. \tag{B.7}$$
Remark B.1. Lemma B.3 shows that, as T diverges,
$$\max_{1 \le s < b < e \le T}\ \sup_{z \in \mathbb{R}} \bigl| \Lambda^{b}_{s,e}(z) \bigr| = O_p\bigl( \log^{1/2}(n_{1:T}) \bigr).$$
Observe that
$$\Bigl( \frac{n_{s:b}\, n_{(b+1):e}}{n_{s:e}} \Bigr)^{1/2} \bigl( F_{s:b}(z) - F_{(b+1):e}(z) \bigr) = \sum_{k=s}^{b} \sum_{i=1}^{n_k} \Bigl( \frac{n_{(b+1):e}}{n_{s:b}\, n_{s:e}} \Bigr)^{1/2} \mathbf{1}\{Y_{k,i} \le z\} - \sum_{k=b+1}^{e} \sum_{i=1}^{n_k} \Bigl( \frac{n_{s:b}}{n_{(b+1):e}\, n_{s:e}} \Bigr)^{1/2} \mathbf{1}\{Y_{k,i} \le z\} = \sum_{k=s}^{e} \sum_{i=1}^{n_k} w_k\, \mathbf{1}\{Y_{k,i} \le z\},$$
where
$$w_k = \begin{cases} \Bigl( \dfrac{n_{(b+1):e}}{n_{s:b}\, n_{s:e}} \Bigr)^{1/2}, & k = s, \dots, b;\\[6pt] -\Bigl( \dfrac{n_{s:b}}{n_{(b+1):e}\, n_{s:e}} \Bigr)^{1/2}, & k = b+1, \dots, e. \end{cases} \tag{B.8}$$
Therefore, we have

( n_{s:b} n_{(b+1):e} / n_{s:e} )^{1/2} { F_{s:b}(z) − F_{(b+1):e}(z) } = Σ_{k=s}^{e} Σ_{i=1}^{n_k} w_k E( 1_{{Y_{k,i} ≤ z}} ),

D^b_{s,e} = sup_{z∈ℝ} | Σ_{k=s}^{e} Σ_{i=1}^{n_k} w_k 1_{{Y_{k,i} ≤ z}} |   and   Δ^b_{s,e} = sup_{z∈ℝ} | Σ_{k=s}^{e} Σ_{i=1}^{n_k} w_k E( 1_{{Y_{k,i} ≤ z}} ) |.
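This weighted-sum representation is convenient computationally. The sketch below (our own illustrative helpers, not code from the paper) evaluates D^b_{s,e} both from the weights w_k of (B.8) and from the scaled difference of the two empirical distribution functions; by the display above, the two agree.

```python
import numpy as np

def cusum_ks(samples, b):
    """D^b_{s,e} via the weights w_k of (B.8).

    samples: list of 1-D arrays; samples[k] holds the n_k observations
             at the k-th time point of the interval.
    b:       0-based index of the last time point of the left segment.
    """
    n_left = sum(len(x) for x in samples[:b + 1])
    n_right = sum(len(x) for x in samples[b + 1:])
    n_all = n_left + n_right
    y = np.concatenate(samples)
    w = np.concatenate([
        np.full(n_left, np.sqrt(n_right / (n_left * n_all))),
        np.full(n_right, -np.sqrt(n_left / (n_right * n_all))),
    ])
    # the supremum over z is attained at a sample point, since the weighted
    # sum is a step function that vanishes at both tails
    return max(abs(w[y <= z].sum()) for z in y)

def cusum_ks_direct(samples, b):
    """Equivalent form: (n_left * n_right / n_all)^(1/2) * sup_z |F1(z) - F2(z)|."""
    left = np.concatenate(samples[:b + 1])
    right = np.concatenate(samples[b + 1:])
    scale = np.sqrt(len(left) * len(right) / (len(left) + len(right)))
    grid = np.concatenate([left, right])
    return scale * max(abs((left <= z).mean() - (right <= z).mean()) for z in grid)
```

Both forms take the maximum over the pooled sample points, so they return identical values up to floating-point error.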
Since

D^b_{s,e} = sup_{z∈ℝ} | Σ_{k=s}^{e} Σ_{i=1}^{n_k} w_k { E(1_{{Y_{k,i} ≤ z}}) + 1_{{Y_{k,i} ≤ z}} − E(1_{{Y_{k,i} ≤ z}}) } |
≤ sup_{z∈ℝ} | Σ_{k=s}^{e} Σ_{i=1}^{n_k} w_k E(1_{{Y_{k,i} ≤ z}}) | + sup_{z∈ℝ} | Σ_{k=s}^{e} Σ_{i=1}^{n_k} w_k { 1_{{Y_{k,i} ≤ z}} − E(1_{{Y_{k,i} ≤ z}}) } |
= Δ^b_{s,e} + sup_{z∈ℝ} | Σ_{k=s}^{e} Σ_{i=1}^{n_k} w_k { 1_{{Y_{k,i} ≤ z}} − E(1_{{Y_{k,i} ≤ z}}) } |,

we have

| D^b_{s,e} − Δ^b_{s,e} | ≤ sup_{z∈ℝ} | Σ_{k=s}^{e} Σ_{i=1}^{n_k} w_k { 1_{{Y_{k,i} ≤ z}} − E(1_{{Y_{k,i} ≤ z}}) } |.   (B.9)
where m is a positive integer to be specified. Let I^k_1 = (−∞, s^k_1], I^k_j = (s^k_{j−1}, s^k_j] for j = 2, . . . , m − 1, and I^k_m = (s^k_{m−1}, ∞). With this notation, for any k ∈ {1, . . . , T}, we obtain a partition of ℝ, namely I^k = {I^k_1, . . . , I^k_m}. Let I = ∩_{k=1}^{T} I^k = {I_1, . . . , I_M} be the common refinement of these partitions. Note that there are at most T/δ distinct I^k's, and therefore M ≤ Tm/δ.
Let also z_j be an interior point of I_j for all j ∈ {1, . . . , M}. Then

sup_{z∈ℝ} |Λ^b_{s,e}(z)| ≤ max_{j=1,...,M} |Λ^b_{s,e}(z_j)| + max_{j=1,...,M} sup_{z∈I_j} |Λ^b_{s,e}(z_j) − Λ^b_{s,e}(z)|.   (B.10)
By Hoeffding's inequality and a union bound argument, we have, for any ε > 0,

P( max_{1≤s<b<e≤T} max_{j=1,...,M} |Λ^b_{s,e}(z_j)| > ε ) ≤ (2T^4 m/δ) exp(−2ε^2),   (B.11)

since

Σ_{k=s}^{e} Σ_{i=1}^{n_k} w_k^2 = 1.
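The normalization displayed above is what allows Hoeffding's inequality to be applied with a parameter-free exponent: the weight vector of (B.8) has unit squared norm whatever the segment sizes. A minimal check (the helper name is ours):

```python
import numpy as np

def weight_sq_norm(n_left, n_right):
    # squared norm of the weight vector w of (B.8): n_left entries equal to
    # sqrt(n_right / (n_left * n_all)) and n_right entries equal to
    # -sqrt(n_left / (n_right * n_all))
    n_all = n_left + n_right
    w_left = np.sqrt(n_right / (n_left * n_all))
    w_right = -np.sqrt(n_left / (n_right * n_all))
    return n_left * w_left ** 2 + n_right * w_right ** 2
```

The sum collapses to n_right/n_all + n_left/n_all = 1 for any pair of segment sizes.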
On the other hand, for j ∈ {1, . . . , M}, let z ∈ I_j and, without loss of generality, assume that z_j < z. Let

v_j = |{ (i, k) : k ∈ {1, . . . , T}, i ∈ {1, . . . , n_k} and Y_{k,i} ∈ I^k_{v(j,k)} }|.

Choosing

ε = log^{1/2}( 2T^4 m/δ )   and   m = n_{1:T} / ( 24 log(n_{1:T}) ),

we conclude that the probability in question is at most

12 log(n_{1:T}) / ( T^3 n_{1:T} ) + 24T / ( n_{1:T} log(n_{1:T}) δ ).
As for the result (B.7), we only need to change (B.8) to wk = (ns:e )−1/2 .
Lemma B.4. Under Condition 1.1, let 1 ≤ s < η_k < e ≤ T be any interval satisfying

min{η_k − s, e − η_k} ≥ c_1 δ,

with c_1 > 0. Then we have that

max_{t=s+1,...,e−1} Δ^t_{s,e} ≥ c_1 κ δ n_min / ( 2 (e − s)^{1/2} n_max^{1/2} ) ≥ c_1 κ δ n_min^{3/2} / ( 2 (e − s)^{1/2} n_max ).
Proof. Let

z_0 ∈ arg max_{z∈ℝ} | F_{η_k}(z) − F_{η_k +1}(z) |.

Without loss of generality, assume that F_{η_k}(z_0) > F_{η_k +1}(z_0). For s < t < e, it holds that

Δ^t_{s,e}(z_0) = ( n_{s:t} n_{(t+1):e} / n_{s:e} )^{1/2} { (1/n_{s:t}) Σ_{l=s}^{t} n_l F_l(z_0) − (1/n_{(t+1):e}) Σ_{l=t+1}^{e} n_l F_l(z_0) }
= ( n_{s:e} / ( n_{s:t} n_{(t+1):e} ) )^{1/2} Σ_{l=s}^{t} n_l F̃_l(z_0),

where F̃_l(z_0) = F_l(z_0) − (n_{s:e})^{−1} Σ_{l'=s}^{e} n_{l'} F_{l'}(z_0).
Due to Condition 1.1, it holds that F̃_{η_k}(z_0) > κ/2. Therefore

Σ_{l=s}^{η_k} n_l F̃_l(z_0) ≥ (c_1/2) κ n_min δ.
In the following lemma, the condition (B.16) follows from Lemma B.4, and (B.17) follows from Lemma B.3.
Lemma B.5. Let z_0 ∈ ℝ and (s, e) ⊂ (1, T). Suppose that there exists a true change point η_k ∈ (s, e) such that

min{η_k − s, e − η_k} ≥ c_1 δ,   (B.15)
1184 O. H. Madrid Padilla et al.
and

Δ^{η_k}_{s,e}(z_0) ≥ (c_1/2) κ δ n_min^{3/2} / ( n_max (e − s)^{1/2} ),   (B.16)

where c_1 > 0 is a sufficiently small constant. In addition, assume that

max_{s<t<e} |Δ^t_{s,e}(z_0)| − Δ^{η_k}_{s,e}(z_0) ≤ 3 log(T^4/δ) + 3 log(n_{1:T}) ≤ κ δ^4 n_min^5 / ( (e − s)^{7/2} n_max^{9/2} ).   (B.17)

Then there exists d ∈ (s, e) satisfying

|d − η_k| ≤ c_1 δ n_min^2 / ( 32 n_max^2 ),   (B.18)

and

Δ^{η_k}_{s,e}(z_0) − Δ^d_{s,e}(z_0) > c |d − η_k| δ ( n_min^2 / n_max^2 ) Δ^{η_k}_{s,e}(z_0) (e − s)^{−2},

where c > 0 is a sufficiently small constant.
Proof. Let us assume without loss of generality that d ≥ ηk . Following the
argument of Lemma 2.6 in Venkatraman (1992), it suffices to consider two cases:
(i) ηk+1 > e and (ii) ηk+1 ≤ e.
Case (i): η_{k+1} > e. It holds that

Δ^{η_k}_{s,e}(z_0) = ( N_1 N_2 / (N_1 + N_2) )^{1/2} { F_{η_k}(z_0) − F_{η_k +1}(z_0) }

and

Δ^d_{s,e}(z_0) = N_1 ( (N_2 − N_3) / ( (N_1 + N_3)(N_1 + N_2) ) )^{1/2} { F_{η_k}(z_0) − F_{η_k +1}(z_0) },

where N_1 = n_{s:η_k}, N_2 = n_{(η_k +1):e} and N_3 = n_{(η_k +1):d}. Therefore, due to (B.15), we have

E_l = Δ^{η_k}_{s,e}(z_0) − Δ^d_{s,e}(z_0) = [ 1 − { N_1 (N_2 − N_3) / ( N_2 (N_1 + N_3) ) }^{1/2} ] Δ^{η_k}_{s,e}(z_0)
= N_3 (N_1 + N_2) / [ { N_2 (N_1 + N_3) }^{1/2} ( { N_2 (N_1 + N_3) }^{1/2} + { N_1 (N_2 − N_3) }^{1/2} ) ] Δ^{η_k}_{s,e}(z_0)
≥ c_1 ( n_min^2 / n_max^2 ) |d − η_k| δ Δ^{η_k}_{s,e}(z_0) (e − s)^{−2}.   (B.19)
Case (ii): η_{k+1} ≤ e. Let N_1 = n_{s:η_k}, N_2 = n_{(η_k +1):(η_k +h)} and N_3 = n_{(η_k +h+1):e}, where h = c_1 δ/8. Then

Δ^{η_k}_{s,e}(z_0) = a { (N_1 + N_2 + N_3) / ( N_1 (N_2 + N_3) ) }^{1/2}   and   Δ^{η_k +h}_{s,e}(z_0) = (a + N_2 θ) { (N_1 + N_2 + N_3) / ( N_3 (N_1 + N_2) ) }^{1/2},

where

a = Σ_{l=s}^{η_k} n_l { F_l(z_0) − c_0 },   c_0 = (1/n_{s:e}) Σ_{l=s}^{e} n_l F_l(z_0),

and

θ = ( a { (N_1 + N_2) N_3 }^{1/2} / N_2 ) [ 1 / { N_1 (N_2 + N_3) }^{1/2} − 1 / { (N_1 + N_2) N_3 }^{1/2} + b / ( a (N_1 + N_2 + N_3)^{1/2} ) ],

with b = Δ^{η_k +h}_{s,e}(z_0) − Δ^{η_k}_{s,e}(z_0).
Next, we set l = d − η_k ≤ h/2 and N_4 = n_{(η_k +1):d}. Therefore, as in the proof of Lemma 2.6 in Venkatraman (1992), we have that

E_l = Δ^{η_k}_{s,e}(z_0) − Δ^{η_k +l}_{s,e}(z_0) = E_{1l}(1 + E_{2l}) + E_{3l},   (B.20)

where

E_{1l} = a N_4 (N_2 − N_4) (N_1 + N_2 + N_3)^{1/2} / { N_1 (N_2 + N_3)(N_1 + N_4)(N_2 + N_3 − N_4) }^{1/2} × 1 / [ { (N_1 + N_4)(N_2 + N_3 − N_4) }^{1/2} + { N_1 (N_2 + N_3) }^{1/2} ],

E_{2l} = (N_3 − N_1)(N_3 − N_1 − N_4) / [ { (N_1 + N_4)(N_2 + N_3 − N_4) }^{1/2} + { (N_1 + N_2) N_3 }^{1/2} ] × 1 / [ { N_1 (N_2 + N_3) }^{1/2} + { (N_1 + N_2) N_3 }^{1/2} ],

and

E_{3l} = − ( b N_4 / N_2 ) { (N_1 + N_2) N_3 / ( (N_1 + N_4)(N_2 + N_3 − N_4) ) }^{1/2}.
Since N_2 − N_4 ≥ n_min c_1 δ/16, it holds that

E_{1l} ≥ c_{1l} |d − η_k| δ ( n_min^2 / n_max^2 ) Δ^{η_k}_{s,e}(z_0) (e − s)^{−2},   (B.21)

where c_{1l} > 0 is a sufficiently small constant depending on c_1. As for E_{2l}, due to (B.18), we have

E_{2l} ≥ −1/2.   (B.22)
As for E_{3l}, we have

E_{3l} ≥ − { 3 log(T^4/δ) + 3 log(n_{1:T}) } |d − η_k| ( n_min^2 / n_max^2 ) (e − s) / ( c_1^2 δ^2 )
≥ −c_{3l} { 3 log(T^4/δ) + 3 log(n_{1:T}) } |d − η_k| Δ^{η_k}_{s,e}(z_0) δ (e − s)^{−2} ( n_min^2 / n_max^2 ) × (e − s)^{7/2} n_max^{9/2} / ( n_min^5 κ δ^4 )
≥ −(c_{1l}/2) |d − η_k| δ ( n_min^2 / n_max^2 ) Δ^{η_k}_{s,e}(z_0) (e − s)^{−2},   (B.23)

where the first inequality follows from (B.17), the second inequality from (B.16), and the last from (B.17).
Combining (B.20), (B.21), (B.22) and (B.23), we have

Δ^{η_k}_{s,e}(z_0) − Δ^d_{s,e}(z_0) ≥ c |d − η_k| δ ( n_min^2 / n_max^2 ) Δ^{η_k}_{s,e}(z_0) (e − s)^{−2},   (B.24)

where c > 0 is a sufficiently small constant.
In view of (B.19) and (B.24), we conclude the proof.
Lemma B.6. Suppose that (s, e) ⊂ (1, T) is such that e − s ≤ C_M δ and that

η_{k−1} ≤ s ≤ η_k ≤ . . . ≤ η_{k+q} ≤ e ≤ η_{k+q+1},   q ≥ 0.

Denote κ^{s,e}_max = max{ κ_p : p = k, . . . , k + q }. Then for any p ∈ {k − 1, . . . , k + q}, it holds that

sup_{z∈ℝ} (1/n_{s:e}) | Σ_{t=s}^{e} n_t { F_t(z) − F_{η_p}(z) } | ≤ (C_M + 1) κ^{s,e}_max.
P^d_{s,e}(x) = (1/n_{s:e}) Σ_{i=1}^{n_{s:e}} x_i + ⟨x, ψ^d_{s,e}⟩ ψ^d_{s,e},
Lemma B.7. Suppose that Condition 1.1 holds and consider any interval (s, e) ⊂ (1, T) such that there exists a true change point η_k ∈ (s, e). Let

μ_{s,e} = ( F_s(z_0), . . . , F_s(z_0), . . . , F_e(z_0), . . . , F_e(z_0) ) ∈ ℝ^{n_{s:e}},

where each F_t(z_0) is repeated n_t times, and

Y_{s,e} = ( 1_{{Y_{s,1} ≤ z_0}}, . . . , 1_{{Y_{s,n_s} ≤ z_0}}, . . . , 1_{{Y_{e,1} ≤ z_0}}, . . . , 1_{{Y_{e,n_e} ≤ z_0}} ) ∈ ℝ^{n_{s:e}}.
We have

‖ Y_{s,e} − P^b_{s,e} Y_{s,e} ‖^2 ≤ ‖ Y_{s,e} − P^{η_k}_{s,e} Y_{s,e} ‖^2 ≤ ‖ Y_{s,e} − P^{η_k}_{s,e} μ_{s,e} ‖^2.   (B.25)
where

Ȳ_1 = (1/n_{s:d}) Σ_{t=s}^{d} Σ_{i=1}^{n_t} 1_{{Y_{t,i} ≤ z_0}}   and   Ȳ_2 = (1/n_{(d+1):e}) Σ_{t=d+1}^{e} Σ_{i=1}^{n_t} 1_{{Y_{t,i} ≤ z_0}}.

The second inequality in (B.25) follows from the observation that the sum of squared errors is minimized by the sample mean.
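The minimizing property of segment-wise sample means can be confirmed numerically; the sketch below (an illustrative helper under our own naming, not the paper's code) compares the least-squares piecewise-constant fit with a split after index d against any other pair of constant levels.

```python
import numpy as np

def rss_split(y, d, means=None):
    """Residual sum of squares of a piecewise-constant fit with one split
    after index d. If means is None, the two segment sample means are used,
    which is the least-squares choice; otherwise the two supplied constants
    are used instead."""
    left, right = y[:d + 1], y[d + 1:]
    if means is None:
        means = (left.mean(), right.mean())
    return float(((left - means[0]) ** 2).sum() + ((right - means[1]) ** 2).sum())
```

For any data vector and any split, rss_split(y, d) is never larger than rss_split(y, d, (c1, c2)) for arbitrary constants c1, c2, which is exactly the step used in (B.25).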
Lemma B.8. Let (s, e) ⊂ (1, T) contain two or more change points such that
where

G^t_{s,e}(z) = ( n_{s:t} n_{(t+1):e} / n_{s:e} )^{1/2} { (1/n_{s:t}) Σ_{l=s}^{t} n_l G_l(z) − (1/n_{(t+1):e}) Σ_{l=t+1}^{e} n_l G_l(z) }.
Thus we have

Δ^{η_k}_{s,e} = sup_{z∈ℝ} | Δ^{η_k}_{s,e}(z) − G^{η_k}_{s,e}(z) + G^{η_k}_{s,e}(z) | ≤ sup_{z∈ℝ} | Δ^{η_k}_{s,e}(z) − G^{η_k}_{s,e}(z) | + G^{η_k}_{s,e}
≤ G^{η_k}_{s,e} + n_{s:η_k}^{1/2} κ_k
≤ { n_{s:η_k} n_{(η_{k+1} +1):e} / ( n_{s:η_{k+1}} n_{(η_k +1):e} ) }^{1/2} G^{η_{k+1}}_{s,e} + n_{s:η_k}^{1/2} κ_k
≤ ( c_1 n_max / n_min )^{1/2} Δ^{η_{k+1}}_{s,e} + 2 n_{s:η_k}^{1/2} κ_k.
Let
s,e = max κp : min{ηp − s0 , e0 − ηp } ≥ δ/16 .
κmax
Consider any generic (s, e) ⊂ (s_0, e_0) satisfying

D^b_{s,e} ≥ c_1 κ^{s,e}_max δ^{1/2} n_min^{3/2} / n_max,   (B.26)

max_{s<t<e} sup_{z∈ℝ} Λ^t_{s,e}(z) ≤ γ,   (B.27)

and

max_{1≤s<e≤T} sup_{z∈ℝ} n_{s:e}^{−1/2} | Σ_{t=s}^{e} Σ_{i=1}^{n_t} { 1_{{Y_{t,i} ≤ z}} − F_t(z) } | ≤ γ.   (B.28)

Then there exists a true change point η_k ∈ (s, e) such that

min{e − η_k, η_k − s} ≥ δ/4   and   |η_k − b| ≤ C ( n_max^9 / n_min^{10} ) κ^{−2} γ^2,

where C > 0 is a sufficiently large constant.
Proof. Without loss of generality, assume that Δ^b_{s,e} > 0 and that Δ^t_{s,e} is locally decreasing at b. Observe that there has to be a change point η_k ∈ (s, b), for otherwise Δ^b_{s,e} > 0 implies that Δ^t_{s,e} is decreasing, as a consequence of Lemma B.1. Thus, if s ≤ η_k ≤ b ≤ e, then

Δ^{η_k}_{s,e} ≥ Δ^b_{s,e} ≥ D^b_{s,e} − γ ≥ (c_1 − c_2) κ^{s,e}_max δ^{1/2} n_min^{3/2} / n_max ≥ (c_1/2) κ^{s,e}_max δ^{1/2} n_min^{3/2} / n_max,   (B.30)

where the second inequality follows from (B.27), and the third follows from (B.26) and (B.29). Observe that e − s ≤ e_0 − s_0 ≤ C_M δ and that (s, e) has to contain at least one change point, for otherwise max_{s<t<e} Δ^t_{s,e} = 0, which contradicts (B.30).
Step 1. In this step, we are to show that

min{η_k − s, e − η_k} ≥ min{1, c_1^2} δ/16.   (B.31)

Suppose that η_k is the only change point in (s, e). Then (B.31) must hold, for otherwise it follows from (B.5) in Lemma B.2 that

Δ^{η_k}_{s,e} ≤ (c_1/4) κ_k n_max^{1/2} δ^{1/2},

which contradicts (B.30).
Suppose now that (s, e) contains at least two change points. Then η_k − s < min{1, c_1^2} δ/16 implies that η_k is the leftmost change point in (s, e). Therefore it follows from Lemma B.8 that

Δ^{η_k}_{s,e} ≤ (c_1/4) ( n_max / n_min )^{1/2} Δ^{η_{k+1}}_{s,e} + 2 n_{s:η_k}^{1/2} κ_k
≤ (c_1/4) ( n_max / n_min )^{1/2} max_{s<t<e} Δ^t_{s,e} + (c_1/4) δ^{1/2} n_max^{1/2} κ_k
≤ (c_1/4) ( n_max / n_min )^{1/2} max_{s<t<e} D^t_{s,e} + (c_1/4) ( n_max / n_min )^{1/2} γ + (c_1/4) δ^{1/2} n_max^{1/2} κ_k
≤ max_{s<t<e} D^t_{s,e} − γ,   (B.33)

where the first inequality follows from Lemma B.1, the second follows from (B.32), and the fourth follows from (B.27). Note that (B.33) contradicts (B.30); therefore we have b ∈ (η_k, η_k + c_1 δ n_min^2 n_max^{−2}/32).
Suppose that

η_k + C ( n_max^9 / n_min^{10} ) κ^{−2} γ^2 < b,   (B.34)

where C > 0 is a sufficiently large constant. We are to show that this leads to the bound

‖ Y_{s,e} − P^b_{s,e} Y_{s,e} ‖^2 > ‖ Y_{s,e} − P^{η_k}_{s,e} μ_{s,e} ‖^2,   (B.35)

which is a contradiction.
We have min{η_k − s, e − η_k} ≥ min{1, c_1^2} δ/16 and |b − η_k| ≤ c_1 δ n_min^2 n_max^{−2}/32.
For properly chosen c_1, it holds that

‖ Y_{s,e} − P^b_{s,e} Y_{s,e} ‖^2 − ‖ Y_{s,e} − P^{η_k}_{s,e} μ_{s,e} ‖^2
= ‖ μ_{s,e} − P^b_{s,e} μ_{s,e} ‖^2 − ‖ μ_{s,e} − P^{η_k}_{s,e} μ_{s,e} ‖^2 + 2⟨ Y_{s,e} − μ_{s,e}, P^{η_k}_{s,e} μ_{s,e} − P^b_{s,e} Y_{s,e} ⟩.
We are then to utilize the result of Lemma B.5. Note that z_0 there can be any z_0 ∈ ℝ satisfying the conditions thereof. Equation (B.16) holds due to the fact that here we have

Δ^{η_k}_{s,e}(z_0) ≥ Δ^b_{s,e}(z_0) ≥ D^b_{s,e}(z_0) − γ ≥ c_1 κ^{s,e}_max δ^{1/2} n_min^{3/2}/n_max − c_2 κ^{s,e}_max δ^{1/2} n_min^{3/2}/n_max ≥ (c_1/2) κ^{s,e}_max δ^{1/2} n_min^{3/2}/n_max,   (B.38)

where the first inequality follows from the fact that η_k is a true change point, the second inequality from (B.27), the third inequality follows from (B.26) and (B.29), and the final inequality follows from the condition that 0 < c_2 < c_1/2. It then follows from Lemma B.5 that

Δ^{η_k}_{s,e}(z_0) − Δ^b_{s,e}(z_0) ≥ c |b − η_k| δ ( n_min^2 / n_max^2 ) Δ^{η_k}_{s,e}(z_0) (e − s)^{−2}.   (B.39)
Combining (B.37), (B.38) and (B.39), we have

‖ μ_{s,e} − P^b_{s,e} μ_{s,e} ‖^2 − ‖ μ_{s,e} − P^{η_k}_{s,e} μ_{s,e} ‖^2 ≥ (c c_1^2/4) δ^2 ( n_min^5 / n_max^4 ) κ^2 (e − s)^{−2} |b − η_k|.   (B.40)
The left-hand side of (B.36) can be decomposed as follows:

2⟨ Y_{s,e} − μ_{s,e}, P^b_{s,e} Y_{s,e} − P^{η_k}_{s,e} μ_{s,e} ⟩
= 2⟨ Y_{s,e} − μ_{s,e}, P^b_{s,e} Y_{s,e} − P^b_{s,e} μ_{s,e} ⟩ + 2⟨ Y_{s,e} − μ_{s,e}, P^b_{s,e} μ_{s,e} − P^{η_k}_{s,e} μ_{s,e} ⟩
= (I) + 2 ( Σ_{i=1}^{n_{s:η_k}} + Σ_{i=n_{s:η_k}+1}^{n_{s:b}} + Σ_{i=n_{s:b}+1}^{n_{s:e}} ) ( Y_{s,e} − μ_{s,e} )_i ( P^b_{s,e} μ_{s,e} − P^{η_k}_{s,e} μ_{s,e} )_i,   (B.42)

where the inequality follows from the definition of the CUSUM statistics and (B.27).
Term (II). It holds that

(II.1) = 2 n_{s:η_k}^{1/2} { n_{s:η_k}^{−1/2} Σ_{i=1}^{n_{s:η_k}} ( Y_{s,e} − μ_{s,e} )_i } { (1/n_{s:b}) Σ_{i=1}^{n_{s:b}} ( μ_{s,e} )_i − (1/n_{s:η_k}) Σ_{i=1}^{n_{s:η_k}} ( μ_{s,e} )_i }
≤ 2 ( n_max^{3/2} / n_min ) ( 4 / min{1, c_1^2} ) δ^{−1/2} γ |b − η_k| (C_M + 1) κ^{s,e}_max.   (B.43)

Moreover,

(II.2) ≤ 2 n_max^{1/2} |b − η_k|^{1/2} γ (2 C_M + 3) κ^{s,e}_max.   (B.44)

The second inequality holds due to Condition 3.1, the third inequality holds due to (B.34), and the first inequality is a consequence of the third inequality and Condition 3.1.
Proof of Lemma 3.1. Let P_0 denote the joint distribution of the independent random variables {Y_{t,i}}_{i=1,t=1}^{n,T} such that Y_{1,1}, . . . , Y_{δ,n} are independent and identically distributed as δ_0 and Y_{δ+1,1}, . . . , Y_{T,n} are independent and identically distributed as δ_1, where δ_c, c ∈ ℝ, denotes the Dirac distribution with point mass at c.
Let P_1 denote the joint distribution of the independent random variables {Z_{t,i}}_{i=1,t=1}^{n,T} such that Z_{1,1}, . . . , Z_{T−δ,n} are independent and identically distributed as δ_1 and Z_{T−δ+1,1}, . . . , Z_{T,n} are independent and identically distributed as δ_0.
It holds that η(P_0) = δ and η(P_1) = T − δ. Since δ ≤ T/3, it holds that

inf_{η̂} sup_{P∈P} E_P |η̂ − η| ≥ (T/3) { 1 − d_TV(P_0, P_1) } ≥ (T/3) ( 1 − 2δ n^{−1/2} ) ≥ (1 − 1/2) T/3,

where d_TV(·, ·) is the total variation distance. In the last display, the first inequality follows from Le Cam's lemma (see, e.g., Yu, 1997), and the second inequality follows from Eq. (1.2) in Steerneman (1983).
Proof of Lemma 3.2. Let P_0 denote the joint distribution of the independent random variables {Y_{t,i}}_{i=1,t=1}^{n,T} such that Y_{1,1}, . . . , Y_{δ,n} are independent and identically distributed as F and Y_{δ+1,1}, . . . , Y_{T,n} are independent and identically distributed as G.
and

G(x) = 0 for x ≤ 0;   G(x) = (1 − 2κ)x for 0 < x ≤ 1/2;   G(x) = (1/2 − κ) + (1 + 2κ)(x − 1/2) for 1/2 < x ≤ 1;   G(x) = 1 for x ≥ 1.

It holds that

sup_{z∈ℝ} |F(z) − G(z)| = κ,
η(P_0) = δ and η(P_1) = δ + ξ. By Le Cam's lemma (e.g., Yu, 1997) and Lemma 2.6 in Tsybakov (2009), it holds that

inf_{η̂} sup_{P∈Q} E_P |η̂ − η| ≥ ξ { 1 − d_TV(P_0, P_1) } ≥ (ξ/2) exp{ −KL(P_0, P_1) },   (C.1)

where KL(·, ·) denotes the Kullback–Leibler divergence. Since

KL(P_0, P_1) = Σ_{i∈{δ+1,...,δ+ξ}} KL(P_0^i, P_1^i) = −(nξ/2) log(1 − 4κ^2) ≤ 2nξκ^2,

we have

inf_{η̂} sup_{P∈Q} E_P |η̂ − η| ≥ (ξ/2) exp(−2nξκ^2).

Set ξ = min{ ⌈1/(nκ^2)⌉, T − 1 − δ }. By the assumption on ζ_T, for all T large enough we must have that ξ = ⌈1/(nκ^2)⌉. Thus, for all T large enough, using (C.1),

inf_{η̂} sup_{P∈Q} E_P |η̂ − η| ≥ (1/2) max{ 1, ⌊1/(nκ^2)⌋ } e^{−2}.
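Two computational facts used in this proof can be checked numerically: that sup_z |F(z) − G(z)| = κ for the piecewise-linear G above, and that the per-observation Kullback–Leibler divergence equals −(1/2) log(1 − 4κ²). The sketch below assumes F is the uniform distribution function on [0, 1] (its definition precedes this excerpt) and takes F as the reference measure in the KL integral; all helper names are ours.

```python
import math

def G(x, kappa):
    # the distribution function G from the proof of Lemma 3.2
    if x <= 0:
        return 0.0
    if x <= 0.5:
        return (1 - 2 * kappa) * x
    if x <= 1:
        return (0.5 - kappa) + (1 + 2 * kappa) * (x - 0.5)
    return 1.0

def sup_gap(kappa, grid=100000):
    # sup_z |F(z) - G(z)| with F(x) = x on [0, 1]; the gap is piecewise
    # linear, so a grid search over [0, 1] including x = 1/2 is exact
    return max(abs(i / grid - G(i / grid, kappa)) for i in range(grid + 1))

def kl_f_g(kappa, grid=200000):
    # midpoint approximation of KL(F, G) = integral of log(f/g) on (0, 1);
    # f = 1, g = 1 - 2*kappa on (0, 1/2] and g = 1 + 2*kappa on (1/2, 1)
    h = 1.0 / grid
    total = 0.0
    for i in range(grid):
        x = (i + 0.5) * h
        g = (1 - 2 * kappa) if x <= 0.5 else (1 + 2 * kappa)
        total += math.log(1.0 / g) * h
    return total
```

Since the density ratio is piecewise constant, the midpoint sum recovers −(1/2) log(1 − 4κ²) to floating-point accuracy, and the supremum of |F − G| is attained at x = 1/2 with value κ.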
Proof. It follows from Theorem 3.1 and the proof thereof that, applying Algorithm 1 to {W_{t,i}} and the τ sequence defined in (3.7), with probability at least

1 − 24 log(n_{1:T}) / ( T^3 n_{1:T} ) − 48T / ( n_{1:T} log(n_{1:T}) δ ) − exp{ log(T/δ) − Mδ^2/(16T^2) },

the event A, defined as follows, holds.
A3. If τ < c_{τ,1} log^{1/2}(n_{1:T}), then the corresponding change point estimators satisfy K̂ > K, and for any true change point η_k, there exists an estimator η̂ such that
Step 1. Let η̂_0 = 0 and η̂_{K+1} = T. In this step, it suffices to show that, for any k ∈ {0, . . . , K − 1}, with large probability,

Σ_{l=k}^{k+1} Σ_{t=η̂_l +1}^{η̂_{l+1}} Σ_{i=1}^{n_t} { 1_{{Y_{t,i} ≤ ẑ}} − F̂^Y_{(η̂_l +1):η̂_{l+1}}(ẑ) }^2 + λ < Σ_{t=η̂_k +1}^{η̂_{k+2}} Σ_{i=1}^{n_t} { 1_{{Y_{t,i} ≤ ẑ}} − F̂^Y_{(η̂_k +1):η̂_{k+2}}(ẑ) }^2.   (D.1)
it holds that

Σ_{t=1}^{η̂_2} Σ_{i=1}^{n_t} { 1_{{Y_{t,i} ≤ ẑ}} − F̂^Y_{1:η̂_2}(ẑ) }^2 − Σ_{l=0}^{1} Σ_{t=η̂_l +1}^{η̂_{l+1}} Σ_{i=1}^{n_t} { 1_{{Y_{t,i} ≤ ẑ}} − F̂^Y_{(η̂_l +1):η̂_{l+1}}(ẑ) }^2
= { D^{η̂_1}_{1,η̂_2}({Y_{t,i}}) }^2 ≥ c_{τ,2}^2 κ^2 δ n_min^3 / n_max^2,   (D.2)

where the last inequality follows from the proof of Theorem 3.1.
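The quantity D^{η̂_1}_{1,η̂_2}({Y_{t,i}}) enters (D.2) through the standard decomposition of residual sums of squares: splitting a segment at b reduces the residual sum of squares by exactly the squared CUSUM statistic. A minimal numeric check of that identity on a generic vector (helper names are ours):

```python
import numpy as np

def sse(v):
    # sum of squared deviations from the segment sample mean
    return float(((v - v.mean()) ** 2).sum())

def cusum(y, b):
    # sqrt(n_L * n_R / n) * |mean(left) - mean(right)|, split after index b
    left, right = y[:b + 1], y[b + 1:]
    n_l, n_r = len(left), len(right)
    return float(np.sqrt(n_l * n_r / (n_l + n_r)) * abs(left.mean() - right.mean()))
```

For every split b, sse(y) − sse(y[:b+1]) − sse(y[b+1:]) equals cusum(y, b)**2; in (D.2) the same identity is applied to the binary vector of indicators 1{Y_{t,i} ≤ ẑ}.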
Step 2. In this step, we are to show that, with large probability, Algorithm 3 will not over-select. For simplicity, assume B_2 = {η̂_1} and B_1 = {η̂, η̂_1} with 0 < η̂ < η̂_1. Let ẑ be the one defined in Algorithm 3 using the triplet {0, η̂, η̂_1}. Since

Σ_{t=1}^{η̂_1} Σ_{i=1}^{n_t} { 1_{{Y_{t,i} ≤ ẑ}} − F̂^Y_{1:η̂_1}(ẑ) }^2 − Σ_{t=1}^{η̂} Σ_{i=1}^{n_t} { 1_{{Y_{t,i} ≤ ẑ}} − F̂^Y_{1:η̂}(ẑ) }^2 − Σ_{t=η̂+1}^{η̂_1} Σ_{i=1}^{n_t} { 1_{{Y_{t,i} ≤ ẑ}} − F̂^Y_{(η̂+1):η̂_1}(ẑ) }^2
= { D^{η̂}_{0,η̂_1}({Y_{t,i}}) }^2 ≤ c_{τ,1}^2 log(n_{1:T}),
Fig 6. Sensitivity of the tuning parameters in Algorithms 1 and 3. The left panel shows the median of the estimated number of change points by Algorithms 1 and 3 across 50 Monte Carlo simulations based on Scenario 2 in Section 4. The right panel shows the corresponding plot for Scenario 3 in Section 4.
References
Pein, F., Hotz, T., Sieling, H. and Aspelmeier, T. (2019). stepR: Multiscale change-point inference. R package version 2.0-4.
Preuss, P., Puchstein, R. and Dette, H. (2015). Detection of multiple
structural breaks in multivariate time series. Journal of the American Statis-
tical Association 110 654–668. MR3367255
Reinhart, A., Athey, A. and Biegalski, S. (2014). Spatially-aware temporal
anomaly mapping of gamma spectra. IEEE Transactions on Nuclear Science
61 1284–1289.
Rigaill, G. (2010). Pruned dynamic programming for optimal multiple change-point detection. arXiv preprint arXiv:1004.0887.
Rizzo, M. L. and Székely, G. J. (2010). Disco analysis: A nonparametric
extension of analysis of variance. The Annals of Applied Statistics 4 1034–
1055. MR2758432
Russell, B. and Rambaccussing, D. (2018). Breaks and the statistical process of inflation: the case of estimating the 'modern' long-run Phillips curve. Empirical Economics 1–21.
R Core Team (2019). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Steerneman, T. (1983). On the total variation and Hellinger distance be-
tween signed measures; an application to product measures. Proceedings of
the American Mathematical Society 88 684–688. MR0702299
Tsybakov, A. (2009). Introduction to Nonparametric Estimation. Springer.
MR2724359
Vanegas, L. J., Behr, M. and Munk, A. (2019). Multiscale quantile regres-
sion. arXiv preprint 1902.09321.
Venkatraman, E. S. (1992). Consistency results in multiple change-point
problems, PhD thesis, Stanford University. MR2687536
Wald, A. (1945). Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics 16 117–186. MR0013275
Wang, T. and Samworth, R. J. (2018). High-dimensional changepoint esti-
mation via sparse projection. Journal of the Royal Statistical Society: Series
B (Statistical Methodology) 80 57–83. MR3744712
Wang, D., Yu, Y. and Rinaldo, A. (2018). Optimal change point detec-
tion and localization in sparse dynamic networks. arXiv preprint 1809.09602,
Annals of Statistics, to appear. MR4206675
Wang, D., Yu, Y. and Rinaldo, A. (2020). Univariate mean change point
detection: Penalization, cusum and optimality. Electronic Journal of Statistics
14 1917–1961. MR4091859
Wang, D., Yu, Y. and Rinaldo, A. (2021). Optimal Covariance Change Point
Detection in High Dimension. Bernoulli 27 554–575. MR4177380
Yao, Y. C. (1988). Estimating the number of change-points via Schwarz’ cri-
terion. Statistics & Probability Letters 6 181–189. MR0919373
Yao, Y.-C. and Au, S.-T. (1989). Least-squares estimation of a step function. Sankhyā: The Indian Journal of Statistics, Series A 370–381. MR1175613
Yao, Y. C. and Davis, R. A. (1986). The asymptotic behavior of the likelihood ratio statistic for testing a shift in mean in a sequence of independent normal variates.