
Electronic Journal of Statistics

Vol. 15 (2021) 1154–1201


ISSN: 1935-7524
https://doi.org/10.1214/21-EJS1809

Optimal nonparametric change point analysis
Oscar Hernan Madrid Padilla

Department of Statistics, University of California, Los Angeles


Los Angeles, California 90095-1554
e-mail: oscar.madrid@stat.ucla.edu

Yi Yu

Department of Statistics, University of Warwick


Coventry CV4 7AL, U.K.
e-mail: yi.yu.2@warwick.ac.uk

Daren Wang

Department of ACMS, University of Notre Dame


Notre Dame, IN 46556 USA
e-mail: ddwang24@nd.edu

Alessandro Rinaldo

Department of Statistics & Data Science, Carnegie Mellon University


Pittsburgh, Pennsylvania 15213, U.S.A.
e-mail: arinaldo@cmu.edu
Abstract: We study change point detection and localization for univariate
data in fully nonparametric settings, in which at each time point, we ac-
quire an independent and identically distributed sample from an unknown
distribution that is piecewise constant. The magnitude of the distributional
changes at the change points is quantified using the Kolmogorov–Smirnov
distance. Our framework allows all the relevant parameters, namely the
minimal spacing between two consecutive change points, the minimal mag-
nitude of the changes in the Kolmogorov–Smirnov distance, and the number
of sample points collected at each time point, to change with the length of
the time series. We propose a novel change point detection algorithm based
on the Kolmogorov–Smirnov statistic and show that it is nearly minimax
rate optimal. Our theory demonstrates a phase transition in the space of
model parameters. The phase transition separates parameter combinations
for which consistent localization is possible from the ones for which this
task is statistically infeasible. We provide extensive numerical experiments
to support our theory.

MSC2020 subject classifications: Primary 62G05.


Keywords and phrases: Nonparametric, Kolmogorov–Smirnov statistic,
CUSUM, minimax optimality, phase transition.

Received June 2020.



Contents

1 Introduction
  1.1 Problem formulation
  1.2 Summary of results
2 Methodology
3 Theory
  3.1 The consistency of the Kolmogorov–Smirnov detector algorithm
  3.2 Phase transition and minimax optimality
  3.3 Choice of tuning parameters
4 Numerical experiments
  4.1 Simulations
  4.2 Real data analysis
A Comparisons
B Proof of Theorem 3.1
C Proofs of Section 3.2
D Proof of Theorem 3.2
E Sensitivity simulations
References

1. Introduction

Change point analysis is a well-established topic in statistics that aims to de-


tect and localize abrupt changes in the data generating distributions in time
series data. Initiated during World War II (e.g. Wald, 1945), the field of change
point analysis has since produced a large body of literature as well as a host
of methods for statistical inference. These techniques are now widely used to
address important real-life problems in a wide range of disciplines, including
biology (e.g. Fan et al., 2015; Jewell et al., 2018), speech recognition (e.g. Fox
et al., 2011; Chowdhury, Selouani and O’Shaughnessy, 2012), social sciences
(e.g. Liu et al., 2013), climatology (e.g. Itoh and Kurths, 2010) and finance (e.g.
Preuss, Puchstein and Dette, 2015; Russell and Rambaccussing, 2018).
The theoretical understanding of the statistical challenges associated with
change point problems has also progressed considerably. The initial ground-
breaking results of Yao (1988), Yao and Au (1989) and Yao and Davis (1986)
that studied change point detection for a univariate piecewise-constant signal,
have now been extended in several ways. For instance, Fryzlewicz (2014), Frick,
Munk and Sieling (2014) and Kovács et al. (2020), among others, proposed
computationally-efficient methods to deal with situations of potentially mul-
tiple mean change points (e.g. Wang, Yu and Rinaldo, 2020). More recently,
Pein, Sieling and Munk (2017) constructed a method that can handle mean and
variance changes simultaneously. Chan, Yau and Zhang (2014), Cho (2016), Cho
and Fryzlewicz (2015) and Wang and Samworth (2018) studied high-dimensional
mean change point detection problems. A different line of work, including efforts
by Aue et al. (2009), Avanesov and Buzun (2016) and Wang, Yu and Rinaldo

(2021), has investigated scenarios where covariance matrices change. Cribben


and Yu (2017), Liu et al. (2018) and Wang, Yu and Rinaldo (2018), among
others, inspected dynamic network change point detection problems.
Most of the existing theoretical frameworks for statistical analysis of change
point problems, however, rely heavily on strong modelling assumptions of para-
metric nature. Such approaches may be inadequate to capture the inherent
complexity of modern, high-dimensional datasets. Indeed, the statistical litera-
ture on nonparametric change point analysis is surprisingly limited compared to
its parametric counterpart. Among the nonparametric results, Carlstein (1988)
considered the scenario where there is at most one change point. Harchaoui and
Cappé (2007) utilized a kernel-based algorithm. Hawkins and Deng (2010) pro-
posed a Mann–Whitney-type statistic to conduct online change point detection.
Matteson and James (2014) established the consistency of change point estima-
tors based on statistics originally introduced in Rizzo and Székely (2010). Zou
et al. (2014) developed a nonparametric multiple change point detection method
with consistency guarantees. Li et al. (2019) proposed two computationally-
efficient kernel-based statistics for change point detection, which are inspired by
the B-statistics. Padilla et al. (2018) proposed an algorithm for nonparametric
change point detection based on the Kolmogorov–Smirnov statistic. Fearnhead
and Rigaill (2018) devised a mean change point detection method which is
robust to outliers. Vanegas, Behr and Munk (2019) constructed a multiscale
method for detecting changes in a fixed quantile of the distributions. Arlot,
Celisse and Harchaoui (2019) considered a kernel version of the cumulative sum
(CUSUM) statistic and an $\ell_0$-type optimization procedure which can be solved
by dynamic programming. Arlot, Celisse and Harchaoui (2019) also derived an
oracle inequality for their estimator, though without guarantees on change point
localization.
In this paper we advance both the theory and methodology of nonparametric
change point analysis by presenting a computationally-efficient procedure for
univariate change point localization. The resulting method is proven to be con-
sistent and in fact nearly minimax rate optimal for estimating the change points.
Our analysis builds upon various recent contributions in the literature on para-
metric and high-dimensional change point analysis but allows for a fully non-
parametric change point model. To the best of our knowledge, Zou et al. (2014)
is one of very few examples yielding a procedure for nonparametric change point
detection with provable guarantees on localization. A detailed comparison be-
tween our results and the ones of Zou et al. (2014) will be given in Appendix A,
which also includes detailed comparisons of our work with Pein, Sieling and
Munk (2017), Vanegas, Behr and Munk (2019) and Garreau and Arlot (2018).

1.1. Problem formulation

In this section we describe the change point model that we will consider. Our
notation and settings are fairly standard, with one crucial difference from most
of the contributions in the field: the changes in the underlying distribution at

the change points are not parametrically specified, but are instead quantified
through a nonparametric measure of distance between distributions. This fea-
ture renders our methods and analysis applicable to a wide range of change
point problems.
Condition 1.1 (Model). Let {Yt,i , t = 1, . . . , T, i = 1, . . . , nt } ⊂ R be a collec-
tion of independent random variables such that Yt,i ∼ Ft , where F1 , . . . , FT are
cumulative distribution functions.
Let $\{\eta_k\}_{k=0}^{K+1} \subset \{0, 1, \ldots, T\}$ be a collection of unknown change points with $1 = \eta_0 < \eta_1 < \ldots < \eta_K \le T < \eta_{K+1} = T + 1$ such that $F_t \neq F_{t-1}$ if and only if $t \in \{\eta_1, \ldots, \eta_K\}$.
The minimal spacing $\delta$ and the jump size $\kappa$ are defined respectively as $\min_{k=1,\ldots,K+1}\{\eta_k - \eta_{k-1}\} = \delta > 0$ and
$$\min_{k=1,\ldots,K} \sup_{z \in \mathbb{R}} \bigl|F_{\eta_k}(z) - F_{\eta_k - 1}(z)\bigr| = \min_{k=1,\ldots,K} \kappa_k = \kappa > 0. \qquad (1.1)$$
Furthermore, we set $n_{\min} = \min_{t=1,\ldots,T} n_t$ and $n_{\max} = \max_{t=1,\ldots,T} n_t$, and assume that $n_{\max} \asymp n_{\min} \asymp n$ for some $n$.
Throughout this paper, for any positive sequences $a_T$ and $b_T$, we denote $a_T \gtrsim b_T$ if there exist $T_0 \in \mathbb{N}$ and an absolute constant $C > 0$ such that for any $T \ge T_0$ it holds that $a_T \ge C b_T$. We let $a_T \lesssim b_T$ if $b_T \gtrsim a_T$, and $a_T \asymp b_T$ if $a_T \gtrsim b_T$ and $b_T \gtrsim a_T$.
We quantify the magnitude of the distributional changes between distribution
functions at two consecutive change points using the Kolmogorov–Smirnov dis-
tance, a natural and widely-used metric for univariate probability distributions.
Though stronger than weak convergence, convergence in the Kolmogorov–Smirnov distance is weaker than many other forms of convergence. Examples of those stronger forms of convergence include the ones induced by the total variation distance and, provided that the distributions admit bounded Lebesgue densities, by the $L_1$-Wasserstein distance. Nonetheless, reliance on the Kolmogorov–Smirnov distance offers significant advantages: (1) it allows for fully nonparametric change point settings and procedures that hold with minimal assumptions on the underlying distribution functions, and (2) it is amenable to statistical analysis. Although the Kolmogorov–Smirnov statistic is known to
exhibit low power in goodness-of-fit testing problems (e.g. Duembgen and Well-
ner, 2014), this issue does not affect our results: as shown below in Lemma 3.2,
our Kolmogorov–Smirnov-based change point localization procedure is nearly
minimax rate optimal. Our numerical experiments further confirm that our pro-
cedure compares favourably with other nonparametric methods.
In Condition 1.1, we allow for multiple observations nt to be collected at
each time t. This generalizes the classical change point detection framework
where nt = 1 for all t (e.g. Zou et al., 2014). This flexibility is inspired by
the recent interest in anomaly detection problems where multiple observations
can be measured at a fixed time (e.g. Chan et al., 2014; Reinhart, Athey and
Biegalski, 2014; Padilla et al., 2018). Our results remain valid even if nt = 1 for
all t.

The nonparametric change point model defined above in Condition 1.1 is


specified by a few key parameters: the minimal spacing between two consecutive
change points δ, the minimal jump size in terms of the Kolmogorov–Smirnov
distance κ, and the number nt of data points acquired at each time t. We adopt a
high-dimensional framework in which all these quantities are allowed to change
as functions of the total number of time points T .
We consider the change point localization problem of establishing consistent change point estimators $\{\hat\eta_k\}_{k=1}^{\widehat K}$ of the true change points. These are measurable functions of the data and return an increasing sequence of random time points $\hat\eta_1 < \ldots < \hat\eta_{\widehat K}$, such that, as $T \to \infty$, the following event holds with probability tending to 1: $\widehat K = K$ and $\max_{k=1,\ldots,K} |\hat\eta_k - \eta_k| \le \epsilon$, where $\epsilon = \epsilon(T)$ is such that $\lim_{T\to\infty} \epsilon/\delta = 0$.
Throughout the rest of the paper, we refer to the quantity $\epsilon/\delta$ as the localization rate. Our goal is to obtain change point estimators that, under the weakest possible conditions, are guaranteed to yield the smallest possible $\epsilon$.

1.2. Summary of results

We will show that under Condition 1.1, the hardness of the change point local-
ization task is fully captured by the quantity

$$\kappa\, \delta^{1/2} n^{1/2}, \qquad (1.2)$$

which can be regarded as a signal-to-noise ratio of sorts. We list our contributions


as follows.
• We demonstrate the existence of a phase transition for the localization
task in terms of the signal-to-noise ratio $\kappa\delta^{1/2}n^{1/2}$. We show that in the low signal-to-noise ratio regime $\kappa\delta^{1/2}n^{1/2} \lesssim 1$, no algorithm is guaranteed to be consistent. We also show that if $\kappa\delta^{1/2}n^{1/2} \ge \zeta_T$, where $\zeta_T$ is any diverging sequence as $T \to \infty$, then a minimax lower bound of the localization error rate is of order $(n\kappa^2 T)^{-1}$.
• We develop a novel detection procedure, based on the Kolmogorov–Smirnov
distance, given in Algorithm 1. We show that under suitable conditions,
our method is consistent and nearly minimax rate optimal, up to logarithmic factors, in terms of both the required signal-to-noise ratio and the localization error rate. The specific assumption is given in Condition 3.1, and the localization error rate
in Theorem 3.1. Interestingly, for the lower bounds on the signal-to-noise
ratio and the localization error rate, our rates match those derived for the
univariate mean change point localization problem under sub-Gaussian
noise, e.g. Wang, Yu and Rinaldo (2020).
• We provide extensive comparisons of our algorithm and theoretical guar-
antees with several competing methods and results from the literature.
See Appendix A and Section 4, respectively. In particular, our simulations
indicate that our procedure performs well across a variety of scenarios, in
terms of both estimating the number of change points and their locations.

We point out that, although in deriving the theoretical guarantees for our
methodologies we follow techniques proposed in existing work, namely Venka-
traman (1992) and Fryzlewicz (2014), our results deliver improvements in two
aspects. Firstly, the extension to nonparametric settings, in which the mag-
nitude of the distributional changes is measured by the Kolmogorov–Smirnov
distance, requires novel and nontrivial arguments, especially to quantify the or-
der of the stochastic fluctuations of the associated CUSUM statistics. Secondly,
the arguments used in Fryzlewicz (2014) for the theoretical analysis of the per-
formance of the WBS algorithm have to be sharpened in order to allow for all
the model parameters to vary as the sample size diverges and in order to yield
optimal localization rates.

2. Methodology

In this section, we detail our Kolmogorov–Smirnov detector procedure, which


is based on the CUSUM Kolmogorov–Smirnov statistic defined next. Similar or
related procedures based on Kolmogorov–Smirnov statistics have been considered
previously in Darkhovski (1994), Boukai and Zhou (1997) and Padilla et al.
(2018), among others.
Definition 2.1 (The CUSUM Kolmogorov–Smirnov statistic). For any integer
triplet (s, t, e), 1 ≤ s < t < e ≤ T , define the CUSUM Kolmogorov–Smirnov
statistic as
$$D^t_{s,e} = \sup_{z \in \mathbb{R}} \bigl| D^t_{s,e}(z) \bigr|, \qquad (2.1)$$
with
$$D^t_{s,e}(z) = \left(\frac{n_{s:t}\, n_{(t+1):e}}{n_{s:e}}\right)^{1/2} \bigl(\widehat F_{s:t}(z) - \widehat F_{(t+1):e}(z)\bigr),$$
where for any integer pair $(s, e)$ with $s < e$ and $z \in \mathbb{R}$, we write
$$\widehat F_{s:e}(z) = \frac{1}{n_{s:e}} \sum_{t=s}^{e} \sum_{i=1}^{n_t} \mathbf{1}\{Y_{t,i} \le z\} \quad \text{and} \quad n_{s:e} = \sum_{t=s}^{e} n_t.$$

In Definition 2.1, ns:e is the total number of observations collected in the


integer interval $[s, e]$, and $\widehat F_{s:e}$ is the empirical cumulative distribution function
estimated using the data collected in [s, e].
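For concreteness, the following R sketch computes the statistic in (2.1) for data stored as a list of per-time-point observation vectors. The function name and data layout are illustrative choices of ours and are not part of the paper's released code.

```r
# Sketch: CUSUM Kolmogorov-Smirnov statistic D^t_{s,e} of Definition 2.1.
# `y` is a list of length T; y[[t]] is the numeric vector of the n_t
# observations collected at time t. Indices satisfy 1 <= s < t < e <= T.
cusum_ks <- function(y, s, t, e) {
  left  <- unlist(y[s:t])        # observations in [s, t]
  right <- unlist(y[(t + 1):e])  # observations in (t, e]
  n_left  <- length(left)
  n_right <- length(right)
  scale <- sqrt(n_left * n_right / (n_left + n_right))
  # The supremum over z only needs to be evaluated at the sample points.
  z_grid <- c(left, right)
  gaps <- vapply(z_grid, function(z) {
    abs(mean(left <= z) - mean(right <= z))
  }, numeric(1))
  scale * max(gaps)
}

# Small usage example: a single distributional change at time 50.
set.seed(1)
y <- c(lapply(1:50, function(t) rnorm(5, 0, 1)),
       lapply(51:100, function(t) rnorm(5, 2, 1)))
cusum_ks(y, 1, 50, 100)   # large value: change point at t = 50
cusum_ks(y, 1, 25, 50)    # small value: no change in (1, 50]
```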
The proposed procedure applies the wild binary segmentation procedure (Fry-
zlewicz, 2014). The latter was originally developed for univariate mean change
point detection problems based on the univariate CUSUM statistic. We, instead,
use the CUSUM Kolmogorov–Smirnov statistic. To be specific, we draw a col-
lection of random time intervals and, within each interval, we search for the time point that maximizes the CUSUM Kolmogorov–Smirnov statistic. If the corre-
sponding maximal value of such statistic exceeds an appropriate threshold, then
that time point is added to the collection of estimated change points. The pro-
cess is then repeated separately for each of the resulting two time sub-intervals
and stops when all the CUSUM Kolmogorov–Smirnov statistics are below the

threshold, or when the resulting time interval is too narrow. See Algorithm 1
for details.

Algorithm 1 Kolmogorov–Smirnov Detector. KSD$((s, e), \{(\alpha_m, \beta_m)\}_{m=1}^{M}, \tau)$
INPUT: Sample $\{Y_{t,i}\}_{t=s,i=1}^{e,n_t} \subset \mathbb{R}$, collection of intervals $\{(\alpha_m, \beta_m)\}_{m=1}^{M}$ and tuning parameter $\tau > 0$.
  for $m = 1, \ldots, M$ do
    $(s_m, e_m) \leftarrow (s, e) \cap (\alpha_m, \beta_m)$
    if $e_m - s_m > 2$ then
      $a_m \leftarrow \max_{t = s_m+1, \ldots, e_m-1} D^t_{s_m, e_m}$
      $b_m \leftarrow \arg\max_{t = s_m+1, \ldots, e_m-1} D^t_{s_m, e_m}$
    else
      $a_m \leftarrow -1$
    end if
  end for
  $m^* \leftarrow \arg\max_{m = 1, \ldots, M} a_m$
  if $a_{m^*} > \tau$ then
    Add $b_{m^*}$ to the set of estimated change points
    KSD$((s, b_{m^*}), \{(\alpha_m, \beta_m)\}_{m=1}^{M}, \tau)$   ▷ Recursively call the function KSD
    KSD$((b_{m^*} + 1, e), \{(\alpha_m, \beta_m)\}_{m=1}^{M}, \tau)$
  end if
OUTPUT: The set of estimated change points.

In Algorithm 1, the main input is the threshold τ , a tuning parameter con-


trolling the number of returned change points, with larger values of τ producing
smaller numbers of estimated change points. Our theory in the next section will
shed some light on how to choose τ.
For any integer triplet (s, t, e), 1 ≤ s < t < e ≤ T , the computational cost
of calculating $D^t_{s,e}$ is of order $O\{(e - s)\, n_{s:e} \log(n_{s:e})\}$. This can be seen using
a naïve calculation based on the merge sort algorithm (e.g. Knuth, 1998), and
the fact that the supremum in (2.1) only needs to be taken over z ∈ {Yu,i : s ≤
u ≤ e, i = 1, . . . , nu }. Algorithm 1 therefore has the worst case running time
of order O{M T n1:T log(n1:T )} where M is the number of randomly drawn time
intervals.
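A minimal R sketch of the recursion in Algorithm 1 is given below; it reuses the cusum_ks function from the earlier sketch and is an illustrative implementation under our own naming and interface, not the released NWBS code.

```r
# Sketch of Algorithm 1 (Kolmogorov-Smirnov detector), reusing cusum_ks().
# `intervals` is an M x 2 matrix of (alpha_m, beta_m) pairs; `tau` is the
# detection threshold. Returns the (unsorted) set of estimated change points.
ksd <- function(y, s, e, intervals, tau) {
  M <- nrow(intervals)
  a <- rep(-1, M)   # best statistic within each interval
  b <- rep(NA, M)   # location of its maximizer
  for (m in seq_len(M)) {
    sm <- max(s, intervals[m, 1])
    em <- min(e, intervals[m, 2])
    if (em - sm > 2) {
      stats <- vapply((sm + 1):(em - 1),
                      function(t) cusum_ks(y, sm, t, em), numeric(1))
      a[m] <- max(stats)
      b[m] <- ((sm + 1):(em - 1))[which.max(stats)]
    }
  }
  m_star <- which.max(a)
  if (a[m_star] > tau) {
    split <- b[m_star]
    # Recurse on the two sub-intervals, as in wild binary segmentation.
    c(split,
      ksd(y, s, split, intervals, tau),
      ksd(y, split + 1, e, intervals, tau))
  } else {
    integer(0)
  }
}

# Example call on the data from the previous sketch, with 120 random
# intervals whose endpoints are drawn uniformly; tau is chosen ad hoc here.
set.seed(2)
ab <- t(replicate(120, sort(sample(1:100, 2))))
sort(ksd(y, 1, 100, ab, tau = 2))
```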

3. Theory

3.1. The consistency of the Kolmogorov–Smirnov detector algorithm

We first state a condition involving the minimal spacing, the minimal jump
size and the total number of time points T , which overall amount to a minimal
signal-to-noise ratio condition.
Condition 3.1. There exists an absolute constant $C_{\mathrm{SNR}} > 0$ such that
$$\kappa\, \delta^{1/2} n^{1/2} > C_{\mathrm{SNR}}\, \phi_T \log^{1/2}(n_{1:T}),$$
where $\phi_T$ is any diverging sequence as $T$ grows unbounded.



The scaling exhibited in Condition 3.1 covers nearly all combinations of model
parameters for which the localization task is feasible: in Section 3.2 we show that
no estimator of the change points is guaranteed to be consistent when the model
parameters violate Condition 3.1, up to a poly-logarithmic term in T .
We then show that, under Condition 3.1 and with appropriate choices of
the input parameters, the Kolmogorov–Smirnov detector procedure will yield,
with high probability, the correct number of change points and a vanishing
localization rate. In fact, as shown below in Section 3.2, the resulting localization
rates are nearly minimax rate optimal (achieving the minimax risk, e.g. Section
2.1 in Tsybakov, 2009).
Theorem 3.1 (Guarantees for the Kolmogorov–Smirnov detector). Assume the inputs of Algorithm 1 are as follows. (i) The sequence $\{Y_{t,i}\}_{i=1,t=1}^{n_t,T}$ satisfies Conditions 1.1 and 3.1. (ii) The collection of intervals $\{(\alpha_m, \beta_m)\}_{m=1}^{M} \subset \{1, \ldots, T\}$, with endpoints drawn independently and uniformly from $\{1, \ldots, T\}$, satisfies
$$\max_{m=1,\ldots,M} (\beta_m - \alpha_m) \le C_M \delta, \qquad (3.1)$$
almost surely, for an absolute constant $C_M > 0$. (iii) The tuning parameter $\tau$ satisfies
$$c_{\tau,1} \log^{1/2}(n_{1:T}) \le \tau \le c_{\tau,2}\, \kappa \delta^{1/2} n^{1/2}, \qquad (3.2)$$
where $c_{\tau,1}, c_{\tau,2} > 0$ are absolute constants.
Let $\{\hat\eta_k\}_{k=1}^{\widehat K}$ be the corresponding output of Algorithm 1. Then, there exists a constant $C_\epsilon > 0$ such that
$$\mathbb{P}\Bigl\{\widehat K = K \ \text{and} \ \epsilon_k \le C_\epsilon\, \kappa_k^{-2} \log(n_{1:T})\, n^{-1}, \ \forall k = 1, \ldots, K\Bigr\} \ge 1 - \frac{24\log(n_{1:T})}{T^3 n_{1:T}} - \frac{48T}{n_{1:T}\log(n_{1:T})\,\delta} - \exp\Bigl(\log\frac{T}{\delta} - \frac{M\delta^2}{16T^2}\Bigr), \qquad (3.3)$$
where $\epsilon_k = |\hat\eta_k - \eta_k|$ for $k = 1, \ldots, K$.
If $M \gtrsim \log(T/\delta)\, T^2 \delta^{-2}$, then the probability in (3.3) approaches 1 as $T \to \infty$, which shows that Algorithm 1 is consistent.
Based on Condition 3.1, the range of tuning parameters τ defined in (3.2) is
not empty, and the upper bound on the localization error rate satisfies
$$\frac{\epsilon}{\delta} = \max_{k=1,\ldots,K} \frac{\epsilon_k}{\delta} \lesssim \phi_T^{-2} \to 0, \qquad T \to \infty,$$
where the inequality follows from Condition 3.1. It is important to point out that
Theorem 3.1 yields a family of localization rates that depend on how n, κ and δ
scale with $T$. The slow rate $\phi_T^{-2}$ exhibited in the last display represents the worst
case scenario corresponding to the weakest possible signal-to-noise ratio afforded
in Condition 3.1. In fact, for most combinations of the model parameters, the
rate can be much faster. For instance, when κ is bounded away from zero, the
resulting rate is of order $O\{\log(n_{1:T})\, n^{-1} T^{-1} \phi_T\}$, if $\delta \asymp T$. Still assuming a non-vanishing $\kappa$, and provided that $n$ increases with $T$, our Kolmogorov–Smirnov detector estimator remains consistent even if $\delta$ is as small as $\log(n_{1:T})$, with a localization rate of order $O(n^{-1})$. Finally, if the change points are evenly spaced with $\delta \asymp T^{1/2}$, then the number of change points satisfies $K \asymp T^{1/2}$. In this case the Kolmogorov–Smirnov detector procedure will output a consistent estimator of the change points even with $\kappa$ tending to 0 as $T$ increases, as long as the rate of decay of $\kappa$ is slower than $\phi_T \log^{1/2}(n_{1:T})\, n^{-1/2} T^{-1/4}$.
Remark 3.1 (Comments on (3.1)). The condition (3.1) trivially holds when
the number of change points K remains bounded as T → ∞, and we sample
the intervals {(αm , βm )}Mm=1 ⊂ {1, . . . , T } uniformly in {1, . . . , T }. However,
if K → ∞ as T grows, then (3.1) is somewhat unsatisfactory, as it assumes
some knowledge about the rate of growth of the minimal spacing δ. This may
not be available in practice, even though in many cases an educated guess on
the minimal spacing is not too unreasonable to assume. Though (3.1) does not
appear among the assumptions of Theorem 3.2 in Fryzlewicz (2014) about the
performance of the wild binary segmentation algorithm, it is implicitly assumed
there. More generally, some knowledge of δ, in one form of another, is rou-
tinely assumed in order to demonstrate consistency or optimality of wild binary
segmentation-like algorithms, see, e.g. Wang and Samworth (2018), Wang, Yu
and Rinaldo (2020), Baranowski, Chen and Fryzlewicz (2019), Anastasiou and
Fryzlewicz (2019) and Eichinger and Kirch (2018). For more discussions on
this, see Wang, Yu and Rinaldo (2020).
In the proofs, we randomly draw intervals (am , bm ) and let (αm , βm ) = (am , bm )
if bm − am ≤ CM δ, otherwise discard (am , bm ). This procedure continues un-
til we have M such intervals. In practice, as demonstrated in Section 4, there
is always only a finite number of change points, and we therefore do not need to
impose (3.1).
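As an illustration of this interval-drawing scheme, the R sketch below draws endpoint pairs uniformly and keeps an interval only if it is no wider than a prescribed bound; the helper name and its width argument (playing the role of $C_M \delta$) are our own.

```r
# Sketch: draw M random intervals, discarding those wider than cm_delta,
# mirroring the rejection step described in the proofs.
draw_intervals <- function(T, M, cm_delta) {
  kept <- matrix(NA_integer_, nrow = 0, ncol = 2)
  while (nrow(kept) < M) {
    ab <- sort(sample.int(T, 2))            # endpoints drawn uniformly
    if (ab[2] - ab[1] <= cm_delta) {
      kept <- rbind(kept, ab)               # keep only short enough intervals
    }
  }
  kept
}

# Example: 120 intervals of width at most 40 in a series of length 1000.
set.seed(3)
head(draw_intervals(1000, 120, 40))
```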
We remark that condition (3.1) is only needed to guarantee that the localiza-
tion rate of the Kolmogorov–Smirnov detector procedure given in (3.2) is nearly
minimax rate optimal, see the next section, but not to ensure consistency. In
fact, (3.1) is not at all necessary for consistency. For instance, assuming that

$$\kappa\,(\delta n)^{1/2}\,(\delta/T) \gtrsim \phi_T \log^{1/2}(n_{1:T}),$$

a slightly stronger signal-to-noise ratio setting than those allowed in Condi-


tion 3.1, a simple adaptation of the proof of Theorem 3.1 shows that, even with-
out (3.1), the Kolmogorov–Smirnov detector procedure will be consistent with a
vanishing localization rate $\epsilon/\delta \lesssim \log(n_{1:T})\, \kappa^{-2} n^{-1} T^2 \delta^{-3}$.
Theorem 3.1 delivers a strict improvement upon the guarantees claimed in
other nonparametric change point detection papers, since it guarantees localiza-
tion rates that are local. Indeed, according to Theorem 3.1, each change point
ηk possesses its own localization rate, depending on κk , the magnitude of the
corresponding distributional change. In particular, change points at which the
distributional change is more significant can be estimated more accurately.
The detailed proof of Theorem 3.1 can be found in Appendix B. Here we
provide a brief summary of the technical arguments used there. The first key

step is to obtain finite sample concentration inequalities to control the deviation


between the CUSUM Kolmogorov–Smirnov statistics $\{D^t_{s,e} : 1 \le s < t \le e \le T\}$ and their population versions. The rest of the proofs are conducted
on the so-called good events, where with probabilities tending to 1 as T grows
unbounded, the fluctuations remain within an appropriate range. Next, we show
that the population versions of the CUSUM Kolmogorov–Smirnov statistics are
maximized at the true change points and, crucially, decay rapidly away from
them. Consequently, on the good events, our algorithm will correctly detect or
reject the existence of change points and localize the change points accurately.

3.2. Phase transition and minimax optimality

In this subsection, we prove that Algorithm 1 is optimal, in the sense of guar-


anteeing nearly minimax optimal localization rates across almost all models for
which the localization task is possible. Towards that end, recall that in The-
orem 3.1 we have shown that Algorithm 1 provides consistent change point
estimators under the assumption that
$$\kappa\,(\delta n)^{1/2} \gtrsim \phi_T \log^{1/2}(n_{1:T}). \qquad (3.4)$$
In Lemma 3.1 below, we will show that if
$$\kappa\,(\delta n)^{1/2} \lesssim 1, \qquad (3.5)$$
then no algorithm is guaranteed to be consistent uniformly over all possible
change point problems. In light of (3.4), (3.5), Theorem 3.1 and Lemma 3.1
below, we have found a phase transition over the space of model parameters
that separates scalings of the model parameters for which the localization task
is impossible, from those for which Algorithm 1 is consistent. The separation
between these two regions is sharp, save for the term $\phi_T \log^{1/2}(n_{1:T})$.
Lemma 3.1. Let $\{Y_{t,i}\}_{i=1,t=1}^{n,T}$ be a time series satisfying Condition 1.1 with one and only one change point. Let $P^{T}_{\kappa,n,\delta}$ denote the corresponding joint distribution. Set $\mathcal{P} = \bigl\{P^{T}_{\kappa,n,\delta} : \delta = \min\{2^{-1}\kappa^{-2}n^{-1},\, T/3\}\bigr\}$ and, for each $P \in \mathcal{P}$, let $\eta(P)$ be the corresponding change point. Then,
$$\inf_{\hat\eta} \sup_{P \in \mathcal{P}} E_P\bigl|\hat\eta - \eta(P)\bigr| \ge T/6,$$
where the infimum is over all possible estimators of the change point locations.
In our next result, we derive a minimax lower bound on the localization
task, which applies to nearly all combinations of model parameters outside the
impossibility region found in Lemma 3.1.
Lemma 3.2. Let $\{Y_{t,i}\}_{i=1,t=1}^{n,T}$ be a time series satisfying Condition 1.1 with one and only one change point. Let $P^{T}_{\kappa,n,\delta}$ denote the corresponding joint distribution. Consider the class of distributions
$$\mathcal{Q} = \bigl\{P^{T}_{\kappa,n,\delta} : \delta \le T/2,\ \kappa < 1/2,\ \kappa(\delta n)^{1/2} \ge \zeta_T\bigr\},$$
where $\{\zeta_T\}$ is any sequence such that $\lim_{T\to\infty} \zeta_T = \infty$. Then, for all $T$ large enough, it holds that
$$\inf_{\hat\eta} \sup_{P \in \mathcal{Q}} E_P\bigl|\hat\eta - \eta(P)\bigr| \ge \max\bigl\{1,\ (2e^2 n\kappa^2)^{-1}\bigr\},$$
where the infimum is over all possible estimators of the change point locations.
Note that the condition δ ≤ T /2 automatically holds due to the definition
of δ. The above lower bound matches, save for a poly-logarithmic factor,
the localization rate we have established in Theorem 3.1, thus showing that
Algorithm 1 is nearly minimax rate optimal.

3.3. Choice of tuning parameters

Algorithm 1 calls for three tuning parameters: the upper bound of random
interval widths CM , the number of random intervals M and the threshold τ .
In this subsection, we provide practical guidance on choosing the tuning parameters.
The constant $C_M$ appears in (3.1) and serves purely theoretical purposes. Since we allow the number of change points to diverge, the constant $C_M$ is required in Theorem 3.1 in order to obtain the nearly minimax optimal rates. In practice, any given data set contains only finitely many change points, hence (3.1) holds automatically and $C_M$ is not a tuning parameter in actual use.
For the number of random intervals $M$, as stated in Theorem 3.1, we need to choose $M$ such that $M \gtrsim \log(T/\delta)\, T^2 \delta^{-2}$. In practice, however, $\delta$ is likely unknown; when $\delta \asymp T$, this requirement reduces to $M \gtrsim 1$. In all the numerical experiments in Section 4, we let $M = 120$, which works well in practice.
Arguably, the most important tuning parameter in Algorithm 1 is τ , whose
value determines whether a candidate time point should be deemed a change
point. If we let τ decrease from ∞ to 0, then the procedure produces more and
more change points. In particular, if all the other inputs, namely {Yt,i } and
{(αm , βm )}, are kept fixed, then it holds that B(τ1 ) ⊆ B(τ2 ), for τ1 ≥ τ2 , where
B(τ ) is the collection of estimated change points returned by Algorithm 1 when
a value of τ for the threshold parameter is used. We take advantage of such
nesting in order to design a data-driven method for picking τ .
To proceed, we now introduce Algorithm 2. Algorithm 2 is a generic procedure
that can be used for merging two collections of estimated change points B1
and B2 . Algorithm 2 deletes from B1 ∪ B2 potential false positives by checking
their validity one by one based on the CUSUM Kolmogorov–Smirnov statistics.
However, Algorithm 2 does not scan for potential false positives in the set B1 ∩B2 .
Algorithm 2 Model comparison
INPUT: Sample $\{Y_{t,i}\}_{t=1,i=1}^{T,n_t} \subset \mathbb{R}$, candidate models $B_1$, $B_2$.
  $C \leftarrow (B_2 \setminus B_1) \cup (B_1 \setminus B_2)$
  $n_c \leftarrow |C|$
  $B \leftarrow B_1 \cap B_2$
  for $i = 1, \ldots, n_c$ do
    $\eta \leftarrow \eta_i \in C$
    if $\eta \in B_2 \setminus B_1$ then
      Set $k$ to be the integer satisfying $\eta \in (\hat\eta_k, \hat\eta_{k+1})$, where $\{\hat\eta_k, \hat\eta_{k+1}\} \subset B_1$
    else
      Set $k$ to be the integer satisfying $\eta \in (\hat\eta_k, \hat\eta_{k+1})$, where $\{\hat\eta_k, \hat\eta_{k+1}\} \subset B_2$
    end if
    $\hat z \leftarrow \min \arg\max_{z \in \{Y_{t,i}\}_{t=1,i=1}^{T,n_t}} \bigl|D^{\eta}_{\hat\eta_k, \hat\eta_{k+1}}(z)\bigr|$
    if (3.6) holds then
      $B \leftarrow B \cup \{\eta\}$
    end if
  end for
OUTPUT: $B$

The criterion deployed in Algorithm 2 is based on the following check:
$$\sum_{t=\hat\eta_k+1}^{\eta} \sum_{i=1}^{n_t} \bigl(\mathbf{1}\{Y_{t,i} \le \hat z\} - \widehat F_{(\hat\eta_k+1):\eta}(\hat z)\bigr)^2 + \sum_{t=\eta+1}^{\hat\eta_{k+1}} \sum_{i=1}^{n_t} \bigl(\mathbf{1}\{Y_{t,i} \le \hat z\} - \widehat F_{(\eta+1):\hat\eta_{k+1}}(\hat z)\bigr)^2 + \lambda < \sum_{t=\hat\eta_k+1}^{\hat\eta_{k+1}} \sum_{i=1}^{n_t} \bigl(\mathbf{1}\{Y_{t,i} \le \hat z\} - \widehat F_{(\hat\eta_k+1):\hat\eta_{k+1}}(\hat z)\bigr)^2, \qquad (3.6)$$
where $\lambda > 0$ is a specified tuning parameter to be discussed in Theorem 3.2.
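To make the check explicit, the R sketch below evaluates (3.6) for a candidate η lying between two retained change points; the function name and interface are ours, it reuses the data object y from the earlier sketch, and ẑ is taken as the maximizer of the empirical distribution gap, which coincides with the maximizer of the CUSUM Kolmogorov–Smirnov statistic since the scale factor does not depend on z.

```r
# Sketch: the model-comparison check (3.6). `y` is a list of per-time-point
# observation vectors; (lo, hi] = (hat_eta_k, hat_eta_{k+1}] is the segment
# containing the candidate change point eta; lambda is the penalty.
check_3_6 <- function(y, lo, hi, eta, lambda) {
  seg   <- unlist(y[(lo + 1):hi])
  left  <- unlist(y[(lo + 1):eta])
  right <- unlist(y[(eta + 1):hi])
  # z_hat: (smallest) maximizer of the empirical distribution gap.
  z_grid <- sort(unique(seg))
  gaps <- vapply(z_grid, function(z) abs(mean(left <= z) - mean(right <= z)),
                 numeric(1))
  z_hat <- min(z_grid[gaps == max(gaps)])
  # Sum of squared deviations of the indicators from their segment mean.
  sse <- function(x, z) sum(((x <= z) - mean(x <= z))^2)
  # Split fit plus penalty must beat the unsplit fit for eta to be retained.
  sse(left, z_hat) + sse(right, z_hat) + lambda < sse(seg, z_hat)
}

# Example with lambda = (2/3) * log(total sample size), as used in Section 4.
n_total <- length(unlist(y))
check_3_6(y, 0, 100, 50, lambda = (2/3) * log(n_total))  # typically TRUE: genuine change
check_3_6(y, 0, 50, 25, lambda = (2/3) * log(n_total))   # typically FALSE: no change in (0, 50]
```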


In order to have a practical scheme for selecting τ , we now propose Algo-
rithm 3. Algorithm 3 is a fully data-driven change point detection procedure
with automated tuning parameter selection. Algorithm 3 requires two indepen-
dent subsamples: {Yt,i } and {Wt,i }. In practice, one can perform sample split-
ting. If nt ≥ 2, for all t, then one can partition the data at every time point. If
all or some nt ’s equal 1, then for t ∈ {j : nj = 1}, we randomly assign the asso-
ciated observation to {Yt,i } or {Wt,i } with equal probability. In fact, there is no
need to ensure that both subsamples have exactly the same number of sample
points nt for all t. Our theoretical guarantees in Theorem 3.2 still hold as long as
the number of observations have the same scaling at each time point in the two
samples {Yt,i } and {Wt,i }. When calling (3.6) in Algorithm 3, the empirical dis-
tribution functions are constructed based on {Yt,i }. To present Algorithm 3, we
slightly abuse the notation: in order to emphasize that Algorithm 1 is conducted
on the sample {Wt,i }, we include {Wt,i } as a formal input to the Kolmogorov–
Smirnov detector algorithm. Since the CUSUM Kolmogorov–Smirnov statistics
are based on $\{Y_{t,i}\}$, we use the notation $D^{\eta}_{\hat\eta_k, \hat\eta_{k+1}}(z, \{Y_{t,i}\})$ instead.
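A minimal R sketch of the sample splitting is given below; the helper name is ours, and each observation is assigned to one of the two subsamples with equal probability, which for $n_t = 1$ is exactly the rule described above.

```r
# Sketch: sample splitting for Algorithm 3. `y` is a list of per-time-point
# observation vectors; each observation is randomly routed to {Y_t,i} or {W_t,i}.
split_sample <- function(y) {
  Y <- vector("list", length(y))
  W <- vector("list", length(y))
  for (t in seq_along(y)) {
    take <- runif(length(y[[t]])) < 0.5
    Y[[t]] <- y[[t]][take]
    W[[t]] <- y[[t]][!take]
  }
  list(Y = Y, W = W)
}

set.seed(4)
sw <- split_sample(y)      # `y` from the earlier sketches
lengths(sw$Y)[1:5]         # number of Y-observations at the first time points
```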
As for the implementation of Algorithm 3, we arrange all the candidate sets
in increasing order of their corresponding τj values. This ensures a decreasing
nesting of the candidate change points. We begin with the set corresponding
to the smallest τj and compare consecutive sets. However, unlike Algorithm 2,
we pick a single element from the difference set and decide whether to move on or to
terminate the procedure. Theorem 3.2 provides suitable conditions guaranteeing
that this procedure results in a consistent estimator of the change points.

Algorithm 3 Kolmogorov–Smirnov detector with tuning parameter selection
INPUT: Sample $\{Y_{t,i}\}_{t=1,i=1}^{T,n_t} \subset \mathbb{R}$, collection of intervals $\{(\alpha_m, \beta_m)\}_{m=1}^{M}$ and tuning parameters $\{\tau_j\}_{j=1}^{J}$.
  for $j = 1, \ldots, J$ do
    $B_j \leftarrow \mathrm{KSD}((0, T), \{(\alpha_m, \beta_m)\}_{m=1}^{M}, \tau_j, \{W_{t,i}\})$   ▷ Algorithm 1
  end for
  $O \leftarrow B_J$
  for $j = 1, \ldots, J - 1$ do
    if $B_{j+1} \neq O$ then
      $\eta \leftarrow \min\{x : x \in O \setminus B_{j+1}\}$
      Set $k$ to be the integer satisfying $\eta \in (\hat\eta_k, \hat\eta_{k+1})$, where $\{\hat\eta_k, \hat\eta_{k+1}\} \subset B_{j+1}$
      $\hat z \leftarrow \min \arg\max_{z \in \{Y_{t,i}\}_{t=1,i=1}^{T,n_t}} \bigl|D^{\eta}_{\hat\eta_k, \hat\eta_{k+1}}(z, \{Y_{t,i}\})\bigr|$
      if (3.6) holds then
        $O \leftarrow B_{j+1}$
      else
        Terminate the algorithm
      end if
    else
      $O \leftarrow B_{j+1}$
    end if
  end for
OUTPUT: $O$

Theorem 3.2. Suppose that the following holds. (i) The sequences $\{Y_{t,i}\}_{i=1,t=1}^{n_t,T}$ and $\{W_{t,i}\}_{i=1,t=1}^{n_t,T}$ are independent and satisfy Conditions 1.1 and 3.1. (ii) The collection of intervals $\{(\alpha_m, \beta_m)\}_{m=1}^{M} \subset \{1, \ldots, T\}$, whose endpoints are drawn independently and uniformly from $\{1, \ldots, T\}$, satisfies $\max_{m=1,\ldots,M}(\beta_m - \alpha_m) \le C_M \delta$, almost surely, for an absolute constant $C_M > 1$. (iii) The tuning parameters $\{\tau_j\}_{j=1}^{J}$ satisfy
$$\tau_J > \ldots > c_{\tau,2}\, \kappa \delta^{1/2} n^{1/2} > \ldots > \tau_{j^*} > \ldots > c_{\tau,1} \log^{1/2}(n_{1:T}) > \ldots > \tau_1, \qquad (3.7)$$
where $c_{\tau,1}, c_{\tau,2} > 0$ are absolute constants and $j^* \in \{2, \ldots, J - 1\}$.
Let $B = \{\hat\eta_1, \ldots, \hat\eta_{\widehat K}\}$ be the output of Algorithm 3 with inputs satisfying the conditions above. If $\lambda = C \log(n_{1:T})$, with a large enough constant $C > 0$, then
$$\mathbb{P}\Bigl\{\widehat K = K \ \text{and} \ \epsilon_k \le C_\epsilon\, \kappa_k^{-2} \log(n_{1:T})\, n^{-1}, \ \forall k = 1, \ldots, K\Bigr\} \ge 1 - \frac{48\log(n_{1:T})}{T^3 n_{1:T}} - \frac{96T}{n_{1:T}\log(n_{1:T})\,\delta} - \exp\Bigl(\log\frac{T}{\delta} - \frac{M\delta^2}{16T^2}\Bigr).$$
The proof of Theorem 3.2 can be found in Appendix D. It implicitly as-
sumes that the nested sets {Bj } in Algorithm 3 satisfy |Bj \ Bj+1 | = 1, for
j = 1, . . . , J. If this condition is not met, then the conclusion of Theorem 3.2
still holds provided that we replace the inequality condition in Algorithm 3 with the inequality $\lambda > \max_{m=1,\ldots,M} \sup_{z \in \mathbb{R}} \bigl|D^{\eta}_{a_m, b_m}(z, \{Y_{t,i}\})\bigr|^2$, where $(a_m, b_m) = (\hat\eta_k, \hat\eta_{k+1}) \cap (\alpha_m, \beta_m)$ for $m = 1, \ldots, M$.
A similar proposal for selecting the threshold tuning parameter for the wild
binary segmentation algorithm can be found in Theorem 3.3 in Fryzlewicz

(2014). The proof of our Theorem 3.2 delivers a more careful and precise anal-
ysis.
Finally, Theorem 3.2 suggests choosing the parameter λ in Algorithm 3 as $\lambda = C \log(n_{1:T})$. In practice, we set $C = 2/3$ and find that this choice performs
reasonably well. We also include further simulations in Appendix E showing
that Algorithm 3 is less sensitive to the parameter C than Algorithm 1 is to the
parameter τ .

4. Numerical experiments

4.1. Simulations

In this section we present the results of various simulation experiments aimed at assessing the performance of our method in a wide range of scenarios and in relation to other competing methods. The R (R Core Team, 2019) code used in our simulations is available at https://github.com/hernanmp/NWBS. We measure the performance of an estimator $\widehat K$ of the true number of change points $K$ by the absolute error $|K - \widehat K|$. In all our examples, we report the average absolute errors over 100 repetitions. Furthermore, denoting by $\mathcal{C} = \{\eta_1, \ldots, \eta_K\}$ the set of true change points, the performance of an estimator $\widehat{\mathcal{C}}$ of $\mathcal{C}$ is measured by the one-sided Hausdorff distance $d(\widehat{\mathcal{C}}\,|\,\mathcal{C}) = \max_{\eta \in \mathcal{C}} \min_{x \in \widehat{\mathcal{C}}} |x - \eta|$. By convention, we set it to be infinity when $\widehat{\mathcal{C}}$ is an empty set. If $\widehat{\mathcal{C}} = \{1, \ldots, T\}$, then $d(\widehat{\mathcal{C}}\,|\,\mathcal{C}) = 0$; thus $d(\widehat{\mathcal{C}}\,|\,\mathcal{C})$ can be insensitive to overestimation. To overcome this, we also calculate $d(\mathcal{C}\,|\,\widehat{\mathcal{C}}) = \max_{x \in \widehat{\mathcal{C}}} \min_{\eta \in \mathcal{C}} |x - \eta|$. In all of our simulations, for a method that produces an estimator $\widehat{\mathcal{C}}$, we report the median of both $d(\widehat{\mathcal{C}}\,|\,\mathcal{C})$ and $d(\mathcal{C}\,|\,\widehat{\mathcal{C}})$ over 100 repetitions.
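The two evaluation metrics can be computed with the short R sketch below; the helper name is ours, and it mirrors the definitions and conventions given above, including the behaviour for an empty estimate.

```r
# Sketch: one-sided Hausdorff distances between the true change points `cp`
# and the estimated ones `cp_hat`, following the conventions in the text.
hausdorff_one_sided <- function(cp, cp_hat) {
  if (length(cp_hat) == 0) {
    # Empty estimate: d(hatC | C) = Inf, d(C | hatC) = -Inf by convention.
    return(c(d_hat_given_true = Inf, d_true_given_hat = -Inf))
  }
  d_hat_given_true <- max(sapply(cp, function(e) min(abs(cp_hat - e))))
  d_true_given_hat <- max(sapply(cp_hat, function(x) min(abs(cp - x))))
  c(d_hat_given_true = d_hat_given_true, d_true_given_hat = d_true_given_hat)
}

# Example: true change points at 250, 500, 750; the estimate misses one.
hausdorff_one_sided(c(250, 500, 750), c(248, 495))
```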

Fig 1. A plot showing the densities considered in Scenario 1 (Padilla et al., 2018).

We start by focusing on the case in which nt = 1 for all t = 1, . . . , T . The


methods that we benchmark against are the following: wild binary segmen-
tation (Fryzlewicz, 2014; Baranowski and Fryzlewicz, 2019), Bai and Perron’s
method (Bai and Perron, 2003; Zeileis et al., 2002), pruned dynamic program-
ming (Rigaill, 2010; Cleynen, Rigaill and Koskas, 2016), pruned exact linear
time algorithm (Killick, Fearnhead and Eckley, 2012; Killick and Eckley, 2014),
the simultaneous multiscale change point estimator (Frick, Munk and Sieling, 2014; Pein et al., 2019), heterogeneous simultaneous multiscale change point estimators (Pein, Sieling and Munk, 2017; Pein et al., 2019), the nonparametric multiple change point detection of Zou et al. (2014) and Haynes, Fearnhead and Eckley (2017), the robust functional pruning optimal partitioning method (Fearnhead and Rigaill, 2018) and the kernel change point detection procedure (Celisse et al., 2018; Arlot, Celisse and Harchaoui, 2019). For all the competing methods, we set the respective tuning parameters to their default choices. For the kernel change point detection procedure, we use the function KernSeg MultiD in the R (R Core Team, 2019) package kernseg (Marot et al., 2018). The function produces a sequence of candidate models, each of which is associated with a measure of fit. We choose the best model based on the elbow criterion applied to the sequence of measures of fit.

Fig 2. A plot showing realizations of different scenarios with T = 8000. From left to right and from top to bottom, the panels are from Scenarios 2, 3, 4 and 5, respectively.
We apply Algorithm 3 with λ = 2/3 log(n1:T ), a choice that is guided by
Theorem 3.2 and that we find reasonable in practice (see Appendix E for sensi-
tivity simulations comparing Algorithms 1 and 3). Moreover, we construct the
samples {Yt,i } and {Wt,i } by splitting the data into two time series having odd
and even time indices. We set the number of random intervals as M = 120.
We explain the different generative models that are deployed in our sim-
ulations. For all scenarios, we consider T ∈ {1000, 4000, 8000}. Moreover, we
consider the partition P of {1, . . . , T } induced by the change points η0 = 1 <
η1 < . . . < ηK < ηK+1 = T + 1, which are evenly spaced in {1, . . . , T }. The ele-
ments of P are A1 , . . . , AK+1 , with Aj = [ηj−1 , ηj −1]. We consider the following
scenarios.
Scenario 1. Let K = 7 for each instance of T . Define Ft to have probability
density function as in the left panel of Figure 1 for t ∈ Aj with odd j and as in
the right panel of Figure 1 for t ∈ Aj with even j.
Scenario 2. Let $K = 2^{-1/2} T^{1/2} \log^{-1/2}(T)$ and define $\theta \in \mathbb{R}^T$ as $\theta_t = 1$ for $t \in A_j$ with odd $j$, and $\theta_t = 0$ otherwise. Let the data be generated as $y_t = \theta_t + 3^{-1/2}\varepsilon_t$, $t = 1, \ldots, T$, where the $\varepsilon_t$ are independent and identically distributed as a $t$-distribution with 3 degrees of freedom.
Scenario 3. Replace the $t$-distribution in Scenario 2 with $N(0, 1)$.
Scenario 4. Let $K = 2^{-1/2} T^{1/2} \log^{-1/2}(T)$ and define $\theta \in \mathbb{R}^T$ as $\theta_t = 1/5$ for $t \in A_j$ with odd $j$, and $\theta_t = 1$ otherwise. Let the data be generated as $y_t = \theta_t \varepsilon_t$, $t = 1, \ldots, T$, where the $\varepsilon_t$ are independent and identically distributed as $N(0, 1)$.
Scenario 5. Let $K = 2$ and the data be generated as $y_t = \varepsilon_t$ for $t \in A_j$ with odd $j$, and $y_t = 5^{-1/2}\xi_t$ otherwise, where the $\varepsilon_t$ are independent and identically distributed as $N(0, 1)$ and the $\xi_t$ are independent and identically distributed as a $t$-distribution with 2.5 degrees of freedom.
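As an illustration, the R sketch below generates one realization of Scenario 4 (changes in variance only) for a given T; the helper name and the rounding of K to an integer are our own choices.

```r
# Sketch: generate one realization of Scenario 4 for a given T.
# K change points are evenly spaced; theta alternates between 1/5 and 1
# across the induced segments, and y_t = theta_t * eps_t with eps_t ~ N(0, 1).
gen_scenario4 <- function(T) {
  K <- floor(2^(-1/2) * sqrt(T) / sqrt(log(T)))
  eta <- round(seq(1, T + 1, length.out = K + 2))   # eta_0, ..., eta_{K+1}
  theta <- numeric(T)
  for (j in 1:(K + 1)) {
    idx <- eta[j]:(eta[j + 1] - 1)                  # segment A_j
    theta[idx] <- if (j %% 2 == 1) 1/5 else 1
  }
  list(y = theta * rnorm(T), change_points = eta[2:(K + 1)])
}

set.seed(5)
dat <- gen_scenario4(1000)
length(dat$change_points)   # number of change points K
```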
A visualization of different scenarios is given in Figure 2. We can see that
these five scenarios capture a broad range of models that can allow us to assess
the quality of different methods.
Based on the results in Tables 1 and 2, we can see that, generally, the best
performance is attained by the Kolmogorov–Smirnov detector and kernel change
point detection. In fact, in some cases kernel change point detection outperforms the Kolmogorov–Smirnov detector in terms of the localization error rate, even though, to the best of our knowledge, theoretical guarantees for the localization error rate of kernel change point detection have not yet been established.
This is seen in Scenario 1, where, as the sample size grows, the Kolmogorov–
Smirnov detector and robust functional pruning optimal partitioning provide
the best estimates of K. This is not surprising as Scenario 1 presents a situa-
tion where the distributions are not members of usual parametric families. In
Scenario 2, robust functional pruning optimal partitioning attains the best per-
formance with heterogeneous simultaneous multiscale change point estimators,
kernel change point detection and Kolmogorov–Smirnov detector as the clos-
est competitors. This scenario poses a challenge for most methods due to the
heavy tails nature of the t-distributions. In Scenario 3, Kolmogorov–Smirnov
detector is outperformed by methods like wild binary segmentation, pruned dy-
namic programming, simultaneous multiscale change point estimators and Bai
and Perron’s method. While Kolmogorov–Smirnov detector is still competitive
in estimating the number of change points K, its localization errors are subpar.
This should not come as a surprise, since all these methods are designed to
work well in this particular mean change point model with sub-Gaussian errors.
In Scenario 4, we see a clear advantage of using one of the nonparametric ap-
proaches, such as kernel change point detection, Kolmogorov–Smirnov detector
and nonparametric multiple change point detection. Methods like wild binary
segmentation, pruned exact linear time algorithm, Bai and Perron’s procedure,
simultaneous multiscale change point estimators and heterogeneous simultane-
ous multiscale change point estimators perform poorly in this scenario. This
is especially interesting for the heterogeneous simultaneous multiscale change
point estimator, as this method has been proven to be effective in detecting changes in variance even when there are concurrent changes in mean. However,
Scenario 4 only includes changes in variance with the mean remaining constant.

Table 1
KSD, Kolmogorov–Smirnov detector; WBS, wild binary segmentation; PELT, pruned exact
linear time algorithm; S3IB, pruned dynamic programming algorithm; NMCD,
nonparametric multiple change point detection; SMUCE, simultaneous multiscale change
point estimators; B&P, Bai and Perron’s method; HSMUCE, heterogeneous simultaneous
multiscale change point estimators; RFPOP, robust functional pruning optimal partitioning
methods; KCP, kernel change point detection.

Scenario 1.
T     Metric      KSD    WBS    PELT   S3IB   NMCD   SMUCE  B&P     HSMUCE  RFPOP  KCP
1000  |K − K̂|     1.5    11.0   4.35   6.3    1.8    4.35   6.85    1.6     1.8    4.4
1000  d(Ĉ | C)    32.0   63.0   64.5   ∞      23.0   43.5   ∞       75.5    13.0   637.0
1000  d(C | Ĉ)    52.0   82.0   90.0   −∞     36.0   93.5   −∞      56.5    17.0   23.0
4000  |K − K̂|     0.2    0.5    21.4   6.9    2.2    52.0   4.9     8.3     0.24   1.4
4000  d(Ĉ | C)    31.0   43.5   46.5   ∞      13.5   448.0  1386.0  70.0    10.0   108.0
4000  d(C | Ĉ)    31.0   364.5  448    −∞     69.5   32.7   101.5   319.0   12.0   27.0
8000  |K − K̂|     0.0    0.9    41.2   5.35   3.25   62.7   3.7     17.6    0.24   0.9
8000  d(Ĉ | C)    38.0   29.5   55.0   ∞      10.0   70.0   1831    84.5    11.0   24.0
8000  d(C | Ĉ)    38.0   818.5  955.0  −∞     227.0  958.0  206.5   910.0   11.0   34.0
Scenario 2.
T     Metric      KSD    WBS    PELT   S3IB   NMCD   SMUCE  B&P     HSMUCE  RFPOP  KCP
1000  |K − K̂|     1.3    9.9    2.7    1.85   1.7    4.6    7.05    0.45    0.1    1.76
1000  d(Ĉ | C)    11.0   8.5    9.0    15.5   7.0    10.5   738.5   23.0    6.0    12
1000  d(C | Ĉ)    13.0   54.5   45.5   29.5   28.0   53.5   16.0    22.0    6.0    5.0
4000  |K − K̂|     0.0    15.4   10.05  14.7   3.35   14.8   11.95   0.0     0.1    0.6
4000  d(Ĉ | C)    16.0   9.5    12.0   ∞      9.0    35.0   1007    16.0    8.0    8.0
4000  d(C | Ĉ)    16.0   176    163.0  −∞     107.5  164    20.0    16.0    8.0    8.0
8000  |K − K̂|     1.3    8.4    18.45  20.8   4.75   28.5   18.4    0.1     0.2    0.0
8000  d(Ĉ | C)    363.0  15.5   9.5    ∞      13.5   40.0   2179    20.0    9.0    7.0
8000  d(C | Ĉ)    18.0   254.5  2179   −∞     129.5  257    19.5    20.0    10.0   7.0
Scenario 3.
T     Metric      KSD    WBS    PELT   S3IB   NMCD   SMUCE  B&P     HSMUCE  RFPOP  KCP
1000  |K − K̂|     0.8    0.0    0.0    0.1    0.8    0.0    0.0     0.2     0.0    2.3
1000  d(Ĉ | C)    16.0   9.0    8.5    8.5    9.5    8.5    8.5     9.5     9.0    140.0
1000  d(C | Ĉ)    19.0   9.5    8.5    0.9    10.5   8.5    8.5     9.5     9.0    7.0
4000  |K − K̂|     0.1    0.0    0.0    0.0    1.8    0.0    0.0     0.0     0.1    0.2
4000  d(Ĉ | C)    22.0   8.0    9.0    8.0    11.5   8.0    8.0     9.5     8.0    11.0
4000  d(C | Ĉ)    20.0   8.0    9.0    8.0    136.5  8.0    8.0     8.0     8.0    9.0
8000  |K − K̂|     0.2    0.0    0.0    0.1    1.7    0.0    0.0     0.2     0.0    0.2
8000  d(Ĉ | C)    11.5   6.0    6.0    6.0    8.5    6.0    6.0     6.0     6.0    6.0
8000  d(C | Ĉ)    11.5   6.0    6.0    6.0    353    6.0    6.0     6.5     6.0    6.0

Finally, Scenario 5 seems to be the most challenging one for all methods. In fact,
the Kolmogorov–Smirnov detector and kernel change point detection seem to be the only methods capable of correctly estimating the number of change points, with
the kernel change point detection procedure yielding smaller localization rates.
In our second set of simulations we study the case where the number of
data points collected at any time can be more than one. We consider the same
5 scenarios, same tuning parameter selection method and same performance

Table 2
KSD, Kolmogorov–Smirnov detector; WBS, wild binary segmentation; PELT, pruned exact
linear time algorithm; S3IB, pruned dynamic programming algorithm; NMCD,
nonparametric multiple change point detection; SMUCE, simultaneous multiscale change
point estimators; B&P, Bai and Perron’s method; HSMUCE, heterogeneous simultaneous
multiscale change point estimators; RFPOP, robust functional pruning optimal partitioning
methods; KCP, kernel change point detection.

Scenario 4.
T     Metric      KSD   WBS     PELT    S3IB  NMCD   SMUCE  B&P  HSMUCE  RFPOP  KCP
1000  |K − K̂|     0.9   4.0     9.8     4.9   2.45   27.75  5.0  4.7     13.8   0.4
1000  d(Ĉ | C)    36.0  ∞       40.0    ∞     4.0    24.5   ∞    ∞       82.0   5.0
1000  d(C | Ĉ)    32.0  −∞      153.5   −∞    67.0   157    −∞   −∞      157.0  5.0
4000  |K − K̂|     0.0   3.8     36.3    5.0   2.7    71.1   5.0  4.5     44.8   0.1
4000  d(Ĉ | C)    19.0  ∞       106.5   ∞     4.5    46.0   ∞    ∞       125.0  5.0
4000  d(C | Ĉ)    19.0  −∞      644.5   −∞    66.0   651.5  −∞   −∞      647.0  5.0
8000  |K − K̂|     0.1   3.5     60.3    5.0   4.0    109.3  5.0  4.5     71.4   0.0
8000  d(Ĉ | C)    23.0  6301.5  115.0   ∞     2.5    47.5   ∞    ∞       135    6.0
8000  d(C | Ĉ)    28.0  238     1300.5  −∞    566.5  1316   −∞   −∞      1293   6.0
Scenario 5.
T     Metric      KSD   WBS     PELT    S3IB  NMCD   SMUCE  B&P  HSMUCE  RFPOP  KCP
1000  |K − K̂|     0.4   3.6     7.2     5.0   1.5    26.95  5.0  4.8     1.96   0.1
1000  d(Ĉ | C)    27.0  ∞       70.0    ∞     4.5    21.0   ∞    ∞       ∞      9.0
1000  d(C | Ĉ)    29.0  −∞      147.0   −∞    32     159.0  −∞   −∞      −∞     8.0
4000  |K − K̂|     0.1   3.5     38.2    5.0   3.35   72.05  5.0  4.5     2.0    0.1
4000  d(Ĉ | C)    24.0  ∞       82.0    ∞     3.0    39.5   ∞    ∞       ∞      12.0
4000  d(C | Ĉ)    25.0  −∞      629.5   −∞    275    640.5  −∞   −∞      −∞     12.0
8000  |K − K̂|     0.0   4.2     63.9    5.0   4.4    114.0  5.0  4.6     1.84   0.1
8000  d(Ĉ | C)    37.0  ∞       107.5   ∞     2.5    55.0   ∞    ∞       ∞      12.0
8000  d(C | Ĉ)    37.0  −∞      1309    −∞    552    1310.5 −∞   −∞      −∞     12.0

metrics as in the first set of simulations. However, instead of setting nt = 1, we


let $n_t$ be fixed at 5, 15 and 30, or be randomly distributed as Poisson(5), Poisson(15) and Poisson(30), for each $t$. We also set $T = 1000$. We only present the results of the Kolmogorov–Smirnov detector, because no other method would automatically be able to handle this situation. We omit $d(\mathcal{C}\,|\,\widehat{\mathcal{C}})$ as it does not provide additional information. The results in Table 3 show the effectiveness
of the Kolmogorov–Smirnov detector for estimating the number of change points
and their locations.

4.2. Real data analysis

We consider the array comparative genomic hybridization micro-array data set


from Bleakley and Vert (2011). This consists of individuals with bladder tu-
mours. The data set has been processed and can be obtained in the R (R Core
Team, 2019) package ecp (Matteson and James, 2013). For the microarray cor-
responding to the first individual we consider change point localization using
different methods.

Table 3
Performance evaluations for the KSD method in settings where nt can be larger than 1.

Scenario 1.
Metric     nt = 5  nt = 15  nt = 30  nt ∼ Pois(5)  nt ∼ Pois(15)  nt ∼ Pois(30)
|K − K̂|    0.1     0.2      0.0      0.6           0.3            0.0
d(Ĉ | C)   6.5     2.5      2.0      6.0           2.0            1.5
Scenario 2.
Metric     nt = 5  nt = 15  nt = 30  nt ∼ Pois(5)  nt ∼ Pois(15)  nt ∼ Pois(30)
|K − K̂|    0.1     0.0      0.0      0.4           0.0            0.0
d(Ĉ | C)   3.0     1.0      0.0      3.0           1.0            0.0
Scenario 3.
Metric     nt = 5  nt = 15  nt = 30  nt ∼ Pois(5)  nt ∼ Pois(15)  nt ∼ Pois(30)
|K − K̂|    0.3     0.3      0.0      0.4           0.0            0.0
d(Ĉ | C)   6.5     1.0      0.5      5.0           2.0            1.0
Scenario 4.
Metric     nt = 5  nt = 15  nt = 30  nt ∼ Pois(5)  nt ∼ Pois(15)  nt ∼ Pois(30)
|K − K̂|    0.2     0.0      0.0      0.0           0.0            0.0
d(Ĉ | C)   6.0     2.0      0.0      5.0           1.0            1.0
Scenario 5.
Metric     nt = 5  nt = 15  nt = 30  nt ∼ Pois(5)  nt ∼ Pois(15)  nt ∼ Pois(30)
|K − K̂|    0.1     0.0      0.0      0.0           0.0            0.0
d(Ĉ | C)   9.5     3.0      2.0      6.0           5.5            6.0

Fig 3. A plot showing individual 1 in the array comparative genomic hybridization data set,
with estimated change points indicated by vertical lines. From left to right, the panels are
based on the estimators from the Kolmogorov–Smirnov detector, wild binary segmentation,
nonparametric multiple change point detection and robust functional pruning optimal parti-
tioning methods, respectively.

Figure 3 shows that all the methods seem to recover the important change
points in the time series associated with individual 1 in the data set. A po-
tential advantage of the Kolmogorov–Smirnov detector method is that it seems
less sensitive to potentially spurious change points than wild binary
segmentation, nonparametric multiple change point detection and robust func-
tional pruning optimal partitioning.

Appendix A: Comparisons

We compare our rates with those in the univariate mean change point detec-
tion problem, which assumes sub-Gaussian data (e.g. Wang, Yu and Rinaldo,
2020). On the one hand, this comparison inherits the main arguments that arise when comparing parametric and nonparametric modelling methods in general. In particular, with the general model assumption we impose on the underlying distribution functions, we are free from the risk of model mis-specification. On the other hand, perhaps surprisingly, we achieve the same rates as those in the univariate mean change point detection case, even though sub-Gaussianity is assumed there. In fact, this is to be expected. We use the empirical distribution function in our CUSUM Kolmogorov–Smirnov statistic, which is essentially a weighted Bernoulli random variable at every $z \in \mathbb{R}$. Since Bernoulli random variables are sub-Gaussian, and since the empirical distribution functions are step functions with knots only at the sample points, the same rates are indeed to be expected.
Furthermore, the heterogeneous simultaneous multiscale change point es-
timator from Pein, Sieling and Munk (2017) can also be compared to the
Kolmogorov–Smirnov detector. Assuming Gaussian errors, $\delta \asymp T$ and $K = O(1)$, Theorems 3.7 and 3.9 in Pein, Sieling and Munk (2017) proved that the heterogeneous simultaneous multiscale change point estimator can consistently estimate the number of change points, and that $\epsilon \lesssim r(T)$, for any sequence $r(T)$ such that $r(T)/\log(T) \to \infty$. This is weaker than our upper bound, which guarantees $\epsilon \lesssim \log(T)$. The Kolmogorov–Smirnov detector can handle changes in
variance when the mean remains constant, a setting where it is unknown if the
heterogeneous simultaneous multiscale change point estimator is consistent.
Another interesting contrast can be made between the Kolmogorov–Smirnov
detector and the multiscale quantile segmentation method in Vanegas, Behr
and Munk (2019). Both algorithms make no assumptions on the distributional
form of the cumulative distribution functions. However, the multiscale quantile
segmentation is designed to identify changes in one known quantile. This is
not a requirement for the Kolmogorov–Smirnov detector which can detect any
type of changes without previous knowledge of their nature. As for statistical
guarantees, translating to our notation, provided that $\delta \gtrsim \log(T)$, multiscale quantile segmentation can consistently estimate the number of change points and achieve $\epsilon \lesssim \log(T)$. This matches our theoretical guarantees in Theorem 3.1.
We compare the theoretical properties of the Kolmogorov–Smirnov detector
with the ones of the nonparametric multiple change point detection procedure in
Zou et al. (2014). Both methods guarantee consistent change point localization
of univariate sequences in fully nonparametric settings.
• We measure the magnitude κ of the distribution changes at the change
points using the Kolmogorov–Smirnov distance, as in (1.1). In contrast,
Zou et al. (2014) deploy a weighted Kullback–Leibler divergence, see As-
sumption A4 in their paper, which is stronger than the Kolmogorov–
Smirnov distance, and therefore more discriminative. At the same time,

the authors require the cumulative distribution functions to be continuous,


while our results hold with arbitrary distribution functions.
• We let κ change with T , thus allowing for the minimal jump size to decay
to 0, and provide localization rates that depend explicitly on κ. In contrast,
Zou et al. (2014) constrain the jump sizes to be bounded away from 0,
and the resulting localization rate does not involve the jump sizes.
• Zou et al. (2014) impose stronger conditions on the rate at which the
number of change points is permitted to diverge as a function of T . For
example, assuming equally spaced change points, the assumptions in their
Theorem 1 require, in our notation, that K = o(T 1/4 ). In contrast, we
allow for K to grow as fast as T / log(T ).
• If we let the number of change points be bounded in T , as assumed in
Theorem 2 of Zou et al. (2014), then, again translating into our notation,
their procedure yields $\epsilon \lesssim \log^{2+c}(T)$, for $c > 0$, while we can achieve $\epsilon \lesssim \log(T)$.

Finally, we compare with the KCP method, studied in Arlot, Celisse and
Harchaoui (2019) and Garreau and Arlot (2018). Specifically, the following dis-
cussions are based on translating Theorem 3.1 in Garreau and Arlot (2018) to
our notation, and assuming nt = 1, t ∈ {1, . . . , T } in our setting.

• Both ours and theirs study a nonparametric change point localization


problem, and both achieve parametric rates in the sense that the complex-
ity of the class of the distribution functions are not shown in the results. In
our case, this is because we adopt the Kolmogorov–Smirnov distance and
transform general distributions to Bernoulli distributions. In Garreau and
Arlot (2018), this is because they map the data to a reproducing kernel
Hilbert space.
• The jump size κ is measured differently in our work and theirs. In our paper, the jump size is taken to be the difference in terms
of the Kolmogorov–Smirnov distance, which reflects the genuine differ-
ences between different distributions. In Garreau and Arlot (2018), it is
a distance induced by the Hilbert space norm. More specifically, it is the
distance between two different Bochner integrals. With this definition, de-
spite the flexibility, as pointed out in Garreau and Arlot (2018), it is not
guaranteed that the change points detected are the change points of the
original distributions.
• In terms of the signal-to-noise ratio condition and the localization rates,
there are two differences between ours and theirs. Firstly, we allow all
model parameters to vary with the sample size n, while the variance pa-
rameter in Garreau and Arlot (2018) is assumed to be a constant. Secondly,
in both the signal-to-noise ratio condition (see Eq. (3.3) in Garreau and
Arlot, 2018) and the localization rate (see the final result in Theorem 3.1
in Garreau and Arlot, 2018), there is an additional factor K, the number
of change points. Note that this factor does not appear in our results.

Appendix B: Proof of Theorem 3.1

Definition B.1. Denote the population version of the CUSUM Kolmogorov–Smirnov statistic as
$$\Delta^t_{s,e} = \sup_{z \in \mathbb{R}} \bigl|\Delta^t_{s,e}(z)\bigr|,$$
where
$$\Delta^t_{s,e}(z) = \left(\frac{n_{s:t}\, n_{(t+1):e}}{n_{s:e}}\right)^{1/2} \bigl(F_{s:t}(z) - F_{(t+1):e}(z)\bigr), \qquad (B.1)$$
and
$$F_{s:e}(z) = \frac{1}{n_{s:e}} \sum_{t=s}^{e} n_t F_t(z).$$

Proof of Theorem 3.1. Let $\epsilon = C_\epsilon \kappa^{-2} \log(n_{1:T})\, n_{\max}^{7} n_{\min}^{-8}$. Since $\epsilon$ is the upper bound on the localization error, by induction, it suffices to consider any interval $(s, e) \subset (1, T)$ that satisfies
$$\eta_{k-1} \le s \le \eta_k \le \ldots \le \eta_{k+q} \le e \le \eta_{k+q+1}, \qquad q \ge -1,$$
and
$$\max\bigl\{\min\{\eta_k - s,\ s - \eta_{k-1}\},\ \min\{\eta_{k+q+1} - e,\ e - \eta_{k+q}\}\bigr\} \le \epsilon,$$
where $q = -1$ indicates that there is no change point contained in $(s, e)$.
By Condition 3.1, there exists an absolute constant $c > 0$ such that
$$\epsilon \le c\, \frac{n_{\max}^{7}}{n_{\min}^{7}}\, \delta \le \delta/4.$$

It has to be the case that for any change point ηk ∈ (0, T ), either |ηk − s| ≤  or
|ηk −s| ≥ δ − ≥ 3δ/4. This means that min{|ηk −s|, |ηk −e|} ≤  indicates that
ηk is a detected change point in the previous induction step, even if ηk ∈ (s, e).
We refer to ηk ∈ (s, e) an undetected change point if min{|ηk − s|, |ηk − e|} ≥
3δ/4.
In order to complete the induction step, it suffices to show that we (i) will not
detect any new change point in (s, e) if all the change points in that interval have
been previous detected, and (ii) will find a point b ∈ (s, e) such that |ηk − b| ≤ 
if there exists at least one undetected change point in (s, e).
For j = 1, 2, define the events
   
e 
nk
  
 (j)
Aj (γ) = max sup  wk 1{Yk,i ≤z} − E 1{Yk,i ≤z}  ≤ γ ,
1≤s<b<e≤T z∈R  
k=s i=1

where
⎧ 1/2

⎨ n(b+1):e , k = s, . . . , b,
and wk = n−1/2
(1) ns:b ns:e (2)
wk =  1/2 s:e .

⎩− ns:b
n(b+1):e ns:e , k = b + 1, . . . , e,
1176 O. H. Madrid Padilla et al.

Define

K
S= αs ∈ [ηk − 3δ/4, ηk − δ/2], βs ∈ [ηk + δ/2, ηk + 3δ/4],
k=1
for some s = 1, . . . , M .

Set γ to be Cγ log1/2 (n1:T ), with a sufficiently large constant Cγ > 0. The rest of
the proof assumes the event A1 (γ) ∩ A2 (γ) ∩ S, the probability of which can be
lower bounded using Lemma B.3 and also Lemma 13 in Wang, Yu and Rinaldo
(2020).

Step 1. In this step, we will show that we will consistently detect or reject
the existence of undetected change points within (s, e). Let am , bm and m∗ be
defined as in Algorithm 1. Suppose there exists a change point ηk ∈ (s, e) such
that min{ηk −s, e−ηk } ≥ 3δ/4. In the event S, there exists an interval (αm , βm )
selected such that αm ∈ [ηk − 3δ/4, ηk − δ/2] and βm ∈ [ηk + δ/2, ηk + 3δ/4].
Following Algorithm 1, (sm , em ) = (αm , βm ) ∩ (s, e). We have that min{ηk −
sm , em − ηk } ≥ (1/4)δ and (sm , em ) contains at most one true change point.
It follows from Lemma B.4, with c1 there chosen to be 1/4, that

 t  κδnmin
3/2
max Δs ≥ ,
m ,em
t=sm +1,...,em −1 8(em − sm )1/2 nmax
Therefore

am = max Dst m ,em ≥ max Δtsm ,em − γ


t=sm +1,...,em −1 t=sm +1,...,em −1
3/2
1 nmin
≥ 1/2
κδ 1/2 − γ.
8CM nmax

Thus for any undetected change point ηk ∈ (s, e), it holds that
3/2
1 nmin
a m∗ = max am ≥ 1/2
κδ 1/2 − γ ≥ cτ,2 κδ 1/2 n1/2 , (B.2)
m=1,...,M 8CM nmax

where the last inequality is from the choice of γ, the fact nmin  nmax and
cτ,2 > 0 is achievable with a sufficiently large CSNR in Condition 3.1. This
means we accept the existence of undetected change points.
Recalling the notation {k }K
k=1 we introduced in (3.3). Here with some abuse
of notation, we let

k = C κ−2
k log(n1:T )n
−1
, k = 1, . . . , K.

Suppose that there are no undetected change points within (s, e), then for any
(sm , em ), one of the following situations must hold.
(a) There is no change point within (sm , em );
Optimal nonparametric change point analysis 1177

(b) there exists only one change point ηk ∈ (sm , em ) and min{ηk − sm , em −
ηk } ≤ k ; or
(c) there exist two change points ηk , ηk+1 ∈ (sm , em ) and ηk − sm ≤ k ,
em − ηk+1 ≤ k+1 .
Observe that if (a) holds, then we have

max Dst m ,em ≤ max Δtsm ,em + γ = γ < τ,


t=sm +1,...,em −1 t=sm +1,...,em −1

so no change points are detected.


Cases (b) and (c) are similar, and case (b) is simpler than (c), so we will only
focus on case (c). It follows from Lemma B.2 that

max Δtsm ,em ≤ n1/2


max (em − ηk+1 )
1/2
max (ηk − sm )
κk+1 + n1/2 1/2
κk
t=sm +1,...,em −1

≤ 2C1/2 log1/2 (n1:T ),

therefore

max Dst m ,em ≤ max Δtsm ,em + γ


t=sm +1,...,em −1 t=sm +1,...,em −1

≤ 2C1/2 log1/2 (n1:T ) + Cγ log1/2 (n1:T ) < τ.

Under (3.2), we will always correctly reject the existence of undetected change
points.

Step 2. Assume that there exists a change point ηk ∈ (s, e) such that min{ηk −
s, e − ηk } ≥ 3δ/4. Let sm , em and m∗ be defined as in Algorithm 1. To complete
the proof it suffices to show that, there exists a change point ηk ∈ (sm∗ , em∗ )
such that min{ηk − sm∗ , ηk − em∗ } ≥ δ/4 and |bm∗ − ηk | ≤ .
To this end, we are to ensure that the assumptions of Lemma B.9 are verified.
Note that (B.26) follows from (B.2), (B.27) and (B.28) follow from the definitions
of events A1 (γ) and A2 (γ), and (B.29) follows from Condition 3.1.
Thus, all the conditions in Lemma B.9 are met. Therefore, we conclude that
there exists a change point ηk , satisfying

min{em∗ − ηk , ηk − sm∗ } > δ/4 (B.3)

and
n9max −2 2
|bm∗ − ηk | ≤ C κ γ ≤ ,
n10
min
where the last inequality holds from the choice of γ and Condition 3.1.
The proof is completed by noticing that (B.3) and (sm∗ , em∗ ) ⊂ (s, e) imply
that
min{e − ηk , ηk − s} > δ/4 > .
As discussed in the argument before Step 1, this implies that ηk must be an
undetected change point.
1178 O. H. Madrid Padilla et al.

Below are a number of auxiliary lemmas. Lemma B.1 plays the role of Lemma
2.2 in Venkatraman (1992). Lemma B.3 controls the deviance between sam-
ple and population Kolmogorov–Smirnov statistics. Lemma B.4 is the density
version of Lemma 2.4 in Venkatraman (1992). Lemma B.5 plays the role of
Lemma 2.6 of Venkatraman (1992). Lemma B.6 is essentially Lemma 17 in
Wang, Yu and Rinaldo (2020). Lemma B.8 is Lemma 19 in Wang, Yu and Ri-
naldo (2020).
Lemma B.1. Under Condition 1.1, for any pair (s, e) ⊂ (1, T ) satisfying

ηk−1 ≤ s ≤ ηk ≤ . . . ≤ ηk+q ≤ e ≤ ηk+q+1 , q ≥ 0,

we have the following.


(i) Let
b1 ∈ arg max Δbs,e .
b=s+1,...,e−1

Then b1 ∈ {η1 , . . . , ηK }.
(ii) Let z ∈ arg maxx∈R |Δbs,e 1
(x)|. If Δts,e (z) = 0 for some t ∈ (s, e), then
|Δs,e (z)| is either monotonic or decreases and then increases within each
t

of the interval (s, ηk ), (ηk , ηk+1 ), . . . , (ηk+q , e).


Proof. We prove by contradiction assuming that b1 ∈ / {η1 , . . . , ηk }. Let z0 be
such that
z0 ∈ arg max |Δbs,e
1
(z)|.
z∈R

Note that due to the fact for any cumulative distribution function F : R → [0, 1],
it holds that F (−∞) = 1 − F (∞) = 0, we have that z0 ∈ R exists.
Therefore,
b1 ∈ arg max |Δbs,e (z0 )|.
b=s+1,...,e−1

Next consider the time series {rl (z0 )}nl=1


s:e
defined as


⎪Fs (z0 ) l ∈ {1, . . . , ns },

⎨F (z )
s+1 0 l ∈ {ns + 1, . . . , ns:(s+1) },
rl (z0 ) =
⎪. . . ,



Fe (z0 ) l ∈ {ns:(e−1) + 1, . . . , ns:e },

and for 1 ≤ l < ns:e define


1/2 
l  1/2 
ns:e
ns:e − l l
l
r̃1,n (z0 ) = rt (z0 ) − rt (z0 ).
s:e
ns:e l t=1
ns:e (ns:e − l)
t=l+1

The set of change points of the time series {rl (z0 )}nl=1
s:e
is

ns:ηk , . . . , ns:ηk+q .
Optimal nonparametric change point analysis 1179

Lemma 2.2 from Venkatraman (1992) applied to {rl (z0 )}nl=1


s:e
leading to that
n ns:η
Δbs,e
1
= Δbs,e
1 s:b1
(z0 ) = r̃1,n s:e
(z0 ) < max j
r̃1,ns:e (z0 )
j∈{k,...,k+q}

= max Δηs,e
j
(z0 ) ≤ max Δηs,e
j
,
j∈{k,...,k+q} j∈{k,...,k+q}

which is a contradiction.
As for (ii), it follows from applying Lemma 2.2 from Venkatraman (1992) to
{rl (z0 )}nl=1
s:e
.
Lemma B.2. Under Condition 1.1, let t ∈ (s, e). It holds that

Δts,e ≤ 2n1/2
max min{(s − t + 1)
1/2
, (e − t)1/2 }. (B.4)

If ηk is the only change point in (s, e), then

Δηs,e
k
≤ κk n1/2
max min{(s − ηk + 1)
1/2
, (e − ηk )1/2 }. (B.5)

If (s, e) ⊂ (1, T ) contains two and only two change points ηk and ηk+1 , then
we have

max Δts,e ≤ n1/2


max (e − ηk+1 )
1/2
κk+1 + nmax )1/2 (ηk − s)1/2 κk . (B.6)
t=s+1,...,e−1

Proof. As for (B.4), it follows from that


1/2 1/2
2ns:b n(b+1):e 1/2 1/2
Δbs,e ≤ 1/2 ≤ 2 min{ns:b , n(b+1):e }
ns:e
1/2
≤ 2nmax min{(s − b + 1)1/2 , (e − b)1/2 }.

As for (B.5), it is due to that


1/2 1/2
ns:ηk n(ηk +1):e  
Δηs,e
k
= 1/2 s:ηk supF (z) − F(ηk +1):e (z)
ns:e z∈R

≤ max min{(s − ηk + 1)
κk n1/2 1/2
, (e − ηk )1/2 }.

Equation (B.6) follows similarly.


The following Lemma provides a concentration result for the sample CUSUM
statistic around its population version. One natural way to proceed in its proof
is to exploit Dvoretzky–Kiefer–Wolfowitz’s inequality. However, given the pres-
ence of multiple change points, an immediate application of Dvoretzky–Kiefer–
Wolfowitz’s inequality produces an extra factor K in the upper bound in
Lemma B.3. We instead obtain a better upper bound with a different proof
technique.
Lemma B.3. Under Condition 1.1, for any 1 ≤ s < b < e ≤ T and z ∈ R,
define
Λbs,e (z) = Ds,e
b
(z) − Δbs,e (z),
1180 O. H. Madrid Padilla et al.

b
where Ds,e (z) and Δbs,e (z) are the sample and population versions of the
Kolmogorov–Smirnov statistic defined in Definition 2.1 and (B.1), respectively.
It holds that
 
  T4
1/2
pr max sup Λbs,e (z) > log + log(n1:T ) + 6 log1/2 (n1:T )
1≤s<b<e≤T z∈R 12δ

48 log(n1:T ) 12 log(n1:T ) 24T
+ 1/2
≤ 3n
+ .
n T 1:T n1:T log(n1:T )δ
1:T

Moreover
  
 
 −1/2  
e nt

pr max sup ns:e 1Yt,i ≤z − E(1Yt,i ≤z ) 
1≤s<e≤T z∈R  
t=s i=1
 1/2
T4 48 log(n1:T )
> log + log(n1:T ) + 6 log1/2 (n1:T ) + 1/2
12δ n 1:T
12 log(n1:T ) 24T
≤ 3
+ . (B.7)
T n1:T n1:T log(n1:T )δ
Remark B.1. Lemma B.3 shows that as T diverges unbounded, it holds that
   
max sup Λbs,e (z) = Op log1/2 (n1:T ) .
1≤s<b<e≤T z∈R

Proof of Lemma B.3. For any 1 ≤ s < b < e ≤ T and z ∈ R, let

ns:b n(b+1):e  
1/2 1/2

1/2
Fs:b (z) − F(b+1):e (z)
ns:e
1/2

b 
nk
n(b+1):e e nk 1/2
ns:b
= 1/2 1/2
1 {Yk,i ≤z} − 1/2
1
1/2 {Yk,i ≤z}
k=s i=1 ns:b ns:e k=b+1 i=1 n(b+1):e ns:e

e 
nk
= wk 1{Yk,i ≤z} ,
k=s i=1

where ⎧ 1/2

⎨ n(b+1):e
ns:b ns:e , k = s, . . . , b;
wk =  1/2 (B.8)

⎩− ns:b
n(b+1):e ns:e , k = b + 1, . . . , e.
Therefore, we have
  nk
ns:b n(b+1):e 1/2
e
 
Fs:b (z) − F(b+1):e (z) = wk E 1{Yk,i ≤z} ,
ns:e i=1
k=s
   
 e 
n k   e nk
 
  
b
Ds,e = sup  wk b
1{Yk,i ≤z}  and Δs,e = sup  wk E 1{Yk,i ≤z}  .
z∈R  i=1
 z∈R  
i=1
k=s k=s
Optimal nonparametric change point analysis 1181

Since
 
 e nk
    

b
Ds,e = sup  wk E 1{Yk,i ≤z} + 1{Yk,i ≤z} − E 1{Yk,i ≤z} 
z∈R  i=1

k=s
   
 e nk
   e nk
  
  
≤ sup  wk E 1{Yk,i ≤z}  + sup  wk 1{Yk,i ≤z} − E 1{Yk,i ≤z} 
z∈R  i=1
 z∈R 
i=1

k=s k=s
 
 e 
n k
  
 
= Δbs,e + sup  wk 1{Yk,i ≤z} − E 1{Yk,i ≤z}  ,
z∈R  
k=s i=1

we have
 
 b  e 
nk
  
Ds,e − Δbs,e  ≤ sup  wk 1{Yk,i ≤z} − E 1{Yk,i ≤z}

. (B.9)
z∈R  
k=s i=1

Next for z ∈ R define



e 
nk
 
Λbs,e (z) = wk 1{Yk,i ≤z} − E 1{Yk,i ≤z} ,
k=s i=1

and let {sk1 , . . . , skm−1 } ⊂ R satisfy

skj = Fk−1 (j/m),

where m is a positive integer to be specified. Let I1k = (−∞, sk1 ], Ijk = (skj−1 , skj ],
j = 2, . . . , m − 1, and Im k
= (skm−1 , ∞). With this notation, for any k ∈
{1, . . . , n}, we get a partition of R, namely Ik = {I1k , . . . , Im
k
}. Let I = ∩Tk=1 Ik =
{I1 , . . . , IM }. Note that there are at most T /δ distinct Ik ’s, and therefore
M ≤ T m/δ.
Let also zj be an interior point of Ij for all j ∈ {1, . . . , M }. Then
 
sup |Λbs,e (z)| ≤ max |Λbs,e (zj )| + sup |Λbs,e (zj ) − Λbs,e (z)| . (B.10)
z∈R j=1,...,M z∈Ij

By Hoeffding’s inequality and a union bound argument, we have for any ε > 0

2T 4 m  
P max max |Λbs,e (zj )| > ε ≤ exp −2ε2 , (B.11)
1≤s<b<e≤T j=1,...,M δ
since

e 
nk
wk2 = 1.
k=s i=1

On the other hand, for j ∈ {1, . . . , M }, let z ∈ Ij and without loss of gener-
ality assume that zj < z. Let

uj = |{(i, k) : k ∈ {1, . . . , T }, i ∈ {1, . . . , nk } and yk,i ∈ Ij }|.


1182 O. H. Madrid Padilla et al.

Let r(t) satisfy ηr(t)−1 + 1 ≤ t ≤ ηr(t) , t ∈ {1, . . . , T }. For Ij ∈ I, let q(j, k) be


r(k)
such that Ij ⊂ Iq(j,k) . Let

r(k)
vj = |{(i, k) : k ∈ {1, . . . , T }, i ∈ {1, . . . , nk } and yk,i ∈ Iv(j,k) }|.

It holds that uj ≤ vj and E(vj ) = n1:T /m.


We have,
 
 e  nk 
 
|Λs,e (zj ) − Λs,e (z)| ≤ 
b b
wk 1{yk,i ≤zj } − 1{yk,i ≤z} 
 
k=s i=1
 
 e  nk 
 
+ wk Fk (zj ) − Fk (z)}
 
k=s i=1
   
 e  nk 1/2 e  nk
 
≤ 1{zj <yk,i ≤z}  + |wk | max |Fk (z) − Fk (zj )|
  k=s,...,e
k=s i=1 k=s i=1
 1/2 1/2
1/2 2 n(b+1):e ns:b 1/2 2n
≤ max uj + ≤ max uj + 1:T . (B.12)
1≤j≤M m ns:e 1≤j≤M m

From the multiplicative Chernoff bound


  
3n1:T 3n1:T 3n1:T
P max uj ≥ ≤ M P uj ≥ ≤ M P vj ≥
1≤j≤M 2m 2m 2m
Tm  n1:T   n1:T 
< exp − ≤ exp − + log(T ) + log(m) − log(δ) . (B.13)
δ 12m 12m
Combining (B.10), (B.11), (B.12) and (B.13), we have
 
 b  3n
1/2
2n
1/2
sup Λs,e (z) >  +
1:T
P max + 1:T
1≤s<b<e≤T z∈R 2m m
2T 4 m  n 
1:T
≤ exp(−2ε2 ) + exp − + log(T ) + log(m) − log(δ) . (B.14)
δ 12m
Choosing

2T 4 m n1:T
ε = log1/2 and m= ,
δ 24 log(n1:T )

Eq. (B.14) results in


 
  T4
1/2
P max sup Λbs,e (z) > log + log(n1:T )
1≤s<b<e≤T z∈R 12δ

1/2 48 log(n1:T )
+ 6 log (n1:T ) + 1/2
n1:T
Optimal nonparametric change point analysis 1183

12 log(n1:T ) 24T
≤ + .
T 3 n1:T n1:T log(n1:T )δ

As for the result (B.7), we only need to change (B.8) to wk = (ns:e )−1/2 .
Lemma B.4. Under Condition 1.1, let 1 ≤ s < ηk < e ≤ T be any interval
satisfying
min{ηk − s, e − ηk } ≥ c1 δ,
with c1 > 0. Then we have that
3/2
c1 κδnmin c1 κδnmin
max Δts,e ≥ 1/2
≥ 1/2
.
t=s+1,...,e−1 2(e − s)1/2 nmax 2(e − s)1/2 nmax

Proof. Let
z0 ∈ arg max |Fηk (z) − Fηk+1 (z)|.
z∈R

Without loss of generality, assume that Fηk (z0 ) > Fηk+1 (z0 ). For s < t < e, it
holds that
  
1  1 
1/2 t e
ns:e ns:t
t
Δs,e (z0 ) = nl Fl (z0 ) − nl Fl (z0 )
n(t+1):e ns:t ns:e
l=s l=s
 1/2  t
ns:e
= nl Fl (z0 ),
ns:t n(t+1):e
l=s

e
where Fl (z0 ) = Fl (z0 ) − (ns:e )−1 l=s nl Fl (z0 ).
Due to Condition 1.1, it holds that Fηk (z0 ) > κ/2. Therefore


ηk
nl Fl (z0 ) ≥ (c1 /2)κnmin δ
l=s

and  1/2 1/2


ns:e nmin
≥ (e − s)−1/2 n−1/2
max ≥ .
ns:t n(t+1):e (e − s)1/2 nmax
Then
3/2
1 c1 κδnmin
max Δts,e ≥ ≥ .
t=s+1,...,e−1 1/2
2(e − s)1/2 nmax 2(e − s)1/2 nmax

In the following lemma, the condition (B.16) follows from Lemma B.4,and
(B.17) follows from Lemma B.3.
Lemma B.5. Let z0 ∈ R, (s, e) ⊂ (1, T ). Suppose that there exits a true change
point ηk ∈ (s, e) such that

min{ηk − s, e − ηk } ≥ c1 δ, (B.15)
1184 O. H. Madrid Padilla et al.

Fig 4. A graph showing Case (i) in the proof of Lemma B.5.

and
3/2
nmin κδ
Δηs,e
k
(z0 ) ≥ (c1 /2) , (B.16)
nmax (e − s)1/2
where c1 > 0 is a sufficiently small constant. In addition, assume that
T4   κδ 4 n5min
max |Δts,e (z0 )| − Δηs,e
k
(z0 ) ≤ 3 log + 3 log n1:T ≤ 9/2
.
s<t<e δ (e − s)7/2 nmax
(B.17)
Then there exists d ∈ (s, e) satisfying
c1 δn2min
|d − ηk | ≤ , (B.18)
32n2max
and
n2min ηk
Δηs,e
k
(z0 ) − Δds,e (z0 ) > c|d − ηk |δ Δ (z0 )(e − s)−2 ,
n2max s,e
where c > 0 is a sufficiently small constant.
Proof. Let us assume without loss of generality that d ≥ ηk . Following the
argument of Lemma 2.6 in Venkatraman (1992), it suffices to consider two cases:
(i) ηk+1 > e and (ii) ηk+1 ≤ e.
Case (i) ηk+1 > e. It holds that
1/2
N1 N2
Δηs,e
k
(z0 ) = Fηk (z0 ) − Fηk+1 (z0 )
N1 + N2
and
 1/2
N2 − N3
Δds,e (z0 ) = N1 Fηk (z0 ) − Fηk+1 (z0 ) ,
(N1 + N3 )(N1 + N2 )
where N1 = ns:ηk , N2 = n(ηk +1):e and N3 = n(ηk +1):d . Therefore, due to (B.15),
we have
  1/2
N1 (N2 − N3 )
El = Δs,e (z0 ) − Δs,e (z0 ) = 1 −
ηk d
Δηs,e
k
(z0 )
N2 (N1 + N3 )
N1 + N2
= 1/2 ! 1/2 1/2 "
N3 Δηs,e
k
(z0 )
{N2 (N1 + N3 )} {N2 (N1 + N3 )} + {N1 (N2 − N3 )}
Optimal nonparametric change point analysis 1185

Fig 5. Illustrations of Case (ii) in the proof of Lemma B.5.

n2min
≥ c1 |d − ηk |δΔηs,e
k
(z0 )(e − s)−2 . (B.19)
n2max

Case (ii) ηk+1 ≤ e. Let N1 = ns:ηk , N2 = n(ηk +1):(ηk +h) and N3 = n(ηk +h+1):e ,
where h = c1 δ/8. Then,
1/2
N1 + N2 + N3
Δηs,e
k
(z0 ) = a and
N1 (N2 + N3 )
1/2
N1 + N2 + N3
Δηs,e
k +h
(z0 ) = (a + N2 θ) .
N3 (N1 + N2 )
where

ηk
1 
e
a= nl Fl (z0 ) − c0 , c0 = nl Fl (z0 )
ns:e
l=s l=s

and

a{(N1 + N2 )N3 }1/2 1 1
θ= −
N2 {N1 (N2 + N3 )} 1/2 (N1 + N2 )N3

b
+ ,
a(N1 + N2 + N3 )1/2

ηk +h
with b = Δs,e (z0 ) − Δηs,e
k
(z0 ).
Next, we set l = d − ηk ≤ h/2 and N4 = n(ηk +1):d . Therefore, as in the proof
of Lemma 2.6 in Venkatraman (1992), we have that

El = Δηs,e
k
(z0 ) − Δs,e
ηk +l
(z0 ) = E1l (1 + E2l ) + E3l , (B.20)

where
aN4 (N2 − N4 )(N1 + N2 + N3 )1/2
E1l =
{N1 (N2 + N3 )(N1 + N4 )(N2 + N3 − N4 )}1/2
1
× ,
{(N1 + N4 )(N2 + N3 − N4 )}1/2 + {N1 (N2 + N3 )}1/2
1186 O. H. Madrid Padilla et al.

(N3 − N1 )(N3 − N1 − N4 )
E2l =
{(N1 + N4 )(N2 + N3 − N4 )}1/2 + {(N1 + N2 )N3 }1/2
1
× ,
{N1 (N2 + N3 )}1/2 + {(N1 + N2 )N3 }1/2
and  1/2
bN4 (N1 + N2 )N3
E3l = − .
N2 (N1 + N4 )(N2 + N3 − N4 )
Since N2 − N4 ≥ nmin c1 δ/16, it holds that
n2min ηk
E1l ≥ c1l |d − ηk |δ Δ (z0 )(e − s)−2 , (B.21)
n2max s,e
where c1l > 0 is a sufficiently small constant depending on c1 . As for E2l , due
to (B.18), we have
E2l ≥ −1/2. (B.22)
As for E3l , we have

T4   n2 e − s
E3l ≥ − 3 log + 3 log n1:T |d − ηk | 2min 2 2
δ nmax c1 δ

T4   n2
≥ −c3l 3 log + 3 log n1:T |d − ηk |Δηs,e
k
(z0 )δ(e − s)−2 2min
δ nmax
9/2
nmax log(n1:T )
× (e − s)7/2
n5min κδ 4
n2min ηk
≥ −c1l /2|d − ηk |δ Δ (z0 )(e − s)−2 , (B.23)
n2max s,e
where the first inequality follows from (B.17), the second inequality from (B.16),
and the last from (B.17).
Combining (B.20), (B.21), (B.22) and (B.23), we have
n2min ηk
Δηs,e
k
(z0 ) − Δds,e (z0 ) ≥ c|d − ηk |δ Δ (z0 )(e − s)−2 , (B.24)
n2max s,e
where c > 0 is a sufficiently small constant.
In view of (B.19) and (B.24), we conclude the proof.
Lemma B.6. Suppose (s, e) ⊂ (1, T ) such that e − s ≤ CM δ and that
ηk−1 ≤ s ≤ ηk ≤ . . . ≤ ηk+q ≤ e ≤ ηk+q+1 , q ≥ 0.
Denote
κs,e
max = max κp : p = k, . . . , k + q .
Then for any p ∈ {k − 1, . . . , k + q}, it holds that
 
 1  e 
 
sup  nt Ft (z) − Fηp (z) ≤ (CM + 1)κs,e
max .

z∈R ns:e t=s 
Optimal nonparametric change point analysis 1187

Proof. Since e − s ≤ CM δ, the interval (s, e) contains at most CM + 1 true


change points. Note that
 
 1  e 
 
sup  nt Ft (z) − Fηp (z)
z∈R  ns:e t=s 
  η
1   
k ηk+1
= sup  n t F ηk−1 (z) − F ηp (z) + nt Fηk (z) − Fηp (z) + . . .
z∈R ns:e  t=s t=ηk +1

 e 

+ nt Fηk+q (z) − Fηp (z) 

t=ηk+q +1
 ηk ηk+1 e
|p − k| t=s nt + |p − k − 1| t=ηk +1 nt + . . . + |p − k − q − 1| t=ηk+q +1 nt

ns:e
· κs,e
max
≤(CM + 1)κs,e max .

For any x = (xi ) ∈ Rns:e , define

1 
ns:e
Ps,e
d
(x) = xi + x, ψs,e
d
ψs,e
d
,
ns:e i=1

where ·, · is the inner product in Euclidean space, and ψs,e d


∈ Rns:e with
⎧ 1/2

⎨ n(d+1):e
d ns:e ns:d , i = 1, . . . , ns:d ,
(ψs,e )i =  1/2

⎩− ns:d
ns:e n(d+1):e , i = ns:d + 1, . . . , ns:e ,

i.e. the i-th entry of Ps,e


d
(x) satisfies
 ns:d
1
j=1 xj , i = 1, . . . , ns:d ,
Ps,e
d
(x)i = ns:d
1
ns:e
n(d+1):e j=ns:d +1 xj , i = ns:d + 1, . . . , ns:e .

Lemma B.7. Suppose Condition 1.1 holds and consider any interval
(s, e) ⊂ (1, T ) satisfying that there exists a true change point ηk ∈ (s, e). Let

b ∈ arg max Ds,e


t
and z0 ∈ arg max |Ds,e
b
(z)|.
s<t<e z∈R

Let  
μs,e = Fs (z0 ), . . . , Fs (z0 ), . . . , Fe (z0 ), . . . , Fe (z0 ) ∈ Rns:e
# $% & # $% &
ns ne

and
 
Ys,e = 1{Ys,1 ≤z0 } , . . . , 1{Ys,ns ≤z0 } , . . . , 1{Ye,1 ≤z0 } , . . . , 1{Ye,ne ≤z0 } ∈ Rns:e .
# $% & # $% &
ns ne
1188 O. H. Madrid Padilla et al.

We have
'  '2 '  '2 '  '2
'Ys,e − Ps,e
b
Ys,e ' ≤ 'Ys,e − Ps,e
ηk
Ys,e ' ≤ 'Ys,e − Ps,e
ηk
μs,e ' . (B.25)

Proof. Note that for any d ∈ (s, e), we have


'  '2
'Ys,e − Ps,e
d
Ys,e ' = ns:d (Y1 − Y12 ) + n(d+1):e (Y2 − Y22 )
e n t
2 t=s i=1 1{Yt,i ≤z0 }
2
e  nt
= − Ds,e (z0 ) +
t
− 1{Yt,i ≤z0 } ,
ns:e t=s i=1

where

1   
d nt e nt
1
Y1 = 1{Yt,i ≤z0 } , and Y2 = 1{Yt,i ≤z0 } .
ns:d t=s i=1 n(d+1):e
t=d+1 i=1

It follow from the definition of b, we have that


'  '2 '  '2
'Ys,e − Ps,e
b
Ys,e ' ≤ 'Ys,e − Ps,e
ηk
Ys,e ' .

The second inequality in (B.25) follows from the observation that the sum of
the squares of errors is minimized by the sample mean.
Lemma B.8. Let (s, e) ⊂ (1, T ) contains two or more change points such that

ηk−1 ≤ s ≤ ηk ≤ . . . ≤ ηk+q ≤ e ≤ ηk+q+1 , q ≥ 1.

If ηk − s ≤ c1 δ, for c1 > 0, then


1/2
c1 nmax
Δηs,e
k
≤ Δηs,e
k+1
+ 2n1/2
s:ηk κk .
nmin

Proof. Consider the distribution sequence {Gt }et=s be such that



Fηk +1 , t = s + 1, . . . , ηk ,
Gt =
Ft , t = ηk + 1, . . . , e.

For any s < t < e, define  t 


Gs,e
t
= sup Gs,e (z) ,
z∈R

where
 
1  
1/2 t e
ns:t n(t+1):e 1
Gs,e
t
(z) = nl Gl (z) − nl Gl (z) .
ns:e ns:t n(t+1):e
l=s l=t+1

For any t ≥ ηk and z ∈ R, it holds that



 t  n(t+1):e 1/2
 
Δs,e (z) − Gs,e
t
(z) = ns:ηk Fηk+1 (z) − Fηk (z) ≤ n1/2
s:ηk κk .
ns:e ns:t
Optimal nonparametric change point analysis 1189

Thus we have
   
Δηs,e
k
= sup Δηs,e
k
(z) − Gs,e
ηk
(z) + Gs,e
ηk
(z) ≤ sup Δηs,e
k
(z) − Gs,e
ηk
(z) + Gs,e
ηk
z∈R z∈R
 1/2
ns:ηk n(ηk+1 +1):e
≤ Gs,e
ηk
s:ηk κk ≤
+ n1/2 Gs,e
ηk+1
+ n1/2
s:ηk κk
ns:ηk+1 n(ηk +1):e
1/2
c1 nmax
≤ Δηs,e
k+1
+ 2n1/2
s:ηk κk .
nmin

Lemma B.9. Under Condition 1.1, let (s0 , e0 ) be an interval with e0 − s0 ≤


CM δ and contain at least one change point ηk such that

ηk−1 ≤ s0 ≤ ηk ≤ . . . ≤ ηk+q ≤ e0 ≤ ηk+q+1 , q ≥ 0.

Suppose that there exists k  such that

min ηk − s0 , e0 − ηk ≥ δ/16.

Let
s,e = max κp : min{ηp − s0 , e0 − ηp } ≥ δ/16 .
κmax
Consider any generic (s, e) ⊂ (s0 , e0 ), satisfying

min{ηk − s0 , e0 − ηk } ≥ δ/16, ηk ∈ (s, e).

Let b ∈ arg maxs<t<e Ds,e


t
. For some c1 > 0 and γ > 0, suppose that

3/2
nmin
b
Ds,e ≥ c1 κmax
s,e δ
1/2
, (B.26)
nmax
 
max sup Λts,e (z) ≤ γ, (B.27)
s<t<e z∈R

and  
 e nt 
 
max sup n−1/2 1{Yt,i ≤z} − Ft (z)  ≤ γ. (B.28)
1≤s<e≤T z∈R  
s:e
t=s i=1

If there exists a sufficiently small 0 < c2 < c1 /2 such that


3/2
nmin
γ ≤ c2 κmax
s,e δ
1/2
, (B.29)
nmax

then there exists a change point ηk ∈ (s, e) such that

n9max −2 2
min{e − ηk , ηk − s} ≥ δ/4 and |ηk − b| ≤ C κ γ ,
n10
min

where C > 0 is a sufficiently large constant.


1190 O. H. Madrid Padilla et al.

Proof. Without loss of generality, assume that Δbs,e > 0 and that Δts,e is locally
decreasing at b. Observe that there has to be a change point ηk ∈ (s, b), or oth-
erwise Δbs,e > 0 implies that Δts,e is decreasing, as a consequence of Lemma B.1.
Thus, if s ≤ ηk ≤ b ≤ e, then
3/2 3/2
nmin 1/2 nmin
Δηs,e
k
≥ Δbs,e ≥ Ds,e
b
− γ ≥ (c1 − c2 )κmax
s,e δ
1/2
≥ (c1 /2)κmax
s,e δ ,
nmax nmax
(B.30)
where the second inequality follows from (B.27), and the second inequality fol-
lows from (B.26) and (B.29). Observe that e − s ≤ e0 − s0 ≤ CM δ and that
(s, e) has to contain at least one change point or otherwise maxs<t<e Δts,e = 0,
which contradicts (B.30).
Step 1. In this step, we are to show that
min{ηk − s, e − ηk } ≥ min{1, c21 }δ/16. (B.31)
Suppose that ηk is the only change point in (s, e). Then (B.31) must hold or
otherwise it follows from (B.5) in Lemma B.2, we have
c1 δ 1/2
Δηs,e
k
≤ κk n1/2
max ,
4
which contradicts (B.30).
Suppose (s, e) contains at least two change points. Then ηk − s < min{1, c21 }δ/16
implies that ηk is the most left change point in (s, e). Therefore it follows from
Lemma B.8 that
1/2
c1 nmax
Δηs,e
k
≤ Δηs,e
k+1
+ 2n1/2
s:ηk κk
4 nmin
1/2
c1 nmax δ 1/2
≤ max Δts,e + c1 n1/2
max κk
4 nmin s<t<e 4
1/2 1/2
c1 nmax c1 nmax δ 1/2
≤ t
max Ds,e + γ+ c1 n1/2
max κk
4 nmin s<t<e 4 nmin 4
≤ max t
Ds,e − γ,
s<t<e

which contradicts with (B.30).


Step 2. It follows from Lemma B.5 that there exists d ∈ (ηk , ηk +c1 δn2min n−2
max /32)
and that
Δηs,e
k
− Δds,e ≥ 2γ. (B.32)
We claim that b ∈ (ηk , d) ⊂ (ηk , ηk + c1 δn2min n−2
max /16). By contradiction, sup-
pose that b ≥ d. Then
Δbs,e ≤ Δds,e < Δηs,e
k
− 2γ ≤ max Δts,e − 2γ ≤ max Ds,e
t
− γ = Ds,e
b
− γ, (B.33)
s<t<e s<t<e

where the first inequality follows from Lemma B.1, the second follows from
(B.32), and the fourth follows from (B.27). Note that (B.33) is a contradiction
with (B.30), therefore we have b ∈ (ηk , ηk + c1 δn2min n−2
max /32).
Optimal nonparametric change point analysis 1191

Step 3. It follows from (B.25) in Lemma B.7 that


'  '2 '  '2 '  '2
'Ys,e − Ps,e
b
Ys,e ' ≤ 'Ys,e − Ps,e
ηk
Ys,e ' ≤ 'Ys,e − Ps,e
ηk
μs,e ' ,

with the notation defined in Lemma B.7. By contradiction, we assume that

n9max −2 2
ηk + C κ γ < b, (B.34)
n10
min

where C > 0 is a sufficiently large constant. We are to show that this leads to
the bound that
'  '2 '  '2
'Ys,e − Ps,e
b
Ys,e ' > 'Ys,e − Ps,e
ηk
μs,e ' , (B.35)

which is a contradiction.
We have min{ηk −s, e−ηk } ≥ min{1, c21 }δ/16 and |b−ηk | ≤ c1 δn2min n−2
max /32.
For properly chose c1 , we have

min{e − b, b − s} ≥ min{1, c21 }δ/32.

It holds that
'  '2 '  '2
'Ys,e − Ps,e
b
Ys,e ' − 'Ys,e − Ps,e ηk
μs,e '
'  '2 '  '2
='μs,e − Ps,e
b
μs,e ' − 'μs,e − Ps,e ηk
μs,e ' +
   
2Ys,e − μs,e , Ps,e
ηk
μs,e − Ps,e
b
Ys,e .

Therefore if we can show that


    '  '2 '  '2
2Ys,e −μs,e , Ps,e
b
Ys,e −Ps,e
ηk
μs,e  < 'μs,e −Ps,e
b
μs,e ' − 'μs,e −Ps,e
ηk
μs,e ' ,
(B.36)
then (B.35) holds.
As for the right-hand side of (B.36), we have
'  '2 '  '2  2  2
'μs,e − Ps,eb
μs,e ' − 'μs,e − Ps,e ηk
μs,e ' = Δηs,e
k
(z0 ) − Δbs,e (z0 )
 
≥ Δηs,e
k
(z0 ) − Δbs,e (z0 ) Δηs,e
k
(z0 ). (B.37)

We are then to utilize the result of Lemma B.5. Note that z0 there can be
any z0 ∈ R satisfying conditions thereof. Equation (B.16) holds due to the fact
that here we have
 η     b  3/2
1/2 nmin
3/2
max 1/2 nmin
Δs,e
k
(z0 ) ≥Δbs,e (z0 ) ≥ Ds,e (z0 ) − γ ≥ c1 κmax
s,e δ − c κ
2 s,e δ
nmax nmax
3/2
nmin
≥c1 /2κmax
s,e δ
1/2
, (B.38)
nmax
where the first inequality follows from the fact that ηk is a true change point,
the second inequality from (B.27), the third inequality follows from (B.26) and
1192 O. H. Madrid Padilla et al.

(B.29), and the final inequality follows from the condition that 0 < c2 < c1 /2.
Towards this end, it follows from Lemma B.5 that
n2min ηk
Δηs,e
k
(z0 ) − Δbs,e (z0 ) ≥ c|b − ηk |δ Δ (z0 )(e − s)−2 . (B.39)
n2max s,e
Combining (B.37), (B.38) and (B.39), we have
'  '2 '  '2 cc2 n5
'μs,e − Ps,e
b
μs,e ' − 'μs,e − Ps,e ηk
μs,e ' ≥ 1 δ 2 4min κ2 (e − s)−2 |b − ηk |.
4 nmax
(B.40)
The left-hand side of (B.36) can be decomposed as follows.
   
2Ys,e − μs,e , Ps,e
b
Ys,e − Ps,e
ηk
μs,e 
       
=2Ys,e − μs,e , Ps,e
b
Ys,e − Ps,e
b
μs,e  + 2Ys,e − μs,e , Ps,e
b
μs,e − Ps,e
ηk
μs,e 
⎛ ⎞
ns:ηk
 
ns:b 
ns:e
   b    
=(I) + 2 ⎝ + + ⎠ Ys,e − μs,e Ps,e
i
μs,e − Ps,e
ηk
μs,e i
i=1 i=ns:ηk +1 i=ns:b +1

=(I) + (II.1) + (II.2) + (II.3). (B.41)


Term (I). It holds that
⎧ ⎫2 ⎧ ⎫2
2 ⎨  ⎬ ⎨   ⎬
ns:b
 2
ns:e

(I) = Ys,e − μs,e j + Ys,e − μs,e j ≤ 2γ 2 ,
ns:b ⎩j=1 ⎭ n(b+1):e ⎩j=n +1 ⎭
s:b

(B.42)
where the inequality follows from the definition of the CUSUM statistics and
(B.27).
Term (II). It holds that
 ns:ηk  
 1 
ns:b
1 
ηk
−1/2
1/2
(II.1) = 2ns:ηk ns:ηk (Ys,e − μs,e )i (μs,e )i − (μs,e )i .
i=1
ns:b i=1 ns:ηk i=1

In addition, it holds that


 
 1  ns:b
1 
ηk 
 
 (μs,e )i − (μs,e )i 
 ns:b ns:η k i=1

i=1
 
n(ηk +1):b  
ns:η
1 k 
= − (μs,e )i + Fηk +1 (z0 )
ns:b  ns:ηk i=1 
n(ηk +1):b
≤ (CM + 1)κmax s,e ,
ns:b
where the inequality follows from Lemma B.6. Combining with (B.28), it leads
to that
n(ηk +1):b
(II.1) ≤ 2n1/2
s:ηk γ (CM + 1)κmax
s,e
ns:b
Optimal nonparametric change point analysis 1193

3/2
nmax 4
≤2 δ −1/2 γ|b − ηk |(CM + 1)κmax
s,e . (B.43)
nmin min{1, c21 }

As for the term (II.2), it holds that

(II.2) ≤ 2n1/2
max |b − ηk |
1/2
γ(2CM + 3)κmax
s,e . (B.44)

As for the term (II.3), it holds that


3/2
nmax 4
(II.3) ≤ 2 δ −1/2 γ|b − ηk |(CM + 1)κmax
s,e . (B.45)
nmin min{1, c21 }

Therefore, combining (B.40), (B.41), (B.42), (B.43), (B.44) and (B.44), we


have that (B.36) holds if
 
5 3/2
2 nmin 2 −2 2 nmax −1/2
δ 4 κ (e−s) |b−ηk |  max γ , δ γ|b − ηk |κ, nmax |b − ηk | γκ .
1/2 1/2
nmax nmin

The second inequality holds due to Condition 3.1, the third inequality holds due
to (B.34) and the first inequality is a consequence of the third inequality and
Condition 3.1.

Appendix C: Proofs of Section 3.2

Proof of Lemma 3.1. Let P0 denote the joint distribution of the independent
random variables {Yt,i }n,T
i=1,t=1 such that Y1,1 , . . . , Yδ,n are independent and iden-
tically distributed as δ0 and Yδ+1,1 , . . . , YT,n are independent and identically
distributed as δ1 , where δc , c ∈ R, is the Dirac distribution having point mass
at point c.
Let P1 denote the joint distribution of the independent random variables
{Zt,i }n,T
i=1,t=1 such that Z1,1 , . . . , ZT −δ,n are independent and identically dis-
tributed as δ1 and ZT −δ+1,1 , . . . , ZT,n are independent and identically distributed
as δ0 .
It holds that η(P0 ) = δ and η(P1 ) = T − δ. Since δ ≤ T /3, it holds that

  1 − 1/2
inf sup EP |η̂ − η| ≥ (T /3) 1 − dTV (P0 , P1 ) ≥ (T /3) 1 − 2δn ≥ T,
η̂ P ∈P 3

where dTV (·, ·) is the total variation distance. In the last display, the first in-
equality follows from Le Cam’s lemma (see, e.g. Yu, 1997), and the second
inequality follows from Eq.(1.2) in Steerneman (1983).
Proof of Lemma 3.2. Let P0 denote the joint distribution of the independent
random variables {Yt,i }n,T
i=1,t=1 such that Y1,1 , . . . , Yδ,n are independent and iden-
tically distributed as F and Yδ+1,1 , . . . , YT,n are independent and identically
distributed as G.
1194 O. H. Madrid Padilla et al.

Let P1 be the joint distribution of the independent random variables{Zt,i }n,Ti=1,t=1


such that Z1,1 , . . . , Zδ+ξ,n are independent and identically distributed as F and
Zδ+ξ+1,1 , . . . , ZT,n are independent and identically distributed as G, where ξ is
a positive integer no larger than n − 1 − δ,


⎨0, x ≤ 0,
F (x) = x, 0 < x ≤ 1,


1, x ≥ 1,

and ⎧

⎪ 0, x ≤ 0,

⎨(1 − 2κ)x, 0 < x ≤ 1/2,
G(x) =

⎪ (1/2 − κ) + (1 + 2κ)(x − 1/2), 1/2 < x ≤ 1,


1, x ≥ 1.
It holds that
sup |F (z) − G(z)| = κ,
z∈R

η(P0 ) = δ and η(P1 ) = δ+ξ. By Le Cam’s Lemma (e.g. Yu, 1997) and Lemma 2.6
in Tsybakov (2009), it holds that
  ξ
inf sup EP |η̂ − η| ≥ ξ 1 − dTV (P0 , P1 ) ≥ exp (−KL(P0 , P1 )) , (C.1)
η̂ P ∈Q 2
where KL(·, ·) denotes the Kullback–Leibler divergence.
Since
 nξ
KL(P0 , P1 ) = KL(P0i , P1i ) = log(1 − 4κ2 ) ≤ 2nξκ2 ,
2
i∈{δ+1,...,δ+ξ}

we have
  ξ
inf sup EP |η̂ − η| ≥ exp(−2nξκ2 ).
η̂ P ∈Q 2
Set ξ = min{ nκ1 2 , T − 1 − δ}. By the assumption on ζT , for all T large
enough we must have that ξ =  nκ1 2 . Thus, for all T large enough, using (C.1),

  1 / 1 0 −2
inf sup EP |η̂ − η| ≥ max 1, e .
η̂ P ∈Q 2 nκ2

Appendix D: Proof of Theorem 3.2

Proof. It follows from Theorem 3.1 and the proof thereof that applying Algo-
rithm 1 to {Wt,i } and the τ sequence defined in (3.7), with probability at least

24 log(n1:T ) 48T T M δ2
1− − − exp log − ,
T 3 n1:T n1:T log(n1:T )δ δ 16T 2
the event A, which is defined as follows holds.
Optimal nonparametric change point analysis 1195

A1 if τ > cτ,2 κδ 1/2 nmin n−1


3/2
max , then the corresponding change point estimators
satisfying K  < K, but for any η̂ in the estimator set, there exits k ∈
{1, . . . , K} such that

|η̂ − ηk | ≤ C κ−2 9 −10


k log(n1:T )nmax nmin ;

A2 if cτ,2 κδ 1/2 nmin n−1


3/2
max ≥ τ ≥ cτ,1 log
1/2
(n1:T ), then the corresponding change

point estimators satisfying K = K, and for any η̂ in the estimator set,
there exits k ∈ {1, . . . , K} such that

|η̂ − ηk | ≤ C κ−2 9 −10


k log(n1:T )nmax nmin ;

A3 if τ < cτ,1 log1/2 (n1:T ), then the corresponding change point estimators
 > K, and for any true change point ηk , there exits η̂ in the
satisfying K
estimators such that

|η̂ − ηk | ≤ C κ−2 9 −10


k log(n1:T )nmax nmin .

The rest of the proof is conducted conditionally on the event A.


Different τj ’s may return the same collections of the change point estimators.
For simplicity, in the rest of the proof, we assume that distinct candidate τj ’s
in (3.7) return distinct and nested Bj with |Bj | = Kj .

Step 1. Let η̂0 = 0 and η̂K+1 = T . In this step, it suffices to show that for any
k ∈ {0, . . . , K − 1}, it holds that with large probability


k+1 nt 
 
η̂l+1 2
1{Yt,i ≤ẑ} − F(η̂
Y
l +1):η̂l+1
(ẑ) +λ
l=k t=η̂l +1 i=1
nt 
 
η̂k+2 2
< 1{Yt,i ≤ẑ} − F(η̂
Y
k +1):η̂k+2
(ẑ) . (D.1)
t=η̂k +1 i=1

Without loss of generality, we consider the case when k = 0.


With probability at least

24 log(n1:T ) 48T T M δ2
1− − − exp log − ,
T 3 n1:T n1:T log(n1:T )δ δ 16T 2

it holds that

 nt 
η̂2  2 
1  
η̂l+1 nt  2
1{Yt,i ≤ẑ} − F1:η̂2 (ẑ) − 1{Yt,i ≤ẑ} − F(η̂l +1):η̂l+1 (ẑ)
t=1 i=1 l=0 t=η̂l +1 i=1
 2 n3
η̂1
= D1,η̂ ({Yt,i }) ≥ c2τ,2 κ2 δ 2min , (D.2)
2
nmax
where the last inequality follows from the proof of Theorem 3.1.
1196 O. H. Madrid Padilla et al.

Therefore, for λ = C log(n1:T ), (D.1) holds due to Condition 3.1.

Step 2. In this step, we are to show with large probability, Algorithm 3 will not
over select. For simplicity, assume B2 = {η̂1 } and B1 = {η̂, η̂1 } with 0 < η̂ < η̂1 .
Let ẑ be the one defined in Algorithm 3 using the triplet {0, η̂, η̂1 }.
Since

 nt 
η̂1  2  nt 
η̂  2
1{Yt,i ≤ẑ} − F1:η̂
Y
1
(ẑ) − 1{Yt,i ≤ẑ} − F1:η̂
Y
(ẑ)
t=1 i=1 t=1 i=1

 nt 
η̂1  2  2
− 1{Yt,i ≤ẑ} − F(η̂+1):η̂
Y
1
(ẑ) η̂
= D0,η̂ 1
({Yt,i }) ≤ c2τ,1 log(n1:T )
t=η̂+1 i=1

holds with probability at least



24 log(n1:T ) 48T T Sδ 2
1− − − exp log − .
T 3 n1:T n1:T log(n1:T )δ δ 16T 2

Therefore, for λ = C log(n1:T ), (D.1) holds.


Combining both steps above and the fact that these two steps are conducted
in the event A, we have that
 
P K = K and k ≤ C κ−2 log(n1:T )n9max n−10 , ∀k = 1, . . . , K
k min

48 log(n1:T ) 96T T M δ2
≥1 − − − exp log − .
T 3 n1:T n1:T log(n1:T )δ δ 16T 2

Appendix E: Sensitivity simulations

We now explore the sensitivity of the tuning parameter C (λ = C log(n1:T ))


in Algorithm 3 and the tuning parameter τ in Algorithm 1. We choose C, τ ∈
{0.1, 0.2, 0.3, . . . , 2.9, 3.0} and for each algorithm calculate the corresponding
number of change point for data generated under Scenarios 2 and 3 with T =
1000 and nt = 1 for all t, in Section 4. Specifically, for each scenario we generate
50 data sets and report the median number of estimated change points for
Algorithms 1 and 3 based on different choices of their tuning parameters.
The results in Figure 6 clearly show that Algorithm 3 is less sensitive to C
than Algorithm 1 is to τ .
Optimal nonparametric change point analysis 1197

Fig 6. Sensitivity of the tuning parameters in Algorithms 1 and 3. The left panel shows the
median of the estimated number of change points by Algorithms 1 and 3 under 50 Monte Carlo
simulations based on Scenario 2 in Section 4. The right panel shows plot the corresponding
for Scenario 3 in Section 4.

References

Anastasiou, A. and Fryzlewicz, P. (2019). Detecting multiple generalized


change-points by isolating single ones. arXiv preprint 1901.10852.
Arlot, S., Celisse, A. and Harchaoui, Z. (2019). A kernel multiple change-
point algorithm via model selection. Journal of Machine Learning Research
20 1–56. MR4048973
Aue, A., Hömann, S., Horváth, L. and Reimherr, M. (2009). Break detec-
tion in the covariance structure of multivariate nonlinear time series models.
The Annals of Statistics 37 4046-4087. MR2572452
Avanesov, V. and Buzun, N. (2016). Change-point detection in high-
dimensional covariance structure. arXiv preprint 1610.03783. MR3861282
Bai, J. and Perron, P. (2003). Computation and analysis of multiple struc-
tural change models. Journal of applied econometrics 18 1–22.
Baranowski, R., Chen, Y. and Fryzlewicz, P. (2019). Narrowest-over-
threshold detection of multiple change points and change-point-like features.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 81
649–672. MR3961502
Baranowski, R. and Fryzlewicz, P. (2019). wbs: Wild Binary Segmentation
for Multiple Change-Point Detection R package version 1.4.
Bleakley, K. and Vert, J.-P. (2011). The group fused lasso for multiple
change-point detection. arXiv preprint 1106.4199.
Boukai, B. and Zhou, H. (1997). Nonparametric estimation in a two change-
point model. Journal of Nonparametric Statistics 8 275–292. MR1487005
Carlstein, E. (1988). Nonparametric change-point estimation. The Annals of
Statistics 16 188–197. MR0924865
Celisse, A., Marot, G., Pierre-Jean, M. and Rigaill, G. (2018). New effi-
cient algorithms for multiple change-point detection with reproducing kernels.
1198 O. H. Madrid Padilla et al.

Computational Statistics & Data Analysis 128 200–220. MR3850633


Chan, N. H., Yau, C. Y. and Zhang, R.-M. (2014). Group LASSO for struc-
tural break time series. Journal of the American Statistical Association 109
590–599. MR3223735
Chan, K.-s., Li, J., Eichinger, W. and Bai, E.-W. (2014). A distribution-
free test for anomalous gamma-ray spectra. Radiation Measurements 63 18–
25.
Cho, H. (2016). Change-point detection in panel data via double CUSUM
statistic. Electronic Journal of Statistics 10 2000–2038. MR3522667
Cho, H. and Fryzlewicz, P. (2015). Multiple change-point detection for
high-dimensional time series via Sparsified Binary Segmentation. Journal of
the Royal Statistical Society: Series B (Statistical Methodology) 77 475-507.
MR3310536
Chowdhury, M. F. R., Selouani, S.-A. and O’Shaughnessy, D. (2012).
Bayesian on-line spectral change point detection: a soft computing approach
for on-line ASR. International Journal of Speech Technology 15 5–23.
Cleynen, A., Rigaill, G. and Koskas, M. (2016). Segmentor3IsBack: A Fast
Segmentation Algorithm R package version 2.0.
Cribben, I. and Yu, Y. (2017). Estimating whole-brain dynamics by using
spectral clustering. Journal of the Royal Statistical Society: Series C (Applied
Statistcs) 66 607–627. MR3632344
Darkhovski, B. S. (1994). Nonparametric methods in change-point problems:
A general approach and some concrete algorithms. Lecture Notes-Monograph
Series 99–107. MR1477917
Duembgen, L. and Wellner, J. A. (2014). Confidence bands for distribution
func- tions: A new look at the law of the iterated logarithm. arXiv preprint
1402.2918.
Eichinger, B. and Kirch, C. (2018). A MOSUM procedure for the estimation
of multiple random change points. Bernoulli 24 526–564. MR3706768
Fan, Z., Dror, R. O., Mildorf, T. J., Piana, S. and Shaw, D. E. (2015).
Identifying localized changes in large systems: Change-point detection for
biomolecular simulations. Proceedings of the National Academy of Sciences
112 7454–7459.
Fearnhead, P. and Rigaill, G. (2018). Changepoint detection in the presence
of outliers. Journal of the American Statistical Association 1–15. MR3941246
Fox, E. B., Sudderth, E. B., Jordan, M. I. and Willsky, A. S. (2011).
A sticky HDP-HMM with application to speaker diarization. The Annals of
Applied Statistics 5 1020–1056. MR2840185
Frick, K., Munk, A. and Sieling, H. (2014). Multiscale change point infer-
ence. Journal of the Royal Statistical Society: Series B (Statistical Methodol-
ogy) 76 495-580. MR3210728
Fryzlewicz, P. (2014). Wild binary segmentation for multiple change-point
detection. The Annals of Statistics 42 2243–2281. MR3269979
Garreau, D. and Arlot, S. (2018). Consistent change-point detection with
kernels. Electronic Journal of Statistics 12 4440–4486. MR3892345
Harchaoui, Z. and Cappé, O. (2007). Retrospective mutiple change-point es-
Optimal nonparametric change point analysis 1199

timation with kernels. In 2007 IEEE/SP 14th Workshop on Statistical Signal


Processing 768–772. IEEE.
Hawkins, D. M. and Deng, Q. (2010). A nonparametric change-point control
chart. Journal of Quality Technology 42 165–173.
Haynes, K., Fearnhead, P. and Eckley, I. A. (2017). A computation-
ally efficient nonparametric approach for changepoint detection. Statistics and
Computing 27 1293–1305. MR3647098
Itoh, N. and Kurths, J. (2010). Change-point detection of climate time series
by nonparametric method. In Proceedings of the world congress on engineering
and computer science 1 445–448. Citeseer.
Jewell, S., Hocking, T. D., Fearnhead, P. and Witten, D. (2018). Fast
nonconvex deconvolution of calcium imaging data. arXiv preprint 1802.07380.
MR4164053
Killick, R. and Eckley, I. A. (2014). changepoint: An R Package for Change-
point Analysis. Journal of Statistical Software 58 1–19.
Killick, R., Fearnhead, P. and Eckley, I. A. (2012). Optimal detection
of changepoints with a linear computational cost. Journal of the American
Statistical Association 107 1590–1598. MR3036418
Knuth, D. (1998). Section 5.2. 4: Sorting by merging. The Art of Computer
Programming 3 158–168. MR3077154
Kovács, S., Li, H., Bühlmann, P. and Munk, A. (2020). Seeded Binary
Segmentation: A general methodology for fast and optimal change point de-
tection. arXiv preprint 2002.06633.
Li, S., Xie, Y., Dai, H. and Song, L. (2019). Scan B-statistic for kernel
change-point detection. Sequential Analysis 38 503–544. MR4057156
Liu, S., Yamada, M., Collier, N. and Sugiyama, M. (2013). Change-point
detection in time-series data by relative density-ratio estimation. Neural Net-
works 43 72–83.
Liu, F., Choi, D., Xie, L. and Roeder, K. (2018). Global spectral clustering
in dynamic networks. Proceedings of the National Academy of Sciences 115
927–932. MR3763702
Marot, G., Rigaill, G., Pierre-Jean, M. and BRUNIN, M. (2018). pack-
age KernSeg.
Matteson, D. S. and James, N. A. (2013). ecp: An R Package for Non-
parametric Multiple Change Point Analysis of Multivariate Data Technical
Report, Cornell University. MR3180567
Matteson, D. S. and James, N. A. (2014). A nonparametric approach for
multiple change point analysis of multivariate data. Journal of the American
Statistical Association 109 334–345. MR3180567
Padilla, O. H. M., Athey, A., Reinhart, A. and Scott, J. G. (2018).
Sequential nonparametric tests for a change in distribution: an application to
detecting radiological anomalies. Journal of the American Statistical Associ-
ation 1–15. MR3963159
Pein, F., Sieling, H. and Munk, A. (2017). Heterogeneous change point in-
ference. Journal of the Royal Statistical Society: Series B (Statistical Method-
ology) 79 1207–1227. MR3689315
1200 O. H. Madrid Padilla et al.

Pein, F., Hotz, T., Sieling, H. and Aspelmeier, T. (2019). stepR: Multi-
scale change-point inference R package version 2.0-4.
Preuss, P., Puchstein, R. and Dette, H. (2015). Detection of multiple
structural breaks in multivariate time series. Journal of the American Statis-
tical Association 110 654–668. MR3367255
Reinhart, A., Athey, A. and Biegalski, S. (2014). Spatially-aware temporal
anomaly mapping of gamma spectra. IEEE Transactions on Nuclear Science
61 1284–1289.
Rigaill, G. (2010). Pruned dynamic programming for optimal multiple change-
point detection. arXiv preprint 1004.0887 17.
Rizzo, M. L. and Székely, G. J. (2010). Disco analysis: A nonparametric
extension of analysis of variance. The Annals of Applied Statistics 4 1034–
1055. MR2758432
Russell, B. and Rambaccussing, D. (2018). Breaks and the statistical pro-
cess of inflation: the case of estimating the ‘modern’long-run Phillips curve.
Empirical Economics 1–21.
R Core Team (2019). R: A Language and Environment for Statistical Com-
puting R Foundation for Statistical Computing, Vienna, Austria.
Steerneman, T. (1983). On the total variation and Hellinger distance be-
tween signed measures; an application to product measures. Proceedings of
the American Mathematical Society 88 684–688. MR0702299
Tsybakov, A. (2009). Introduction to Nonparametric Estimation. Springer.
MR2724359
Vanegas, L. J., Behr, M. and Munk, A. (2019). Multiscale quantile regres-
sion. arXiv preprint 1902.09321.
Venkatraman, E. S. (1992). Consistency results in multiple change-point
problems, PhD thesis, Stanford University. MR2687536
Wald, A. (1945). Sequential tests of statistical hypotheses. The Annals of
Mathematical Statistics 16 117-186. MR0013275
Wang, T. and Samworth, R. J. (2018). High-dimensional changepoint esti-
mation via sparse projection. Journal of the Royal Statistical Society: Series
B (Statistical Methodology) 80 57–83. MR3744712
Wang, D., Yu, Y. and Rinaldo, A. (2018). Optimal change point detec-
tion and localization in sparse dynamic networks. arXiv preprint 1809.09602,
Annals of Statistics, to appear. MR4206675
Wang, D., Yu, Y. and Rinaldo, A. (2020). Univariate mean change point
detection: Penalization, cusum and optimality. Electronic Journal of Statistics
14 1917–1961. MR4091859
Wang, D., Yu, Y. and Rinaldo, A. (2021). Optimal Covariance Change Point
Detection in High Dimension. Bernoulli 27 554–575. MR4177380
Yao, Y. C. (1988). Estimating the number of change-points via Schwarz’ cri-
terion. Statistics & Probability Letters 6 181–189. MR0919373
Yao, Y.-C. and Au, S.-T. (1989). Least-squares estimation of a stop function.
Sankhyā: The Indian Journal of Statistics, Series A 370-381. MR1175613
Yao, Y. C. and Davis, R. A. (1986). The asymptotic behavior of the likeli-
hood ratio statistic for testing a shift in mean in a sequence of independent
Optimal nonparametric change point analysis 1201

normal variates. Sankhyā: The Indian Journal of Statistics, Series A 339–353.


MR0905446
Yu, B. (1997). Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam
423–435. Springer. MR1462963
Zeileis, A., Leisch, F., Hornik, K. and Kleiber, C. (2002). strucchange:
An R Package for Testing for Structural Change in Linear Regression Models.
Journal of Statistical Software 7 1–38.
Zou, C., Yin, G., Feng, L. and Wang, Z. (2014). Nonparametric maximum
likelihood approach to multiple change-point problems. The Annals of Statis-
tics 42 970–1002. MR3210993

You might also like