Stoch Environ Res Risk Assess (2016) 30:1115–1130

DOI 10.1007/s00477-015-1096-3

ORIGINAL PAPER

Functional outlier detection by a local depth with application to NOx levels

Carlo Sguera¹ · Pedro Galeano¹ · Rosa E. Lillo¹

Published online: 13 June 2015


© Springer-Verlag Berlin Heidelberg 2015

✉ Carlo Sguera, csguera@est-econ.uc3m.es
Pedro Galeano, pedro.galeano@uc3m.es
Rosa E. Lillo, rosaelvira.lillo@uc3m.es
¹ Department of Statistics, Universidad Carlos III de Madrid, 28903 Getafe, Madrid, Spain

Abstract This paper proposes methods to detect outliers in functional data sets. The task of identifying atypical curves is carried out using the recently proposed kernelized functional spatial depth (KFSD). KFSD is a local depth that can be used to order the curves of a sample from the most to the least central. Since outliers are usually among the least central curves, we present a probabilistic result which allows us to select a threshold value for KFSD such that curves with depth values lower than the threshold are detected as outliers. Based on this result, we propose three new outlier detection procedures. The results of a simulation study show that our proposals generally outperform a battery of competitors. We apply our procedures to a real data set consisting of daily curves of emission levels of nitrogen oxides (NOx), since it is of interest to identify abnormal NOx levels in order to take the necessary environmental policy actions.

Keywords Functional depths · Functional outlier detection · Kernelized functional spatial depth · Nitrogen oxides · Smoothed resampling

1 Introduction

The accurate identification of outliers is an important aspect of any statistical data analysis. Nowadays there are well-established outlier detection techniques in the univariate and multivariate frameworks (for a complete review of the topic, see for example Barnett and Lewis 1994). In recent years, new types of data have become available and tractable thanks to the evolution of computing resources, e.g., big multivariate data sets having more variables than observations (high-dimensional multivariate data) or samples composed of repeated measurements of the same observation taken over an ordered set of points that can be interpreted as realizations of stochastic processes (functional data). In this paper we focus on functional data, which are usually studied with the tools provided by functional data analysis (FDA). For overviews on FDA methods, see Ramsay and Silverman (2005), Ferraty and Vieu (2006), Horváth and Kokoszka (2012) or Cuevas (2014). For environmental statistical problems tackled using FDA techniques, see for example Ignaccolo et al. (2015), Menafoglio et al. (2014) and Ruiz-Medina and Espejo (2012).

As in univariate or multivariate analysis, the detection of outliers is also fundamental in FDA. According to Febrero et al. (2007, 2008), a functional outlier is a curve generated by a stochastic process with a different distribution than the one of normal curves. This definition covers many types of outliers, e.g., magnitude outliers, shape outliers and partial outliers, i.e., curves having atypical behaviors only in some segments of the domain. Shape and partial outliers are typically harder to detect than magnitude outliers (in the case of high magnitude, outliers can even be recognized by simply looking at a graph), and therefore entail more challenging outlier detection problems. In this paper we focus on samples contaminated by low magnitude, shape or partial outliers.


Specifically, we consider a real data set consisting of daily nitrogen oxides (NOx) emission levels measured in the Barcelona area (see Febrero et al. 2008 for a first analysis of this data set). Since NOx are among the most important pollutants, cause ozone formation and contribute to global warming, it is of interest to identify days with abnormally large NOx emissions, so that actions can be taken to control their causes, which are primarily the combustion processes generated by motor vehicles and industries.

We propose to detect functional outliers using the notion of functional depth. A functional depth is a measure providing a $P$-based center-outward ordering criterion for observations of a functional space $\mathbb{H}$, where $P$ is a probability distribution on $\mathbb{H}$. When a sample of curves is available, a functional depth orders the curves from the most to the least central according to their depth values and, if any outlier is in the sample, its depth is expected to be among the lowest values. Therefore, it is reasonable to build outlier detection methods that use functional depths.

In this paper we enlarge the number of available functional outlier detection procedures by presenting three new methods based on a specific depth, the kernelized functional spatial depth (KFSD, Sguera et al. 2014). KFSD is a local-oriented depth, that is, a depth which orders curves by looking at narrow neighborhoods and giving more weight to close than to distant curves. Its approach is opposite to that of global-oriented depths. Indeed, any global depth makes the depth of a given curve depend on all the remaining observations, with equal weights for all of them. This is the case of a global-oriented depth such as the functional spatial depth (FSD, Chakraborty and Chaudhuri 2014), of which KFSD is the local version. A local depth such as KFSD may prove useful to analyze functional samples having a structure deviating from unimodality or symmetry. Moreover, the local approach behind KFSD proved to be a good strategy in supervised classification problems with groups of curves that are not extremely clear-cut (see Sguera et al. 2014). Here, we illustrate that KFSD ranks low magnitude, shape or partial outliers well, that is, their corresponding KFSD values are in general lower than those of normal curves. Then, we propose different procedures to select a threshold for KFSD to distinguish between normal curves and outliers. These procedures employ smoothed resampling techniques and are based on a theoretical result which allows us to obtain a probabilistic upper bound on a desired false alarm probability of detecting normal curves as outliers. Note that the probabilistic foundations of the proposed methods represent a novelty in FDA outlier detection problems. We study the performance of our procedures in a simulation study and by analyzing the NOx data set. We show this data set in Fig. 1, where it is already possible to appreciate that the presence of partial outliers is an issue.

Fig. 1 NOx levels measured in μg/m³ every hour of 76 working days between 23/02/2005 and 26/06/2005 in Poblenou, Barcelona

We compare our methods with some alternative outlier detection procedures. Febrero et al. (2008) proposed to label as outliers those curves with depth values lower than a certain threshold. As functional depths, they considered the Fraiman and Muniz depth (Fraiman and Muniz 2001), the h-modal depth (Cuevas et al. 2006) and the integrated dual depth (Cuevas and Fraiman 2009). To determine the depth threshold, they proposed two different bootstrap procedures based on depth-based trimmed or weighted resampling, respectively. Sun and Genton (2011) introduced the functional boxplot, which is constructed using the ranking of curves provided by the modified band depth (López-Pintado and Romo 2009). The proposed functional boxplot detects outliers using a rule that is similar to the one of the standard boxplot. Hyndman and Shang (2010) proposed to reduce the outlier detection problem from functional to multivariate data by means of functional principal component analysis (FPCA), and to use two alternative multivariate techniques on the scores to detect outliers, i.e., the bagplot and the high density region boxplot, respectively.

The remainder of the article is organized as follows. In Sect. 2 we recall the definition of KFSD. In Sect. 3 we consider the functional outlier detection problem. In Theorem 1 we present the result on which three new outlier detection methods, which employ KFSD as depth function, are based. In Sect. 4 we report the results of our simulation study, whereas in Sect. 5 we perform outlier detection on the NOx data set. In Sect. 6 we draw some conclusions. Finally, in the Appendix we report a sketch of the proof of Theorem 1.

2 The kernelized functional spatial depth

In functional spaces, a depth measure has the purpose of measuring the degree of centrality of curves relative to the distribution of a functional random variable. Various


  
functional depths have been proposed following two alternative approaches: a global approach, which implies that the depth of an observation depends equally on all the observations allowed by $P$ on $\mathbb{H}$, and a local approach, which instead makes the depth of an observation depend more on close than on distant observations. Among the existing global-oriented depths there are the Fraiman and Muniz depth (FMD, Fraiman and Muniz 2001), the random Tukey depth (RTD, Cuesta-Albertos and Nieto-Reyes 2008), the integrated dual depth (IDD, Cuevas and Fraiman 2009), the modified band depth (MBD, López-Pintado and Romo 2009) and the functional spatial depth (FSD, Chakraborty and Chaudhuri 2014). Proposals of local-oriented depths are instead the h-modal depth (HMD, Cuevas et al. 2006) and the kernelized functional spatial depth (KFSD, Sguera et al. 2014).

In this paper we focus on KFSD. Before giving its definition, we recall the definition of the functional spatial depth (FSD, Chakraborty and Chaudhuri 2014). Let $\mathbb{H}$ be an infinite-dimensional Hilbert space. Then, for $x \in \mathbb{H}$ and the functional random variable $Y \in \mathbb{H}$, the FSD of $x$ relative to $Y$ is given by

$$\mathrm{FSD}(x, Y) = 1 - \left\| E\left[ \frac{x - Y}{\| x - Y \|} \right] \right\|,$$

where $\| \cdot \|$ is the norm inherited from the usual inner product in $\mathbb{H}$. For a random sample of $Y$ of size $n$, i.e., $Y_n = \{y_1, \ldots, y_n\}$, the sample version of FSD has the following form:

$$\mathrm{FSD}(x, Y_n) = 1 - \frac{1}{n} \left\| \sum_{i=1}^{n} \frac{x - y_i}{\| x - y_i \|} \right\|. \tag{1}$$

As mentioned before, FSD is a global-oriented depth and KFSD is a local version of it. KFSD is obtained by writing (1) in terms of inner products and then replacing the inner product function with a positive definite and stationary kernel function. This replacement exploits the relationship

$$\kappa(x, y) = \langle \phi(x), \phi(y) \rangle, \quad x, y \in \mathbb{H}, \tag{2}$$

where $\kappa$ is the kernel $\kappa : \mathbb{H} \times \mathbb{H} \to \mathbb{R}$, $\phi$ is the embedding map $\phi : \mathbb{H} \to \mathbb{F}$ and $\mathbb{F}$ is a feature space. Indeed, a definition of KFSD in terms of $\phi$ can be given, that is,

$$\mathrm{KFSD}(x, Y) = 1 - \left\| E\left[ \frac{\phi(x) - \phi(Y)}{\| \phi(x) - \phi(Y) \|} \right] \right\|, \tag{3}$$

and it can be interpreted as a recoded version of $\mathrm{FSD}(x, Y)$ since $\mathrm{KFSD}(x, Y) = \mathrm{FSD}(\phi(x), \phi(Y))$. The sample version of (3) is given by

$$\mathrm{KFSD}(x, Y_n) = 1 - \frac{1}{n} \left\| \sum_{i=1}^{n} \frac{\phi(x) - \phi(y_i)}{\| \phi(x) - \phi(y_i) \|} \right\|.$$

Then, standard calculations (see Appendix) and (2) provide an alternative expression of $\mathrm{KFSD}(x, Y_n)$, in this case in terms of $\kappa$:

$$\mathrm{KFSD}(x, Y_n) = 1 - \frac{1}{n} \left( \sum_{\substack{i,j=1 \\ y_i \neq x,\; y_j \neq x}}^{n} \frac{\kappa(x, x) + \kappa(y_i, y_j) - \kappa(x, y_i) - \kappa(x, y_j)}{\sqrt{\kappa(x, x) + \kappa(y_i, y_i) - 2\kappa(x, y_i)}\, \sqrt{\kappa(x, x) + \kappa(y_j, y_j) - 2\kappa(x, y_j)}} \right)^{1/2}. \tag{4}$$

Note that (4) only requires the choice of $\kappa$, and not of $\phi$, which can be left implicit. As $\kappa$ we use the Gaussian kernel function given by

$$\kappa(x, y) = \exp\left( -\frac{\| x - y \|^2}{\sigma^2} \right), \tag{5}$$

where $x, y \in \mathbb{H}$. In turn, (5) depends on the norm function inherited from the functional Hilbert space where the data are assumed to lie, and on the bandwidth $\sigma$. Regarding $\sigma$, we initially consider nine different values, each one equal to a different percentile of the empirical distribution of $\{\| y_i - y_j \|,\; y_i, y_j \in Y_n\}$. The first percentile is 10 %, and by increments of 10 we obtain the ninth percentile, i.e., 90 %. Note that the lower $\sigma$, the more local the approach, and therefore the percentiles that we use cover different degrees of KFSD-based local approaches: strongly (e.g., 20 %), moderately (e.g., 50 %) and weakly (e.g., 80 %) local approaches. In Sect. 4 we present a method to select $\sigma$ in outlier detection problems.

In general, since any functional depth measures the degree of centrality or extremality of a given curve relative to a distribution or a sample, outliers are expected to have low depth values. More specifically, in the presence of low magnitude, shape or partial outliers, an approach based on the use of a local depth like KFSD may help in detecting outliers. To illustrate this fact, we present the following example: first, we generated 100 data sets of size 50 from a mixture of two stochastic processes, one for normal curves


and one for high magnitude outliers, with the probability that a curve is an outlier equal to 0.05. Second, we generated a group of 100 data sets from a mixture which produces shape outliers. Finally, we generated a group of 100 data sets from a mixture which produces partial outliers. In Fig. 2 we report a contaminated data set for each mixture.

Fig. 2 Examples of contaminated data sets: high magnitude contamination (top), shape contamination (middle) and partial contamination (bottom). The solid curves are normal curves and the dashed curves are outliers

Let $n_{out,j}$, $j = 1, \ldots, 100$, be the number of outliers generated in the $j$th data set. For each data set and functional depth, it is desirable to assign the $n_{out,j}$ lowest depth values to the $n_{out,j}$ generated outliers. For each mixture and generated data set, we recorded how many times the depth of an outlier is among the $n_{out,j}$ lowest values. As depth functions, we considered five global depths (FMD, RTD, IDD, MBD and FSD) and two local depths (HMD and KFSD). The results reported in Table 1 show that for all the functional depths the ranking of high magnitude outliers is an easier task than the ranking of shape and partial outliers. However, while the ranking of high magnitude outliers is reasonably good in different cases, e.g., for the local KFSD (94.87 %) and the global RTD (90.17 %), the ranking of shape and partial outliers is markedly better with local depths (shape: 86.72 % for KFSD and 85.47 % for HMD; partial: 82.03 % for KFSD and 81.25 % for HMD) than with the best global depths (shape: FSD with 39.06 %; partial: FSD with 46.48 %). These results suggest that, by selecting a proper threshold, KFSD can isolate outliers well.

3 Outlier detection for functional data

The outlier detection problem can be described as follows: let $Y_n = \{y_1, \ldots, y_n\}$ be a sample generated from a mixture of two functional random variables in $\mathbb{H}$, one for normal curves and one for outliers, say $Y_{nor}$ and $Y_{out}$, respectively. Let $Y_{mix}$ be a mixture, i.e.,

$$Y_{mix} = \begin{cases} Y_{nor}, & \text{with probability } 1 - \alpha, \\ Y_{out}, & \text{with probability } \alpha, \end{cases} \tag{6}$$

where $\alpha \in [0, 1]$ is the contamination probability (usually, a value rather close to 0). The curves composing $Y_n$ are all unlabeled, and the goal of the analysis is to decide whether each curve is a normal curve or an outlier.

KFSD is a functional extension of the kernelized spatial depth for multivariate data (KSD) proposed by Chen et al. (2009), who also proposed a KSD-based outlier detector that we generalize to KFSD: for a given data set $Y_n$ generated from $Y_{mix}$ and $t \in [0, 1]$, the KFSD-based outlier detector for $x \in \mathbb{H}$ is given by

$$g(x, Y_n) = \begin{cases} 1, & \text{if } \mathrm{KFSD}(x, Y_n) \le t, \\ 0, & \text{if } \mathrm{KFSD}(x, Y_n) > t, \end{cases} \tag{7}$$

where $t$ is a threshold which discriminates between outliers (i.e., $g(x, Y_n) = 1$) and normal curves (i.e., $g(x, Y_n) = 0$), and it is a parameter that needs to be set.

For the multivariate case, KSD-based outlier detection is carried out under different scenarios. One of them consists of an outlier detection problem where two samples are available and the threshold $t$ is selected by controlling the probability that normal observations are classified as outliers, i.e., the false alarm probability (FAP). The selection criterion is based on a result providing a KSD-based probabilistic upper bound on the FAP which depends on $t$. Then, the threshold for KSD is given by the maximum value of $t$ such that the upper bound does not exceed a given desired FAP. We extend this result to KFSD:

Table 1 Percentages of times a depth assigns a value among the $n_{out,j}$ lowest ones to an outlier. Types of outliers: high magnitude, shape and partial

Type of depths             Global depths                            Local depths
Depths                     FMD     RTD     IDD     MBD     FSD      HMD     KFSD
High magnitude outliers    86.32   90.17   81.62   69.23   68.80    85.47   94.87
Shape outliers              7.81   33.59   38.67   12.11   39.06    85.94   86.72
Partial outliers           18.75   44.53   34.77   19.14   46.48    81.25   82.03
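
The kind of ranking summarized in Table 1 can be illustrated with a minimal Python sketch of the sample KFSD in (4) with the Gaussian kernel (5), followed by the detector in (7). It is only an illustration under stated assumptions, not the authors' implementation: curves are taken to be lists of values on a common equidistant grid, the $L^2$ norm is approximated by a Riemann sum, the bandwidth is a nearest-rank percentile of the distances between distinct pairs of curves, and the function names (`l2_dist`, `gaussian_kernel`, `bandwidth`, `kfsd`, `detect`), the toy sample and the threshold 0.27 are hypothetical choices made for this example.

```python
import math


def l2_dist(x, y, step):
    """Riemann-sum approximation of the L2 distance between two discretized curves."""
    return math.sqrt(step * sum((a - b) ** 2 for a, b in zip(x, y)))


def gaussian_kernel(x, y, sigma, step):
    """Gaussian kernel of Eq. (5): exp(-||x - y||^2 / sigma^2)."""
    return math.exp(-l2_dist(x, y, step) ** 2 / sigma ** 2)


def bandwidth(sample, percentile, step):
    """Nearest-rank percentile of distances between distinct pairs (assumed convention)."""
    dists = sorted(l2_dist(sample[i], sample[j], step)
                   for i in range(len(sample))
                   for j in range(i + 1, len(sample)))
    rank = math.ceil(percentile / 100.0 * len(dists))
    return dists[min(len(dists), max(1, rank)) - 1]


def kfsd(x, sample, sigma, step):
    """Sample KFSD of Eq. (4); curves coinciding with x are left out of the double sum."""
    terms = []
    for y in sample:
        k_xy = gaussian_kernel(x, y, sigma, step)
        sq = 2.0 - 2.0 * k_xy      # kappa(x,x) + kappa(y,y) - 2*kappa(x,y), as kappa(z,z) = 1
        if sq > 1e-12:             # numerical stand-in for the condition y != x
            terms.append((y, k_xy, math.sqrt(sq)))
    total = 0.0
    for y_i, k_xi, d_i in terms:
        for y_j, k_xj, d_j in terms:
            num = 1.0 + gaussian_kernel(y_i, y_j, sigma, step) - k_xi - k_xj
            total += num / (d_i * d_j)
    return 1.0 - math.sqrt(max(total, 0.0)) / len(sample)


def detect(sample, t, sigma, step):
    """Detector of Eq. (7): 1 (outlier) if KFSD <= t, otherwise 0 (normal curve)."""
    return [1 if kfsd(y, sample, sigma, step) <= t else 0 for y in sample]


# Hypothetical toy sample: five shifted copies of 4s and one curve 8s - 2.
step = 0.1
grid = [k * step for k in range(11)]
sample = [[4 * s + c for s in grid] for c in (-0.2, -0.1, 0.0, 0.1, 0.2)]
sample.append([8 * s - 2 for s in grid])

sigma = bandwidth(sample, 50, step)          # moderately local bandwidth (50 % percentile)
depths = [kfsd(y, sample, sigma, step) for y in sample]
flags = detect(sample, 0.27, sigma, step)    # 0.27 is an arbitrary illustrative threshold
```

In this toy sample the curve $8s - 2$ receives the lowest depth and is the only curve flagged at the chosen threshold. In practice the threshold would not be fixed by hand as it is here.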


Theorem 1 Let $Y_{n_Y} = \{y_1, \ldots, y_{n_Y}\}$ and $Z_{n_Z} = \{z_1, \ldots, z_{n_Z}\}$ be i.i.d. samples generated from the unknown mixture of random variables $Y_{mix} \in \mathbb{H}$ described by (6), with $\alpha > 0$. Let $g(\cdot, Y_{n_Y})$ be the outlier detector defined in (7). Fix $\delta \in (0, 1)$ and suppose that $\alpha \le r$ for some $r \in [0, 1]$. For a new random element $x$ generated from $Y_{nor}$, the following inequality holds with probability at least $1 - \delta$:

$$E_{x | Y_{n_Y}}\left[ g(x, Y_{n_Y}) \right] \le \frac{1}{1 - r} \left[ \frac{1}{n_Z} \sum_{i=1}^{n_Z} g(z_i, Y_{n_Y}) + \sqrt{\frac{\ln 1/\delta}{2 n_Z}} \right], \tag{8}$$

where $E_{x | Y_{n_Y}}$ denotes the expectation with respect to $x$ for a given $Y_{n_Y}$.

The proof of Theorem 1 is presented in the Appendix. Recall that the FAP is the probability that a normal observation $x$ is classified as an outlier. For the elements of Theorem 1, $\Pr_{x | Y_{n_Y}}(g(x, Y_{n_Y}) = 1)$ is the FAP. Moreover,

$$\Pr_{x | Y_{n_Y}}\left( g(x, Y_{n_Y}) = 1 \right) = E_{x | Y_{n_Y}}\left[ g(x, Y_{n_Y}) \right].$$

Therefore, the probabilistic upper bound of Theorem 1 also applies to the FAP.

It is worth noting that the application of Theorem 1 requires observing two samples, a circumstance that is rather uncommon in classical outlier detection problems, in which usually a single sample generated from an unknown mixture of random variables is available. For this reason, we propose a solution which makes it possible to use Theorem 1 when only one sample is available. Note that the general idea behind it also holds in the multivariate framework, and therefore it would make it possible to perform KSD-based outlier detection when only one $\mathbb{R}^d$-sample is available.

In the functional context, our solution consists of setting $Y_{n_Y} = Y_n$ and obtaining $Z_{n_Z}$ by resampling with replacement from $Y_n$. Note that by doing this, and for sufficiently large values of $n_Z$, the effect of $\delta$ on the probabilistic upper bound is drastically reduced. Concerning $r$, that is, the upper bound for the unknown contamination probability $\alpha$, a range between 0 and 0.1 appears to be appropriate to cover most of the situations found in practice. Regarding the resampling procedure to obtain $Z_{n_Z}$, we consider three different schemes, all of them with replacement. Since we deal with potentially contaminated data sets, besides simple resampling, we also consider two robust KFSD-based resampling procedures inspired by the work of Febrero et al. (2008). The three resampling schemes that we consider are:

1. Simple resampling.
2. KFSD-based trimmed resampling: once $\mathrm{KFSD}(y_i, Y_n)$, $i = 1, \ldots, n$, are obtained, it is possible to identify the $\lceil \alpha_T \rceil\,\%$ least deep curves, for a certain $0 < \alpha_T < 1$ usually close to 0, that we advise to set equal to $r$. These least deep curves are deleted from the sample, and simple resampling is carried out with the remaining curves.
3. KFSD-based weighted resampling: once $\mathrm{KFSD}(y_i, Y_n)$, $i = 1, \ldots, n$, are obtained, weighted resampling is carried out with weights $w_i = \mathrm{KFSD}(y_i, Y_n)$.

All the above procedures generate samples with some repeated curves. However, in a preliminary stage of our study we observed that it is preferable to work with $Z_{n_Z}$ composed of non-repeated curves. To obtain such samples, we add a common smoothing step to the previous three resampling schemes.

To describe the smoothing step, first recall that each curve in $Y_n$ is in practice observed at a discretized and finite set of domain points, and that the sets may differ from one curve to another. For this reason, the estimation of $Y_n$ at a common set of $m$ equidistant domain points may be required. Let $(y_i(s_1), \ldots, y_i(s_m))$ be the observed or estimated $m$-dimensional equidistant discretized version of $y_i$, $\Sigma_{Y_n}$ be the covariance matrix of the discretized form of $Y_n$ and $\gamma$ be a smoothing parameter. Consider a zero-mean Gaussian process whose discretized form has $\gamma \Sigma_{Y_n}$ as covariance matrix. Let $(\zeta(s_1), \ldots, \zeta(s_m))$ be a discretized realization of this Gaussian process. Consider any of the previous three resampling procedures and assume that at the $j$th trial, $j = 1, \ldots, n_Z$, the $i$th curve in $Y_n$ has been sampled. Then, the discretized form of the $j$th curve in $Z_{n_Z}$ would be given by $(z_j(s_1), \ldots, z_j(s_m)) = (y_i(s_1) + \zeta(s_1), \ldots, y_i(s_m) + \zeta(s_m))$, or, in functional form, by $z_j = y_i + \zeta$. Therefore, combining each resampling scheme with this smoothing step, we provide three different approximate ways to obtain $Z_{n_Z}$, and we refer to them as smo, tri and wei, respectively. Then, for fixed $\delta$, $r$ and desired FAP, the threshold $t$ for (7) is selected as the maximum value of $t$ such that the right-hand side of (8) does not exceed the desired FAP. Let $t^*$ be the selected threshold, which is then used in (7) to compute $g(y_i, Y_n)$, $i = 1, \ldots, n$. If $g(y_i, Y_n) = 1$, $y_i$ is detected as an outlier. To summarize, we provide three KFSD-based outlier detection procedures and we refer to them as KFSDsmo, KFSDtri and KFSDwei depending on how $Z_{n_Z}$ is obtained (smo, tri and wei, respectively; recall that $Y_{n_Y} = Y_n$). As competitors of the proposed procedures, we consider the methods mentioned in Sect. 1, which we now describe.

Sun and Genton (2011) proposed a depth-based functional boxplot and an associated outlier detection rule based on the ranking of the sample curves that MBD provides. The ranking is used to define a sample central region, that is, the smallest band containing at least half of the deepest curves. The non-outlying region is defined by inflating the central region by 1.5 times. Curves that do not


belong completely to the non-outlying region are detected as outliers. The original functional boxplot is based on the use of MBD as depth, but clearly any functional depth can be used. Another contribution of this paper is the study of the performance of the outlier detection rule associated with the functional boxplot (from now on, FBP) when used together with the battery of functional depths mentioned in Sect. 2.

Febrero et al. (2008) proposed two depth-based outlier detection procedures that select a threshold for FMD, HMD or IDD by means of two alternative robust smoothed bootstrap procedures whose single bootstrap samples are obtained using the above described tri and wei, respectively. At each bootstrap sample, the 1 % percentile of the empirical distribution of the depth values is obtained, say $p_{0.01}$. If $B$ is the number of bootstrap samples, $B$ values of $p_{0.01}$ are obtained. Each method selects as cutoff $c$ the median of the collection of $p_{0.01}$ and, using $c$ as threshold, a first outlier detection is performed. If some curves are detected as outliers, they are deleted from the sample, and the procedure is repeated until no more outliers are found (note that $c$ is computed only in the first iteration). We refer to these methods as Btri and Bwei, and also in this case we evaluate these procedures using all the functional depths mentioned in Sect. 2.

Finally, we also consider two procedures proposed by Hyndman and Shang (2010) that are not based on the use of a functional depth. Both are based on the first two robust functional principal component scores and on two different graphical representations of them. The first proposal is the outlier detection rule associated with the functional bagplot (from now on, FBG), which works as follows: obtain the bivariate robust scores and order them using the multivariate halfspace depth (Tukey 1975). Define an inner region by considering the smallest region containing at least 50 % of the deepest scores, and obtain a non-outlying region by inflating the inner region by 2.58 times. FBG detects as outliers those curves whose scores are outside the non-outlying region. Note that the scores-based regions and outliers allow us to draw a bivariate bagplot, which produces a functional bagplot once it is mapped onto the original functional space. The second proposal is related to a different graphical tool, the high density region boxplot (from now on, we refer to its associated outlier detection rule as FHD). In this case, once the scores are obtained, perform a bivariate kernel density estimation. Define the $(1 - \beta)$-high density region (HDR), $\beta \in (0, 1)$, as the region of scores with coverage probability equal to $(1 - \beta)$. FHD detects as outliers those curves whose scores are outside the $(1 - \beta)$-HDR. In this case, it is possible to draw a bivariate HDR boxplot which can be mapped onto a functional version, thus providing the functional HDR boxplot.

4 Simulation study

After introducing KFSDsmo, KFSDtri and KFSDwei, their competitors (FBP, Btri, Bwei, FBG and FHD), as well as seven different functional depths (FMD, HMD, RTD, IDD, MBD, FSD and KFSD), in this section we carry out a simulation study to evaluate the performance of the different methods. For FBP, Btri and Bwei, we use the notation procedure + depth: for example, FBP + FMD refers to the method obtained by using FBP together with FMD.

To perform our simulation study, we consider six models: all of them generate curves according to the mixture of random variables $Y_{mix}$ described by (6). The first three mixture models (MM1, MM2 and MM3) share $Y_{nor}$, with curves generated by

$$y(s) = 4s + \epsilon(s), \tag{9}$$

where $s \in [0, 1]$ and $\epsilon(s)$ is a zero-mean Gaussian component with covariance function given by

$$E(\epsilon(s), \epsilon(s')) = 0.25 \exp\left( -(s - s')^2 \right), \quad s, s' \in [0, 1].$$

The remaining three mixture models (MM4, MM5 and MM6) also share $Y_{nor}$, but, in this case, the curves are generated by

$$y(s) = u_1 \sin s + u_2 \cos s, \tag{10}$$

where $s \in [0, 2\pi]$ and $u_1$ and $u_2$ are observations from a continuous uniform random variable between 0.05 and 0.15.

MM1, MM2 and MM3 differ in their $Y_{out}$ components. Under MM1, the outliers are generated by

$$y(s) = 8s - 2 + \epsilon(s),$$

which produces outliers of both shape and low magnitude nature. Under MM2, the outliers are generated by adding to (9) an observation from a $N(0, 1)$, and as a result the outliers are more irregular than normal curves. Finally, under MM3, the outliers are generated by

$$y(s) = 4 \exp(s) + \epsilon(s),$$

which produces curves that are normal in the first part of the domain, but that become exponentially outlying.

Similarly, MM4, MM5 and MM6 differ in their $Y_{out}$ components. Under MM4, the outliers are generated by replacing $u_2$ with $u_3$ in (10), where $u_3$ is an observation from a continuous uniform random variable between 0.15 and 0.17. This change produces partial low magnitude outliers in the first and middle part of the domain of the curves. Under MM5, the outliers are generated by adding to (10) an observation from a $N\left(0, \left(\frac{0.1}{2}\right)^2\right)$, and they turn out to be more irregular curves. Finally, under MM6, the outliers are generated by


$$y(s) = u_1 \sin s + \exp\left( \frac{0.69\, s}{2\pi} \right) u_4 \cos s, \tag{11}$$

where $u_4$ is an observation from a continuous uniform random variable between 0.1 and 0.15. Like MM3, MM6 allows outliers that are normal in the first part of the domain and become outlying with an exponential pattern. In Fig. 3 we report a simulated data set with at least one outlier for each mixture model.

The details of the simulation study are the following: for each mixture model, we generated 100 data sets, each one composed of 50 curves. As mentioned above, Theorem 1 cannot be directly applied to a single sample, and therefore KFSDsmo, KFSDtri and KFSDwei represent practical alternatives. Two values of the contamination probability $\alpha$ were considered: 0.02 and 0.05. All curves were generated using a discretized and finite set of 51 equidistant points in the domain of each mixture model ($[0, 1]$ for MM1, MM2 and MM3; $[0, 2\pi]$ for MM4, MM5 and MM6) and the discretized versions of the functional depths were used.

The specifications of the methods and the functional depths that we consider in the study are described next:

1. FBP when used with FMD, HMD, RTD, IDD, MBD, FSD and KFSD: regarding FBP, as reported in Sect. 3, the central region is built considering the 50 % deepest curves and the non-outlying region by inflating the central region by 1.5 times. Regarding the depths, for HMD, we follow the recommendations in Febrero et al. (2008), that is, $\mathbb{H}$ is the $L^2$ space, $\kappa(x, y) = \frac{2}{\sqrt{2\pi}} \exp\left( -\frac{\| x - y \|^2}{2h^2} \right)$ and $h$ is equal to the 15 % percentile of the empirical distribution of $\{\| y_i - y_j \|,\; y_i, y_j \in Y_n\}$. For RTD and IDD, we work with 50 projections in random Gaussian directions. For MBD, we consider bands defined by two curves. For FSD and KFSD, we assume that the curves lie in the $L^2$ space. Moreover, in KFSD we set $\sigma$ equal to a moderately local percentile (50 %) of the empirical distribution of $\{\| y_i - y_j \|,\; y_i, y_j \in Y_n\}$.
2. Btri and Bwei when used with FMD, HMD, RTD, IDD, MBD, FSD and KFSD: $\gamma = 0.05$, $B = 200$, $\alpha_T = \alpha$. Regarding the depths, we use the specifications reported for FBP.
3. FBG: as reported in Sect. 3, the central region is built considering the 50 % deepest bivariate robust

Fig. 3 Examples of contaminated functional data sets generated by MM1, MM2, MM3, MM4, MM5 and MM6. Solid curves are normal curves and dashed curves are outliers
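
As a companion to Fig. 3, the following Python sketch shows one way a contaminated data set could be drawn from MM4, i.e., from the mixture (6) with normal curves given by (10) and outliers obtained by replacing $u_2$ with $u_3$. It is a hedged illustration rather than the authors' simulation code: the function name `simulate_mm4`, its arguments and the seed handling are hypothetical, and $u_1$, $u_2$ and $u_3$ are assumed to be drawn independently for each curve.

```python
import math
import random


def simulate_mm4(n=50, alpha=0.05, m=51, seed=0):
    """Draw n curves from mixture model MM4 on m equidistant points of [0, 2*pi].

    Returns (grid, curves, labels), where labels[i] == 1 marks an outlier.
    """
    rng = random.Random(seed)
    grid = [2.0 * math.pi * k / (m - 1) for k in range(m)]
    curves, labels = [], []
    for _ in range(n):
        is_outlier = rng.random() < alpha        # Eq. (6): outlier with probability alpha
        u1 = rng.uniform(0.05, 0.15)
        if is_outlier:
            u_cos = rng.uniform(0.15, 0.17)      # u3 replaces u2 for MM4 outliers
        else:
            u_cos = rng.uniform(0.05, 0.15)      # u2 of Eq. (10)
        curves.append([u1 * math.sin(s) + u_cos * math.cos(s) for s in grid])
        labels.append(1 if is_outlier else 0)
    return grid, curves, labels


# Settings mirroring the study: 50 curves, 51 grid points, alpha = 0.05.
grid, curves, labels = simulate_mm4(n=50, alpha=0.05, m=51, seed=1)
```

Since $\sin 0 = 0$ and $\cos 0 = 1$, the first value of each simulated curve equals its cosine coefficient, which gives a quick check that outliers and normal curves were drawn from the intended intervals.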


functional principal component scores and the non-

6
outlying region by inflating by 2.58 times the central
region.

4. FHD: $\beta = \alpha$.
5. KFSDsmo, KFSDtri and KFSDwei: $n_Y = n = 50$ (since $Y_{n_Y} = Y_n$), $\gamma = 0.05$, $\alpha_T = \alpha$ (only for KFSDtri), $n_Z = 6n$, $\delta = 0.05$, $r = \alpha$, desired FAP = 0.10. Moreover, as introduced in Sect. 2, for these methods we consider 9 percentiles to set $\sigma$ in KFSD. The way in which we propose to choose the most suitable percentile for outlier detection is presented below.

Fig. 4 Example of a training sample of peripheral curves for a contaminated data set generated by MM1 with $\alpha = 0.05$. The solid and shaded curves are the original curves (both normal and outliers). The dashed curves are the peripheral curves to use as training sample

In supervised classification, the availability of training curves with known class memberships makes possible the definition of some natural procedures to set $\sigma$ for KFSD, such as cross-validation. However, in an outlier detection problem, it is common to have no information on whether curves are normal or outliers. Therefore, training procedures are not immediately available.
We propose to overcome this drawback by obtaining a "training sample of peripheral curves", and then choosing the percentile that best ranks these peripheral curves as the final percentile for KFSD in KFSDsmo, KFSDtri and KFSDwei. We now describe this procedure, which is based on $J$ replications. Let $Y_n$ be the functional data set on which outlier detection has to be done and let $Y_{(n)} = \{y_{(1)}, \ldots, y_{(n)}\}$ be the depth-based ordered version of $Y_n$, where $y_{(1)}$ and $y_{(n)}$ are the curves with minimum and maximum depth, respectively. The steps to obtain a set of peripheral curves are the following:

I. Let $\{p_1, \ldots, p_K\}$ be the set of percentiles in use (in our case, as explained in Sect. 2, $p_k = (10k)\,\%$, $k \in \{1, \ldots, K = 9\}$), and choose randomly a percentile from the set. For the $j$th replication, $j \in \{1, \ldots, J\}$, denote the selected percentile as $p^j$. We use $J = 20$ in the rest of the paper.
II. Using $p^j$, compute $\mathrm{KFSD}_{p^j}(y_i, Y_n)$, $i = 1, \ldots, n$, where the notation $\mathrm{KFSD}_{p^j}(\cdot, \cdot)$ is used to describe what percentile is used. For the $j$th replication, denote the KFSD-based ordered curves as $y_{(1),j}, \ldots, y_{(n),j}$.
III. Take $y_{(1),j}, \ldots, y_{(l_j),j}$, where $l_j \sim \mathrm{Bin}(n, 1/n)$. Apply the smoothing step described in Sect. 3 to these curves. For the smoothing step, we use $\Sigma_{Y_n}$ and $\gamma = 0.05$. For the $j$th replication, denote the peripheral and smoothed curves as $y^*_{(1),j}, \ldots, y^*_{(l_j),j}$.
IV. Repeat steps I–III $J$ times to obtain a collection of $L = \sum_{j=1}^{J} l_j$ peripheral curves, say $Y_L$ (for an example, see Fig. 4).

Next, $Y_L$ acts as training sample according to the following steps: for each $y^*_{(i),j} \in Y_L$ ($i \le l_j$), and $p_k \in \{p_1, \ldots, p_K\}$, compute $\mathrm{KFSD}_{p_k}(y^*_{(i),j}, Y_{-(i),j})$, where $Y_{-(i),j} = Y_n \setminus \{y_{(i),j}\}$. At the end, an $L \times K$ matrix is obtained, say $D_{LK} = \{d_{lk}\}_{l = 1, \ldots, L;\ k = 1, \ldots, K}$, whose $k$th column is composed of the KFSD values of the $L$ training peripheral curves when the $k$th percentile is employed in KFSD. Next, let $r_{lk}$ be the rank of $d_{lk}$ in the vector $(\mathrm{KFSD}_{p_k}(y_1, Y_n), \ldots, \mathrm{KFSD}_{p_k}(y_n, Y_n), d_{lk})$, e.g., $r_{lk}$ is equal to 1 or $n + 1$ if $d_{lk}$ is the minimum or the maximum value in the vector, respectively. Let $R_{LK}$ be the result of this transformation of $D_{LK}$, and sum the elements of each column, obtaining a $K$-dimensional vector, say $R_K$. Since the goal is to assign ranks as low as possible to the peripheral curves, choose the percentile associated with the minimum value of $R_K$. When a tie is observed, we break it randomly.
The comparison among methods is performed in terms of both correct and false outlier detection percentages, which are reported in Tables 2, 3, 4, 5, 6 and 7. To ease the reading of the tables, for each model and $\alpha$, we report in bold the five best correct outlier detection percentages (c).$^1$ For each model, if a method is among the five best ones for both contamination probabilities $\alpha$, we report its label in bold.
The results in Tables 2, 3, 4, 5, 6 and 7 show that:

1. KFSDtri and KFSDwei are always among the five best methods. KFSDsmo is among the five best methods 10 times out of 12, and when its performance is not among the five best, it is not far from the fifth method either (MM2, $\alpha = 0.05$: 95.18 % against 96.79 %; MM3, $\alpha = 0.05$: 73.79 % against 78.63 %). The rest of the methods are among the five best procedures at most four times out of 12 (FBP + HMD and Btri + HMD).

$^1$ In the presence of a tie, the method with the lower false outlier detection percentage (f) is preferred.

Table 2 MM1, $\alpha = \{0.02, 0.05\}$. Correct (c) and false (f) outlier detection percentages of FBP, Btri, Bwei, FBG, FHD, KFSDsmo, KFSDtri and KFSDwei
Table 3 MM2, $\alpha = \{0.02, 0.05\}$. Correct (c) and false (f) outlier detection percentages of FBP, Btri, Bwei, FBG, FHD, KFSDsmo, KFSDtri and KFSDwei

             Table 2 (MM1)                  |  Table 3 (MM2)
             α = 0.02        α = 0.05       |  α = 0.02        α = 0.05
Method            c       f       c       f  |       c       f       c       f
FBP + FMD     44.34    1.23   43.86    0.73  |   99.09    1.08   96.39    0.84
FBP + HMD     74.53    0.94   72.81    0.61  |   96.36    0.96   96.39    0.88
FBP + RTD     61.32    0.57   63.16    0.31  |   99.09    0.61   94.78    0.25
FBP + IDD     55.66    0.61   61.84    0.34  |   99.09    0.70   95.18    0.38
FBP + MBD     49.06    1.33   50.44    0.69  |   99.09    1.06   96.39    0.82
FBP + FSD     62.26    0.67   61.84    0.40  |   99.09    0.57   94.78    0.36
FBP + KFSD    66.04    0.86   74.12    0.44  |   98.18    0.63   93.98    0.36
Btri + FMD     0.00    0.98    0.00    1.82  |    0.00    1.06    0.00    1.96
Btri + HMD    66.98    1.45   57.89    1.47  |   95.45    1.51   96.79    1.68
Btri + RTD    10.38    1.78   14.91    1.76  |    1.82    1.92    6.83    2.61
Btri + IDD    10.38    1.55   11.84    1.74  |    5.45    1.60    7.63    1.94
Btri + MBD     0.00    0.51    0.00    1.49  |    0.00    0.98    0.40    2.10
Btri + FSD     2.83    0.76    5.26    1.17  |    4.55    1.06    5.22    1.62
Btri + KFSD   70.75    1.43   58.77    1.40  |   97.27    1.60   95.18    1.52
Bwei + FMD     0.00    1.29    0.00    1.49  |    0.00    1.27    0.00    1.52
Bwei + HMD    71.70    1.02   47.37    0.65  |   95.45    1.02   86.35    0.36
Bwei + RTD    13.21    2.04   13.60    1.78  |    5.45    2.21    8.43    2.84
Bwei + IDD    17.92    1.82   10.53    1.55  |    7.27    1.49    9.64    2.36
Bwei + MBD     0.00    1.08    0.00    1.40  |    0.00    1.27    0.40    1.49
Bwei + FSD     2.83    1.39    3.95    1.07  |    8.18    1.39    4.02    1.37
Bwei + KFSD   61.32    0.88   55.26    0.48  |   95.45    0.96   79.52    0.51
FBG          100.00    2.27   97.81    2.37  |    8.18    3.07    4.42    2.95
FHD           48.11    1.00   73.68    2.77  |    7.27    1.88   12.45    5.66
KFSDsmo       89.62    4.50   85.09    2.58  |  100.00    3.91   95.18    2.76
KFSDtri       89.62    4.92   92.11    4.40  |  100.00    5.19   97.99    4.84
KFSDwei       97.17    9.44   96.93    6.54  |  100.00    9.20   99.60    6.48

2. Regarding MM5 and MM6, our procedures are clearly the best options in terms of correct detection (c), in the following order: KFSDwei, KFSDtri and KFSDsmo. In general, this pattern is observed throughout the simulation study. Note that for MM6 and $\alpha = 0.02$ we observe the best relative performances of KFSDsmo, KFSDtri and KFSDwei, i.e., 91.58, 93.68 and 96.84 %, respectively, against 71.58 % of the fourth best method (Btri + KFSD), that is, we observe differences of at least 20 percentage points.
3. About MM3, KFSDwei is clearly the best method in terms of correct detection, however at the price of having a greater false detection (f). This is in general the main weak point of KFSDsmo, KFSDtri and KFSDwei. As for correct detection, we observe an overall pattern in our methods in false detection, but in the opposite direction, indicating therefore a trade-off between c and f. Relatively high false detection percentages are however to be expected in KFSDsmo, KFSDtri and KFSDwei since these methods are based on the definition of a desired false alarm probability, which is equal to 10 % in this study.
Concerning MM2, we observe similar results to MM3, but in this case the performances of the best methods in terms of correct detection (KFSDsmo, KFSDtri, KFSDwei, FBP-based methods and Btri when used with local depths) are closer to each other.
Finally, there are only two cases in which a competitor outperforms all our methods: FBG under MM1, for both values of $\alpha$. However, this procedure does not show a behavior as stable as KFSDsmo, KFSDtri and KFSDwei do. Indeed, FBG shows poor performances under other models, e.g., MM2.
In summary, the above results and remarks show that the proposed KFSD-based procedures are the best methods in detecting outliers for the considered models.
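The percentile-selection rule used by KFSDsmo, KFSDtri and KFSDwei in this section (rank each training peripheral curve among the sample depths, sum the ranks per percentile, keep the percentile with the smallest sum) can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `select_percentile` and its arguments are hypothetical, the depth values are taken as already computed, and a training depth tied with a sample depth receives the lowest possible rank, a detail the text does not specify.

```python
import random


def select_percentile(train_depths, sample_depths, rng=None):
    """Return (index of chosen percentile, rank sums per percentile).

    train_depths  : L rows of K depths; entry [l][k] plays the role of d_lk.
    sample_depths : n rows of K depths; entry [i][k] is the depth of the
                    i-th sample curve under the k-th percentile.
    Both inputs are assumed to be precomputed (hypothetical interface).
    """
    rng = rng or random.Random()
    n_percentiles = len(sample_depths[0])
    rank_sums = [0] * n_percentiles
    for row in train_depths:                # one training peripheral curve
        for k, d_lk in enumerate(row):      # one candidate percentile
            # Rank of d_lk among the n sample depths plus d_lk itself:
            # 1 if it is the minimum, n + 1 if it is the maximum.
            below = sum(1 for s in sample_depths if s[k] < d_lk)
            rank_sums[k] += 1 + below
    best = min(rank_sums)
    candidates = [k for k, total in enumerate(rank_sums) if total == best]
    return rng.choice(candidates), rank_sums  # ties are broken at random


# Made-up toy input: n = 3 sample curves, L = 2 training curves, K = 2.
sample_depths = [[0.1, 0.5], [0.2, 0.6], [0.3, 0.7]]
train_depths = [[0.05, 0.65], [0.15, 0.90]]
chosen, rank_sums = select_percentile(
    train_depths, sample_depths, rng=random.Random(0)
)
print(chosen, rank_sums)
```

With the nine percentiles of the paper, a returned index `k` (0-based) would correspond to $p_{k+1} = 10(k+1)\,\%$.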

Table 4 MM3, $\alpha = \{0.02, 0.05\}$. Correct (c) and false (f) outlier detection percentages of FBP, Btri, Bwei, FBG, FHD, KFSDsmo, KFSDtri and KFSDwei
Table 5 MM4, $\alpha = \{0.02, 0.05\}$. Correct (c) and false (f) outlier detection percentages of FBP, Btri, Bwei, FBG, FHD, KFSDsmo, KFSDtri and KFSDwei

             Table 4 (MM3)                  |  Table 5 (MM4)
             α = 0.02        α = 0.05       |  α = 0.02        α = 0.05
Method            c       f       c       f  |       c       f       c       f
FBP + FMD     65.69    0.92   49.19    0.97  |    1.02    0.00    0.00    0.00
FBP + HMD     89.22    0.57   85.89    0.63  |    6.12    0.00    1.60    0.02
FBP + RTD     86.27    0.45   76.61    0.34  |    0.00    0.00    0.00    0.00
FBP + IDD     79.41    0.51   70.56    0.38  |    0.00    0.00    0.00    0.00
FBP + MBD     74.51    0.88   59.27    0.84  |    0.00    0.00    0.00    0.00
FBP + FSD     79.41    0.51   73.79    0.42  |    0.00    0.00    0.00    0.00
FBP + KFSD    89.22    0.57   83.06    0.59  |    2.04    0.00    0.80    0.00
Btri + FMD     2.94    0.73    5.24    1.22  |   60.20    0.16   47.60    0.11
Btri + HMD    57.84    1.57   53.63    1.56  |   41.84    0.04   18.80    0.17
Btri + RTD    15.69    1.76   21.37    1.81  |   54.08    1.16   34.80    0.82
Btri + IDD    20.59    1.65   20.56    1.70  |   55.10    1.02   37.20    0.59
Btri + MBD     0.98    1.06    3.23    1.54  |   64.29    0.14   46.40    0.13
Btri + FSD    16.67    1.14   17.34    1.22  |   68.37    0.14   45.60    0.08
Btri + KFSD   57.84    1.63   49.19    1.52  |   58.16    0.20   28.00    0.13
Bwei + FMD     2.94    1.10    3.63    0.84  |   51.02    0.12   23.60    0.00
Bwei + HMD    60.78    1.25   42.74    0.76  |   38.78    0.06   10.80    0.02
Bwei + RTD    15.69    1.92   17.34    1.73  |   37.76    0.49   25.20    0.15
Bwei + IDD    23.53    1.33   14.52    1.22  |   43.88    0.67   28.00    0.42
Bwei + MBD     0.98    1.29    2.82    1.14  |   56.12    0.10   25.20    0.02
Bwei + FSD    15.69    1.16   12.10    0.84  |   63.27    0.06   29.20    0.00
Bwei + KFSD   56.86    1.12   41.53    0.67  |   58.16    0.12   21.20    0.00
FBG           86.27    2.65   78.63    1.73  |    9.18    0.53    6.80    1.09
FHD           49.02    1.02   65.73    2.88  |   51.02    1.02   37.60    4.34
KFSDsmo       89.22    3.90   73.79    2.95  |   87.76    2.16   50.00    1.24
KFSDtri       90.20    4.63   83.47    4.71  |   91.84    3.00   64.80    2.91
KFSDwei       97.06    8.96   90.32    6.50  |   95.92    5.08   62.00    3.35
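As a reading aid for the c and f columns, the following Python sketch computes both percentages for one set of labels. It assumes, since the definition is not restated in this section, that c is the percentage of true outliers that a method flags and f is the percentage of normal curves that it flags. The function name `detection_percentages` and the toy labels are hypothetical.

```python
def detection_percentages(is_outlier, flagged):
    """Return (c, f): correct and false outlier detection percentages.

    is_outlier : booleans, True where the curve really is an outlier.
    flagged    : booleans, True where the method declared an outlier.
    """
    pairs = list(zip(is_outlier, flagged))
    n_out = sum(1 for truth, _ in pairs if truth)
    n_norm = len(pairs) - n_out
    hits = sum(1 for truth, flag in pairs if truth and flag)
    false_alarms = sum(1 for truth, flag in pairs if not truth and flag)
    # Guard against samples without outliers (or without normal curves).
    c = 100.0 * hits / n_out if n_out else float("nan")
    f = 100.0 * false_alarms / n_norm if n_norm else float("nan")
    return c, f


# Made-up example: 8 normal curves, 2 outliers, 1 hit and 1 false alarm.
truth = [False] * 8 + [True] * 2
flagged = [False, True, False, False, False, False, False, False, True, False]
c, f = detection_percentages(truth, flagged)
print(c, f)
```

To pool several simulated data sets, as a simulation study would, one could concatenate their label lists before calling the function.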

KFSDtri seems the most reasonable choice to balance the mentioned trade-off between c and f. In terms of correct detection, KFSDwei slightly outperforms KFSDtri, which however shows very good and stable performances when compared with the remaining methods. In terms of false detection, KFSDtri considerably improves on KFSDwei, especially under some models (e.g., see MM2).
In Fig. 5 we report a series of boxplots summarizing which percentiles have been selected in the training steps for KFSDsmo, KFSDtri and KFSDwei, and the following general remarks can be made. First, MM6 is the mixture model for which lower percentiles have been selected, and it is also a scenario in which our methods considerably outperform their competitors. The need for a more local approach for MM6-data may explain the two observed facts about this mixture model. Second, lower and more local percentiles have been chosen for mixture models with nonlinear mean functions (MM4, MM5 and MM6) than for mixture models with linear mean functions (MM1, MM2 and MM3). Finally, the percentiles selected by means of the proposed training procedure seem to vary among data sets. However, except for MM3 and $\alpha = 0.02$, for at least half of the data sets a percentile not greater than the median has been chosen, which implies at most a moderately local approach.

5 Real data study: nitrogen oxides (NOx) data

Besides simulated data, we consider a real data set which consists of nitrogen oxides (NOx) emission level daily curves measured every hour close to an industrial area in Poblenou (Barcelona) and is available in the R package fda.usc (Febrero and Oviedo de la Fuente 2012). Outlier detection on this data set was first performed by Febrero et al. (2008), where these authors proposed Btri and Bwei.

Table 6 MM5, $\alpha = \{0.02, 0.05\}$. Correct (c) and false (f) outlier detection percentages of FBP, Btri, Bwei, FBG, FHD, KFSDsmo, KFSDtri and KFSDwei
Table 7 MM6, $\alpha = \{0.02, 0.05\}$. Correct (c) and false (f) outlier detection percentages of FBP, Btri, Bwei, FBG, FHD, KFSDsmo, KFSDtri and KFSDwei

             Table 6 (MM5)                  |  Table 7 (MM6)
             α = 0.02        α = 0.05       |  α = 0.02        α = 0.05
Method            c       f       c       f  |       c       f       c       f
FBP + FMD     55.56    0.00   54.00    0.00  |   48.42    0.00   44.19    0.00
FBP + HMD     66.67    0.00   68.40    0.04  |   60.00    0.18   62.92    0.00
FBP + RTD     57.58    0.00   54.40    0.00  |   55.79    0.00   54.68    0.00
FBP + IDD     52.53    0.00   56.00    0.00  |   46.32    0.00   40.07    0.00
FBP + MBD     55.56    0.00   55.20    0.00  |   48.42    0.00   45.69    0.00
FBP + FSD     55.56    0.00   55.60    0.00  |   52.63    0.00   52.43    0.00
FBP + KFSD    60.61    0.00   59.20    0.00  |   57.89    0.00   56.93    0.00
Btri + FMD     3.03    0.18    2.80    0.44  |   29.47    0.22   33.71    0.32
Btri + HMD    97.98    0.12   92.40    0.11  |   71.58    0.24   45.69    0.15
Btri + RTD    16.16    1.06   20.00    1.03  |   35.79    0.82   31.09    0.51
Btri + IDD    18.18    1.06   16.00    1.07  |   38.95    0.37   35.96    0.74
Btri + MBD     2.02    0.16    3.20    0.32  |   29.47    0.24   31.09    0.32
Btri + FSD    29.29    0.18   27.20    0.23  |   52.63    0.20   43.82    0.19
Btri + KFSD   93.94    0.24   92.40    0.21  |   71.58    0.22   50.56    0.21
Bwei + FMD     3.03    0.29    2.40    0.23  |   23.16    0.24   19.48    0.08
Bwei + HMD    93.94    0.08   73.60    0.00  |   68.42    0.12   35.96    0.00
Bwei + RTD    15.15    1.06   17.60    1.12  |   38.95    0.69   24.34    0.51
Bwei + IDD    25.25    0.98   20.00    0.99  |   33.68    0.59   25.09    0.40
Bwei + MBD     2.02    0.20    3.60    0.21  |   24.21    0.18   19.85    0.13
Bwei + FSD    29.29    0.14   21.60    0.13  |   47.37    0.16   27.72    0.08
Bwei + KFSD   83.84    0.08   72.00    0.04  |   66.32    0.12   44.19    0.06
FBG            0.00    1.02    0.40    0.04  |   17.89    0.02   14.98    0.06
FHD            4.04    1.96   12.80    5.64  |   52.63    1.02   61.80    2.85
KFSDsmo       98.99    1.82   94.00    0.44  |   91.58    2.08   71.16    0.95
KFSDtri       98.99    2.61   98.00    2.11  |   93.68    2.69   82.02    2.49
KFSDwei      100.00    4.61   98.40    2.11  |   96.84    4.69   83.15    2.75
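The rule used to single out the five best correct detection percentages in the simulation tables, with ties resolved in favour of the lower false detection percentage, can be sketched in Python as follows. The rows are a subset of the MM5, $\alpha = 0.02$ column pair of Table 6; the variable names are hypothetical, and the sketch assumes that the tie-breaking rule stated in the paper's footnote is the only one in use.

```python
# (method, c, f) for a subset of Table 6 (MM5, alpha = 0.02); the omitted
# rows all have c below 84 and cannot enter the top five.
rows = [
    ("FBP + HMD", 66.67, 0.00),
    ("Btri + HMD", 97.98, 0.12),
    ("Btri + KFSD", 93.94, 0.24),
    ("Bwei + HMD", 93.94, 0.08),
    ("Bwei + KFSD", 83.84, 0.08),
    ("KFSDsmo", 98.99, 1.82),
    ("KFSDtri", 98.99, 2.61),
    ("KFSDwei", 100.00, 4.61),
]

# Sort by decreasing c; among equal c values, prefer the lower f.
ranked = sorted(rows, key=lambda row: (-row[1], row[2]))
top_five = [method for method, _, _ in ranked[:5]]
print(top_five)
```

In this subset the fifth place is decided by the tie-break: Btri + KFSD and Bwei + HMD share c = 93.94, and the lower f of Bwei + HMD puts it ahead.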

We carry on their study considering more methods and depths.
NOx are among the most important pollutants, and it is important to identify outlying trajectories because these curves may compromise any statistical analysis or be of special interest for further analysis and for implementing environmental policy countermeasures. The NOx levels that we consider were measured in $\mu$g/m$^3$ every hour of every day for the period 23/02/2005–26/06/2005. The 24 measurements are available for only 115 days of the period, and these are the days that compose the final NOx data set. Moreover, following Febrero et al. (2008), since the NOx data set includes working as well as nonworking days, it seems more appropriate to consider a first sample of 76 working day curves (from now on, W) and a second sample of 39 nonworking day curves (from now on, NW). Both W and NW are shown in Fig. 6, where it is possible to appreciate at least two facts that justify the split of the original data set. First, the W curves have in general higher values than NW curves, which can be explained by the greater activity of motor vehicles and industries in a city like Barcelona during working days. Second, both data sets contain curves with peaks, but for W curves the peaks occur roughly around 7-8 a.m. and on many days, whereas for NW curves the peaks occur later and on few days, which again can be explained by the differences between Barcelona's economic activity on working and nonworking days.
At first glance, each data set may contain outliers, especially partial outliers in the form of abnormal peaks, and therefore a local depth approach by means of KFSDsmo, KFSDtri and KFSDwei appears to be a good strategy to detect outliers. Besides them, we do outlier detection with all the methods used in Sect. 4. For all the procedures we use the same specifications as in Sect. 4, and we assume $\alpha = 0.05$. For each method, we report the labels of the
curves detected as outliers in Table 8 and we highlight these curves in Fig. 7.

Fig. 5 Boxplots of the percentiles selected in the training steps of the simulation study for KFSDsmo, KFSDtri and KFSDwei (vertical axis: percentiles; one boxplot for each model and $\alpha$, from MM1, 0.02 to MM6, 0.05)

Fig. 6 NOx data: working (top) and non working (bottom) day curves

Table 8 NOx data, Working and Nonworking data sets. Curves detected as outliers by FBP, Btri, Bwei, FBG, FHD, KFSDsmo, KFSDtri and KFSDwei

               Detected outliers
Method         Working days                      Nonworking days
FBP + FMD      –                                 –
FBP + HMD      12, 16, 37                        5, 7, 20, 21
FBP + RTD      37                                20
FBP + IDD      –                                 5, 7, 20
FBP + MBD      –                                 –
FBP + FSD      37                                –
FBP + KFSD     12, 16, 37                        5, 7, 20, 21
Btri + FMD     16, 37                            7
Btri + HMD     14, 16, 37                        7, 20
Btri + RTD     16                                7, 20
Btri + IDD     16, 37                            7, 20
Btri + MBD     16, 37                            7
Btri + FSD     14, 16, 37                        –
Btri + KFSD    12, 14, 16, 37                    7, 20
Bwei + FMD     16                                7, 20
Bwei + HMD     16, 37                            7, 20
Bwei + RTD     16                                –
Bwei + IDD     16, 37                            20
Bwei + MBD     16                                7
Bwei + FSD     16, 37                            –
Bwei + KFSD    16, 37                            7, 20
FBG            16, 37                            –
FHD            12, 14, 16, 37                    7, 20
KFSDsmo        14, 16, 37                        7, 20, 21
KFSDtri        12, 14, 16, 37                    7, 20, 21
KFSDwei        11, 12, 13, 14, 15, 16, 37, 38    7, 20, 21

Concerning W, most of the methods detect as outlier day 37, the Friday at the beginning of the long weekend due to Labor Day in 2005, whose curve shows a partial outlying behavior before noon and at the end of the day. Another day detected as outlier by many methods is day 16, another Friday before a long weekend, the Easter holidays in 2005, whose curve has the highest morning peak. In addition to curves 16 and 37, KFSDsmo detects as outlier curve 14, as other nine methods do, recognizing a seemingly outlying pattern in the early hours of the day. Additionally, KFSDtri also includes day 12 among the outliers, which may be atypical because of its behavior in the early afternoon. Note that both day 12 and day 14 are in the week before the above-mentioned Easter holidays. Finally,
KFSDwei detects as outliers the greatest number of curves. This last result may appear exaggerated, but all the curves that are outliers according to KFSDwei seem to have some partial deviations from the majority of curves. For example, day 13, whose curve is considered normal by the rest of the procedures, shows a peak at the end of the day. Similar peaks can also be observed in other curves detected as outliers by other methods (e.g., days 16 and 37), which means that a masking effect may be occurring to day 13's detriment, and only KFSDwei points out this possibly outlying feature of the curve. Regarding the training step for KFSD to set $\sigma$, it selects the 70 % percentile. Observing the first graph of Fig. 6, it can be noticed that some curves have a likely outlying behavior, and this may be the reason why a weakly local approach for KFSD may be adequate enough.

Fig. 7 NOx data set, curves detected as outliers in Table 8: working (top) and nonworking (bottom) days. W panel legend: 11: WE, 09/03/2005; 12: FR, 11/03/2005; 13: TU, 15/03/2005; 14: WE, 16/03/2005; 15: TH, 17/03/2005; 16: FR, 18/03/2005; 37: FR, 29/04/2005; 38: MO, 02/05/2005. NW panel legend: 5: SA, 12/03/2005; 7: SA, 19/03/2005; 20: SA, 30/04/2005; 21: SU, 01/05/2005

In the case of NW, some methods detect no curves as outliers (e.g., all the FSD-based methods), only three FBP-based methods flag day 5 as outlier, whereas days 7, 20 and 21 are detected as outliers by our methods as well as others. Note that day 7 is the Saturday before Easter and days 20 and 21 are the eve of Labor Day and Labor Day itself. Days 7 and 20, which have two peaks, at the beginning and end of the day, are also flagged by other twelve and eight methods, respectively, while day 21, which shows a single peak in the first hours of the day, is considered atypical by only two other methods, which happen to be local (FBP + HMD and FBP + KFSD). This last result may be connected with what has been observed at the KFSD training step for selecting the percentile, i.e., the selection of the 30 % percentile. Therefore, KFSDsmo, KFSDtri and KFSDwei work with a strongly local percentile, and their results partially resemble the ones of the previously mentioned local techniques.

6 Conclusions

This paper proposes to tackle outlier detection in functional samples using the kernelized functional spatial depth as a tool. In Theorem 1 we presented a probabilistic result that allows setting a KFSD-threshold to identify outliers, but in practice it is necessary to observe two samples to apply Theorem 1. To overcome this practical limitation, we proposed KFSDsmo, KFSDtri and KFSDwei, which are methods that can be applied when a single functional sample is available and are based on both a probabilistic approach and smoothed resampling techniques.
We also proposed a new procedure to set the bandwidth $\sigma$ of KFSD that is based on obtaining training samples by means of smoothed resampling techniques. The general idea behind this procedure can be applied to other functional depths or methods with parameters that need to be set.
We investigated the performances of KFSDsmo, KFSDtri and KFSDwei by means of a simulation study. We focused on challenging scenarios with low magnitude, shape and partial outliers instead of high magnitude outliers. The results support our proposals. Throughout the simulation study, KFSDsmo, KFSDtri and KFSDwei attained the largest correct detection performances in most of the analyzed setups, but in some cases they paid a price in terms of false detection. However, KFSDsmo, KFSDtri and KFSDwei work with a given desired false alarm probability, and therefore higher
false detection percentages than their competitors are due to the inherent structure of the methods. We also observed a trade-off between c and f for KFSDsmo, KFSDtri and KFSDwei, and a clear pattern. For these reasons in our opinion KFSDtri should be preferred to KFSDsmo or KFSDwei since it performs extremely well in terms of correct detection, while it has lower false detection percentages than KFSDwei. Concerning the remaining methods, there are competitors that in few scenarios outperformed our methods. However, in these few cases the differences are not great, and in addition these competitors are not stable across the considered scenarios. Furthermore, we also showed that our procedures can be applied in environmental contexts with an example where the goal was to detect outlying NOx curves to identify days possibly characterized by abnormal pollution levels.
To conclude, we present two possible future research lines. First, since KFSD is a depth whose local approach is in part based on the choice of the kernel function, it would be interesting to explore how the choice of different kernels affects the behavior of KFSD. Moreover, each kernel will depend on a bandwidth and a norm. For the selection of the bandwidth, we used a criterion based on the study of the empirical distribution of the sample distances, but alternatives should be investigated, for example an adaptation of the so-called Silverman's rule (Silverman 1986) for selecting the bandwidth of a kernel-based functional depth such as KFSD. For the choice of the norm, a sensitivity study would help in understanding how important is the functional space assumption. Second, since outlier detection can be seen as a special case of cluster analysis (it is a cluster problem with maximum two clusters, and one of them with size much smaller than the other, even 0), a natural step ahead in our research may be the definition of KFSD-based cluster analysis procedures.

Acknowledgments The authors would like to thank the editor in chief, the associate editor and an anonymous referee for their helpful comments. This research was partially supported by Spanish Ministry of Science and Innovation grant ECO2011-25706 and by Spanish Ministry of Economy and Competition grant ECO2012-38442.

Appendix

From $\mathrm{FSD}(x, Y_n)$ to $\mathrm{KFSD}(x, Y_n)$

To show how to pass from $\mathrm{FSD}(x, Y_n)$ in (1) to $\mathrm{KFSD}(x, Y_n)$ in (4), we first show that $\mathrm{FSD}(x, Y_n)$ can be expressed in terms of inner products. We present this result for $n = 2$. The norm in (1) can be written as

$$\left\| \sum_{i=1}^{2} \frac{x - y_i}{\|x - y_i\|} \right\|^2 = \left\| \frac{x - y_1}{\|x - y_1\|} + \frac{x - y_2}{\|x - y_2\|} \right\|^2 = \left\| \frac{x - y_1}{\sqrt{\langle x, x \rangle + \langle y_1, y_1 \rangle - 2 \langle x, y_1 \rangle}} + \frac{x - y_2}{\sqrt{\langle x, x \rangle + \langle y_2, y_2 \rangle - 2 \langle x, y_2 \rangle}} \right\|^2 .$$

Let $\delta_1 = \sqrt{\langle x, x \rangle + \langle y_1, y_1 \rangle - 2 \langle x, y_1 \rangle}$ and $\delta_2 = \sqrt{\langle x, x \rangle + \langle y_2, y_2 \rangle - 2 \langle x, y_2 \rangle}$. Then,

$$\left\| \sum_{i=1}^{2} \frac{x - y_i}{\|x - y_i\|} \right\|^2 = \left\| \frac{x - y_1}{\delta_1} + \frac{x - y_2}{\delta_2} \right\|^2 = \left\| \frac{x - y_1}{\delta_1} \right\|^2 + \left\| \frac{x - y_2}{\delta_2} \right\|^2 + \frac{2}{\delta_1 \delta_2} \langle x - y_1, x - y_2 \rangle$$
$$= 2 + \frac{2}{\delta_1 \delta_2} \left( \langle x, x \rangle + \langle y_1, y_2 \rangle - \langle x, y_1 \rangle - \langle x, y_2 \rangle \right) = \sum_{i,j=1}^{2} \frac{\langle x, x \rangle + \langle y_i, y_j \rangle - \langle x, y_i \rangle - \langle x, y_j \rangle}{\delta_i \delta_j},$$

and apply the embedding map $\phi$ to all the observations of the last expression. According to (2), this is equivalent to substituting the inner product function with a positive definite and stationary kernel function $\kappa$, which explains the definition of $\mathrm{KFSD}(x, Y_n)$ in (4) for $n = 2$. The generalization of this result to $n > 2$ is straightforward.

Proof of Theorem 1

As explained in Sect. 3, Theorem 1 is a functional extension of a result derived by Chen et al. (2009) for KSD, and since they are closely related, next we report a sketch of the proof of Theorem 1. The proof for KSD is mostly based on an inequality known as McDiarmid's inequality (McDiarmid 1989), which also applies to general probability spaces, and therefore to functional Hilbert spaces. We report this inequality in the next lemma:

Lemma 1 (McDiarmid [1.2]) Let $\Omega_1, \ldots, \Omega_n$ be probability spaces. Let $\Omega = \prod_{j=1}^{n} \Omega_j$ and let $X : \Omega \to \mathbb{R}$ be a random variable. For any $j \in \{1, \ldots, n\}$, let $(\omega_1, \ldots, \omega_j, \ldots, \omega_n)$ and $(\omega_1, \ldots, \hat{\omega}_j, \ldots, \omega_n)$ be two elements of $\Omega$ that differ only in their $j$th coordinates. Assume that $X$ is uniformly difference-bounded by $\{c_j\}$, that is, for any $j \in \{1, \ldots, n\}$,

$$\left| X(\omega_1, \ldots, \omega_j, \ldots, \omega_n) - X(\omega_1, \ldots, \hat{\omega}_j, \ldots, \omega_n) \right| \le c_j . \tag{12}$$
Then, if $E[X]$ exists, for any $\tau > 0$

$$\Pr\left( X - E[X] \ge \tau \right) \le \exp\left( \frac{-2\tau^2}{\sum_{j=1}^{n} c_j^2} \right).$$

In order to apply Lemma 1 to our problem, define

$$X(z_1, \ldots, z_{n_Z}) = -\frac{1}{n_Z} \sum_{i=1}^{n_Z} g(z_i, Y_{n_Y} | Y_{n_Y}), \tag{13}$$

whose expected value is given by

$$E[X] = E_{z_i | Y_{n_Y}}\left[ -\frac{1}{n_Z} \sum_{i=1}^{n_Z} g(z_i, Y_{n_Y} | Y_{n_Y}) \right] = -E_{z_1 | Y_{n_Y}}\left[ g(z_1, Y_{n_Y} | Y_{n_Y}) \right]. \tag{14}$$

Now, for any $j \in \{1, \ldots, n_Z\}$ and $\hat{z}_j \in \mathbb{H}$, the following inequality holds

$$\left| X(z_1, \ldots, z_j, \ldots, z_{n_Z}) - X(z_1, \ldots, \hat{z}_j, \ldots, z_{n_Z}) \right| \le \frac{1}{n_Z},$$

and it provides assumption (12) of Lemma 1. Therefore, for any $\tau > 0$

$$\Pr\left( E_{z_1 | Y_{n_Y}}\left[ g(z_1, Y_{n_Y} | Y_{n_Y}) \right] - \frac{1}{n_Z} \sum_{i=1}^{n_Z} g(z_i, Y_{n_Y} | Y_{n_Y}) \ge \tau \right) \le \exp\left( -2 n_Z \tau^2 \right),$$

and by the law of total probability

$$E\left[ \Pr\left( E_{z_1 | Y_{n_Y}}\left[ g(z_1, Y_{n_Y} | Y_{n_Y}) \right] - \frac{1}{n_Z} \sum_{i=1}^{n_Z} g(z_i, Y_{n_Y} | Y_{n_Y}) \ge \tau \right) \right] = \Pr\left( E_{z_1 | Y_{n_Y}}\left[ g(z_1, Y_{n_Y}) \right] - \frac{1}{n_Z} \sum_{i=1}^{n_Z} g(z_i, Y_{n_Y}) \ge \tau \right) \le \exp\left( -2 n_Z \tau^2 \right).$$

Next, setting $\delta = \exp(-2 n_Z \tau^2)$, and solving for $\tau$, the following result is obtained:

$$\tau = \sqrt{\frac{\ln 1/\delta}{2 n_Z}} .$$

Therefore,

$$\Pr\left( E_{z_1 | Y_{n_Y}}\left[ g(z_1, Y_{n_Y}) \right] \le \frac{1}{n_Z} \sum_{i=1}^{n_Z} g(z_i, Y_{n_Y}) + \sqrt{\frac{\ln 1/\delta}{2 n_Z}} \right) \ge 1 - \delta . \tag{15}$$

However, Theorem 1 provides a probabilistic upper bound for $E_{x | Y_{n_Y}}[g(x, Y_{n_Y})]$. First, recall that $z_1 \sim Y_{mix}$ and note that

$$E_{(z_1 \sim Y_{mix}) | Y_{n_Y}}\left[ g(z_1, Y_{n_Y}) \right] = (1 - \alpha) E_{(z_1 \sim Y_{nor}) | Y_{n_Y}}\left[ g(z_1, Y_{n_Y}) \right] + \alpha E_{(z_1 \sim Y_{out}) | Y_{n_Y}}\left[ g(z_1, Y_{n_Y}) \right].$$

Then, since $E_{(z_1 \sim Y_{nor}) | Y_{n_Y}}[g(z_1, Y_{n_Y})] = E_{x | Y_{n_Y}}[g(x, Y_{n_Y})]$, for $\alpha > 0$,

$$E_{x | Y_{n_Y}}\left[ g(x, Y_{n_Y}) \right] \le \frac{1}{1 - \alpha} E_{(z_1 \sim Y_{mix}) | Y_{n_Y}}\left[ g(z_1, Y_{n_Y}) \right]. \tag{16}$$

Consequently, combining (15) and (16), and for $r \ge \alpha$, we obtain

$$\Pr\left( E_{x | Y_{n_Y}}\left[ g(x, Y_{n_Y}) \right] \le \frac{1}{1 - r} \left[ \frac{1}{n_Z} \sum_{i=1}^{n_Z} g(z_i, Y_{n_Y}) + \sqrt{\frac{\ln 1/\delta}{2 n_Z}} \right] \right) \ge 1 - \delta,$$

which completes the proof. $\square$

References

Barnett V, Lewis T (1994) Outliers in statistical data, vol 3. Wiley, New York
Chakraborty A, Chaudhuri P (2014) On data depth in infinite dimensional spaces. Ann Inst Stat Math 66:303–324
Chen Y, Dang X, Peng H, Bart HL (2009) Outlier detection with the kernelized spatial depth function. IEEE Trans Pattern Anal Mach Intell 31:288–305
Cuesta-Albertos JA, Nieto-Reyes A (2008) The random Tukey depth. Comput Stat Data Anal 52:4979–4988
Cuevas A (2014) A partial overview of the theory of statistics with functional data. J Stat Plan Inference 147:1–23
Cuevas A, Fraiman R (2009) On depth measures and dual statistics. A methodology for dealing with general data. J Multivar Anal 100:753–766
Cuevas A, Febrero M, Fraiman R (2006) On the use of the bootstrap for estimating functions with functional data. Comput Stat Data Anal 51:1063–1074
Febrero M, Oviedo de la Fuente M (2012) Statistical computing in functional data analysis: the R package fda.usc. J Stat Softw 51:1–28
Febrero M, Galeano P, González-Manteiga W (2007) A functional analysis of NOx levels: location and scale estimation and outlier detection. Comput Stat 22:411–427
Febrero M, Galeano P, González-Manteiga W (2008) Outlier detection in functional data by depth measures, with application to identify abnormal NOx levels. Environmetrics 19:331–345
Ferraty F, Vieu P (2006) Nonparametric functional data analysis: theory and practice. Springer, New York
Fraiman R, Muniz G (2001) Trimmed means for functional data. Test 10:419–440
Horváth L, Kokoszka P (2012) Inference for functional data with applications. Springer, New York
Hyndman RJ, Shang HL (2010) Rainbow plots, bagplots, and boxplots for functional data. J Comput Graph Stat 19:29–45
Ignaccolo R, Franco-Villoria M, Fassò A (2015) Modelling collocation uncertainty of 3D atmospheric profiles. Stoch Environ Res Risk Assess 29:419–429
López-Pintado S, Romo J (2009) On the concept of depth for functional data. J Am Stat Assoc 104:718–734
McDiarmid C (1989) On the method of bounded differences. Survey in combinatorics. Cambridge University Press, Cambridge, pp 148–188
Menafoglio A, Guadagnini A, Secchi P (2014) A kriging approach based on Aitchison geometry for the characterization of particle-size curves in heterogeneous aquifers. Stoch Environ Res Risk Assess 28:1835–1851
Ramsay JO, Silverman BW (2005) Functional data analysis. Springer, New York
Ruiz-Medina MD, Espejo RM (2012) Spatial autoregressive functional plug-in prediction of ocean surface temperature. Stoch Environ Res Risk Assess 26:335–344
Sguera C, Galeano P, Lillo R (2014) Spatial depth-based classification for functional data. Test 23:725–750
Silverman BW (1986) Density estimation for statistics and data analysis. Chapman and Hall, London
Sun Y, Genton MG (2011) Functional boxplots. J Comput Graph Stat 20:316–334
Tukey JW (1975) Mathematics and the picturing of data. Proc Int Congr Math 2:523–531