Functional Outlier Detection
DOI 10.1007/s00477-015-1096-3
ORIGINAL PAPER
1116 Stoch Environ Res Risk Assess (2016) 30:1115–1130
nitrogen oxides (NOx) emission daily levels measured in the Barcelona area (see Febrero et al. 2008 for a first analysis of this data set). Since NOx are among the most important pollutants, cause ozone formation and contribute to global warming, it is of interest to identify days with abnormally large NOx emissions, so that actions can be taken to control their causes, which are primarily the combustion processes generated by motor vehicles and industries.
We propose to detect functional outliers using the notion of functional depth. A functional depth is a measure providing a $P$-based center-outward ordering criterion for observations of a functional space $\mathbb{H}$, where $P$ is a probability distribution on $\mathbb{H}$. When a sample of curves is available, a functional depth orders the curves from the most to the least central according to their depth values and, if any outlier is in the sample, its depth is expected to be among the lowest values. Therefore, it is reasonable to build outlier detection methods that use functional depths.

In this paper we enlarge the number of available functional outlier detection procedures by presenting three new methods based on a specific depth, the kernelized functional spatial depth (KFSD, Sguera et al. 2014). KFSD is a local-oriented depth, that is, a depth which orders curves by looking at narrow neighborhoods and giving more weight to close than to distant curves. This approach is the opposite of what global-oriented depths do: any global depth makes the depth of a given curve depend on all the remaining observations, with equal weights for all of them. This is the case of a global-oriented depth such as the functional spatial depth (FSD, Chakraborty and Chaudhuri 2014), of which KFSD is the local version. A local depth such as KFSD may be useful for analyzing functional samples whose structure deviates from unimodality or symmetry. Moreover, the local approach behind KFSD proved to be a good strategy in supervised classification problems in which the groups of curves are not extremely clear-cut (see Sguera et al. 2014). Here, we illustrate that KFSD ranks low magnitude, shape and partial outliers well, that is, their KFSD values are in general lower than those of normal curves. Then, we propose different procedures to select a threshold for KFSD to distinguish between normal curves and outliers. These procedures employ smoothed resampling techniques and are based on a theoretical result which provides a probabilistic upper bound on the false alarm probability of detecting normal curves as outliers. Note that the probabilistic foundations of the proposed methods represent a novelty in FDA outlier detection problems. We study the performance of our procedures in a simulation study and by analyzing the NOx data set. We show this data set in Fig. 1, where it is already possible to appreciate that the presence of partial outliers is an issue.

Fig. 1 NOx levels measured in μg/m³ every hour of 76 working days between 23/02/2005 and 26/06/2005 in Poblenou, Barcelona

We compare our methods with some alternative outlier detection procedures. Febrero et al. (2008) proposed to label as outliers those curves with depth values lower than a certain threshold. As functional depths, they considered the Fraiman and Muniz depth (Fraiman and Muniz 2001), the h-modal depth (Cuevas et al. 2006) and the integrated dual depth (Cuevas and Fraiman 2009). To determine the depth threshold, they proposed two different bootstrap procedures based on depth-based trimmed and weighted resampling, respectively. Sun and Genton (2011) introduced the functional boxplot, which is constructed using the ranking of curves provided by the modified band depth (López-Pintado and Romo 2009). The functional boxplot detects outliers using a rule similar to that of the standard boxplot. Hyndman and Shang (2010) proposed to reduce the outlier detection problem from functional to multivariate data by means of functional principal component analysis (FPCA), and to use two alternative multivariate techniques on the scores to detect outliers, i.e., the bagplot and the high density region boxplot, respectively.

The remainder of the article is organized as follows. In Sect. 2 we recall the definition of KFSD. In Sect. 3 we consider the functional outlier detection problem. In Theorem 1 we present the result on which three new outlier detection methods, which employ KFSD as depth function, are based. In Sect. 4 we report the results of our simulation study, whereas in Sect. 5 we perform outlier detection on the NOx data set. In Sect. 6 we draw some conclusions. Finally, in the Appendix we report a sketch of the proof of Theorem 1.

2 The kernelized functional spatial depth

In functional spaces a depth measure has the purpose of measuring the degree of centrality of curves relative to the distribution of a functional random variable. Various
functional depths have been proposed following two alternative approaches: a global approach, in which the depth of an observation depends equally on all the observations allowed by $P$ on $\mathbb{H}$, and a local approach, in which the depth of an observation depends more on close than on distant observations. Among the existing global-oriented depths are the Fraiman and Muniz depth (FMD, Fraiman and Muniz 2001), the random Tukey depth (RTD, Cuesta-Albertos and Nieto-Reyes 2008), the integrated dual depth (IDD, Cuevas and Fraiman 2009), the modified band depth (MBD, López-Pintado and Romo 2009) and the functional spatial depth (FSD, Chakraborty and Chaudhuri 2014). Proposals of local-oriented depths are instead the h-modal depth (HMD, Cuevas et al. 2006) and the kernelized functional spatial depth (KFSD, Sguera et al. 2014).

In this paper we focus on KFSD. Before giving its definition, we recall the definition of the functional spatial depth (FSD, Chakraborty and Chaudhuri 2014). Let $\mathbb{H}$ be an infinite-dimensional Hilbert space. Then, for $x \in \mathbb{H}$ and the functional random variable $Y \in \mathbb{H}$, the FSD of $x$ relative to $Y$ is given by

$$FSD(x, Y) = 1 - \left\| \mathbb{E}\left[ \frac{x - Y}{\|x - Y\|} \right] \right\|,$$

where $\|\cdot\|$ is the norm inherited from the usual inner product in $\mathbb{H}$. For a random sample of $Y$ of size $n$, i.e., $Y_n = \{y_1, \ldots, y_n\}$, the sample version of FSD has the following form:

$$FSD(x, Y_n) = 1 - \left\| \frac{1}{n} \sum_{i=1}^{n} \frac{x - y_i}{\|x - y_i\|} \right\|. \tag{1}$$

As mentioned before, FSD is a global-oriented depth and KFSD is a local version of it. KFSD is obtained by writing (1) in terms of inner products and then replacing the inner product function with a positive definite and stationary kernel function. This replacement exploits the relationship

$$\kappa(x, y) = \langle \phi(x), \phi(y) \rangle, \quad x, y \in \mathbb{H}, \tag{2}$$

where $\kappa$ is the kernel $\kappa : \mathbb{H} \times \mathbb{H} \to \mathbb{R}$, $\phi$ is the embedding map $\phi : \mathbb{H} \to \mathbb{F}$ and $\mathbb{F}$ is a feature space. Indeed, a definition of KFSD in terms of $\phi$ can be given, that is,

$$KFSD(x, Y) = 1 - \left\| \mathbb{E}\left[ \frac{\phi(x) - \phi(Y)}{\|\phi(x) - \phi(Y)\|} \right] \right\|, \tag{3}$$

and it can be interpreted as a recoded version of $FSD(x, Y)$, since $KFSD(x, Y) = FSD(\phi(x), \phi(Y))$. The sample version of (3) is given by

$$KFSD(x, Y_n) = 1 - \left\| \frac{1}{n} \sum_{i=1}^{n} \frac{\phi(x) - \phi(y_i)}{\|\phi(x) - \phi(y_i)\|} \right\|.$$

Then, standard calculations (see Appendix) and (2) provide an alternative expression of $KFSD(x, Y_n)$, in this case in terms of $\kappa$:

$$KFSD(x, Y_n) = 1 - \frac{1}{n} \left( \sum_{\substack{i,j=1 \\ y_i \neq x,\; y_j \neq x}}^{n} \frac{\kappa(x,x) + \kappa(y_i,y_j) - \kappa(x,y_i) - \kappa(x,y_j)}{\sqrt{\kappa(x,x) + \kappa(y_i,y_i) - 2\kappa(x,y_i)}\, \sqrt{\kappa(x,x) + \kappa(y_j,y_j) - 2\kappa(x,y_j)}} \right)^{1/2}. \tag{4}$$

Note that (4) only requires the choice of $\kappa$, and not of $\phi$, which can be left implicit. As $\kappa$ we use the Gaussian kernel function given by

$$\kappa(x, y) = \exp\left( -\frac{\|x - y\|^2}{\sigma^2} \right), \tag{5}$$

where $x, y \in \mathbb{H}$. In turn, (5) depends on the norm inherited from the functional Hilbert space where the data are assumed to lie, and on the bandwidth $\sigma$. Regarding $\sigma$, we initially consider nine different values, each equal to a different percentile of the empirical distribution of $\{\|y_i - y_j\|;\ y_i, y_j \in Y_n\}$. The first percentile is 10 %, and by increments of 10 we obtain the ninth percentile, i.e., 90 %. Note that the lower $\sigma$, the more local the approach, and therefore the percentiles that we use cover different degrees of KFSD-based local approaches: strongly (e.g., 20 %), moderately (e.g., 50 %) and weakly (e.g., 80 %) local approaches. In Sect. 4 we present a method to select $\sigma$ in outlier detection problems.

In general, since any functional depth measures the degree of centrality or extremality of a given curve relative to a distribution or a sample, outliers are expected to have low depth values. In particular, in the presence of low magnitude, shape or partial outliers, an approach based on a local depth like KFSD may help in detecting outliers. To illustrate this fact, we present the following example: first, we generated 100 data sets of size 50 from a mixture of two stochastic processes, one for normal curves
and one for high magnitude outliers, with the probability that a curve is an outlier equal to 0.05. Second, we generated a group of 100 data sets from a mixture which produces shape outliers. Finally, we generated a group of 100 data sets from a mixture which produces partial outliers. In Fig. 2 we report a contaminated data set for each mixture.

Fig. 2 Examples of contaminated data sets: high magnitude contamination (top), shape contamination (middle) and partial contamination (bottom). The solid curves are normal curves and the dashed curves are outliers

Let $n_{out,j}$, $j = 1, \ldots, 100$, be the number of outliers generated in the $j$th data set. For each data set and functional depth, it is desirable to assign the $n_{out,j}$ lowest depth values to the $n_{out,j}$ generated outliers. For each mixture and generated data set, we recorded how many times the depth of an outlier is among the $n_{out,j}$ lowest values. As depth functions, we considered five global depths (FMD, RTD, IDD, MBD and FSD) and two local depths (HMD and KFSD). The results reported in Table 1 show that for all the functional depths the ranking of high magnitude outliers is an easier task than the ranking of shape and partial outliers. However, while the ranking of high magnitude outliers is reasonably good in different cases, e.g., for the local KFSD (94.87 %) and the global RTD (90.17 %), the ranking of shape and partial outliers is markedly better with local depths (shape: 86.72 % for KFSD and 85.47 % for HMD; partial: 82.03 % for KFSD and 81.25 % for HMD) than with the best global depths (shape: FSD with 39.06 %; partial: FSD with 46.48 %).

[...] curves and one for outliers, say $Y_{nor}$ and $Y_{out}$, respectively. Let $Y_{mix}$ be a mixture, i.e.,

$$Y_{mix} = \begin{cases} Y_{nor}, & \text{with probability } 1 - \alpha, \\ Y_{out}, & \text{with probability } \alpha, \end{cases} \tag{6}$$

where $\alpha \in [0, 1]$ is the contamination probability (usually, a value rather close to 0). The curves composing $Y_n$ are all unlabeled, and the goal of the analysis is to decide whether each curve is a normal curve or an outlier.

KFSD is a functional extension of the kernelized spatial depth for multivariate data (KSD) proposed by Chen et al. (2009), who also proposed a KSD-based outlier detector that we generalize to KFSD: for a given data set $Y_n$ generated from $Y_{mix}$ and $t \in [0, 1]$, the KFSD-based outlier detector for $x \in \mathbb{H}$ is given by

$$g(x, Y_n) = \begin{cases} 1, & \text{if } KFSD(x, Y_n) \le t, \\ 0, & \text{if } KFSD(x, Y_n) > t, \end{cases} \tag{7}$$

where $t$ is a threshold that discriminates between outliers (i.e., $g(x, Y_n) = 1$) and normal curves (i.e., $g(x, Y_n) = 0$), and it is a parameter that needs to be set.

For the multivariate case, KSD-based outlier detection is carried out under different scenarios. One of them consists of an outlier detection problem where two samples are available and the threshold $t$ is selected by controlling the probability that normal observations are classified as outliers, i.e., the false alarm probability (FAP). The selection criterion is based on a result providing a KSD-based probabilistic upper bound on the FAP which depends on $t$. Then, the threshold for KSD is given by the maximum value of $t$ such that the upper bound does not exceed a given desired FAP. We extend this result to KFSD:

Table 1 Types of outliers: high magnitude, shape and partial

                          FMD     RTD     IDD     MBD     FSD     HMD     KFSD
High magnitude outliers   86.32   90.17   81.62   69.23   68.80   85.47   94.87
Shape outliers             7.81   33.59   38.67   12.11   39.06   85.94   86.72
Partial outliers          18.75   44.53   34.77   19.14   46.48   81.25   82.03
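As an illustration of the ranking criterion summarized in Table 1, the following minimal Python sketch computes the sample KFSD through (4) with the Gaussian kernel (5) and then checks whether the true outliers receive the lowest depth values. The toy data, the function names (`kfsd`, `bandwidth`, `share_ranked_lowest`), the Riemann-sum approximation of the $L^2$ norm and the use of distinct pairs only for the percentiles are assumptions made for illustration; they are not the mixtures or the implementation used in the study.

```python
import numpy as np

def l2_distances(A, B, ds):
    """Approximate L2 distances between discretized curves (rows of A and B)."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2) * ds)

def kfsd(X, Y, sigma, ds):
    """Sample KFSD of each row of X relative to the sample Y, following (4)-(5)."""
    k_xy = np.exp(-l2_distances(X, Y, ds) ** 2 / sigma ** 2)  # kappa(x, y_i)
    k_yy = np.exp(-l2_distances(Y, Y, ds) ** 2 / sigma ** 2)  # kappa(y_i, y_j)
    n = Y.shape[0]
    depths = np.empty(X.shape[0])
    for a, kx in enumerate(k_xy):
        # kappa(x,x) + kappa(y_i,y_i) - 2 kappa(x,y_i); the Gaussian kernel has kappa(z,z) = 1
        sq = 2.0 - 2.0 * kx
        keep = sq > 1e-12  # exclude the y_i that coincide with x
        num = 1.0 + k_yy[np.ix_(keep, keep)] - kx[keep][:, None] - kx[keep][None, :]
        den = np.sqrt(np.outer(sq[keep], sq[keep]))
        depths[a] = 1.0 - np.sqrt(max((num / den).sum(), 0.0)) / n
    return depths

def bandwidth(Y, percentile, ds):
    """sigma as a percentile of the pairwise distances (distinct pairs only: an assumption)."""
    d = l2_distances(Y, Y, ds)
    return np.percentile(d[np.triu_indices_from(d, k=1)], percentile)

def share_ranked_lowest(depths, is_outlier):
    """Fraction of the n_out lowest depth values that belong to true outliers."""
    n_out = int(is_outlier.sum())
    lowest = np.argsort(depths)[:n_out]
    return is_outlier[lowest].mean()

# Hypothetical toy sample: 20 noisy curves around 4s plus two shifted curves.
rng = np.random.default_rng(0)
s = np.linspace(0.0, 1.0, 51)
ds = s[1] - s[0]
Y = 4.0 * s + rng.normal(0.0, 0.25, size=(22, s.size))
Y[20] += 3.0
Y[21] -= 3.0
is_outlier = np.zeros(22, dtype=bool)
is_outlier[20:] = True

sigma = bandwidth(Y, 50, ds)      # moderately local percentile
depths = kfsd(Y, Y, sigma, ds)
print(share_ranked_lowest(depths, is_outlier))
```

Changing the second argument of `bandwidth` to another percentile gives a more or less local version of the depth, in the sense described in Sect. 2.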
Theorem 1 Let $Y_{n_Y} = \{y_1, \ldots, y_{n_Y}\}$ and $Z_{n_Z} = \{z_1, \ldots, z_{n_Z}\}$ be i.i.d. samples generated from the unknown mixture of random variables $Y_{mix} \in \mathbb{H}$ described by (6), with $\alpha > 0$. Let $g(\cdot, Y_{n_Y})$ be the outlier detector defined in (7). Fix $\delta \in (0, 1)$ and suppose that $\alpha \le r$ for some $r \in [0, 1]$. For a new random element $x$ generated from $Y_{nor}$, the following inequality holds with probability at least $1 - \delta$:

$$\mathbb{E}_{x \mid Y_{n_Y}}\left[ g(x, Y_{n_Y}) \right] \le \frac{1}{1 - r} \left[ \frac{1}{n_Z} \sum_{i=1}^{n_Z} g(z_i, Y_{n_Y}) + \sqrt{\frac{\ln(1/\delta)}{2 n_Z}} \right], \tag{8}$$

where $\mathbb{E}_{x \mid Y_{n_Y}}$ refers to the expected value with respect to $x$ for a given $Y_{n_Y}$.

The proof of Theorem 1 is presented in the Appendix. Recall that the FAP is the probability that a normal observation $x$ is classified as an outlier. For the elements of Theorem 1, $\Pr_{x \mid Y_{n_Y}}(g(x, Y_{n_Y}) = 1)$ is the FAP. Moreover,

$$\Pr\nolimits_{x \mid Y_{n_Y}}\left( g(x, Y_{n_Y}) = 1 \right) = \mathbb{E}_{x \mid Y_{n_Y}}\left[ g(x, Y_{n_Y}) \right].$$

Therefore, the probabilistic upper bound of Theorem 1 also applies to the FAP.

It is worth noting that the application of Theorem 1 requires two samples, a circumstance that is rather uncommon in classical outlier detection problems, in which usually a single sample generated from an unknown mixture of random variables is available. For this reason, we propose a solution which allows Theorem 1 to be used when only one sample is available. Note that the general idea behind it also holds in the multivariate framework, and therefore it would enable KSD-based outlier detection when only an $\mathbb{R}^d$-sample is available.

In the functional context, our solution consists of setting $Y_{n_Y} = Y_n$ and obtaining $Z_{n_Z}$ by resampling with replacement from $Y_n$. Note that by doing this, and for sufficiently large values of $n_Z$, the effect of $\delta$ on the probabilistic upper bound is drastically reduced. Concerning $r$, that is, the upper bound for the unknown contamination probability $\alpha$, a range between 0 and 0.1 appears to be appropriate to cover most of the situations found in practice. Regarding the resampling procedure to obtain $Z_{n_Z}$, we consider three different schemes, all of them with replacement. Since we deal with potentially contaminated data sets, besides simple resampling, we also consider two robust KFSD-based resampling procedures inspired by the work of Febrero et al. (2008). The three resampling schemes that we consider are:

1. Simple resampling.
2. KFSD-based trimmed resampling: once $KFSD(y_i, Y_n)$, $i = 1, \ldots, n$, are obtained, it is possible to identify the $\lceil \alpha_T \rceil$ % least deep curves, for a certain $0 < \alpha_T < 1$ usually close to 0, which we advise setting equal to $r$. These least deep curves are deleted from the sample, and simple resampling is carried out with the remaining curves.
3. KFSD-based weighted resampling: once $KFSD(y_i, Y_n)$, $i = 1, \ldots, n$, are obtained, weighted resampling is carried out with weights $w_i = KFSD(y_i, Y_n)$.

All the above procedures generate samples with some repeated curves. However, in a preliminary stage of our study we observed that it is preferable to work with $Z_{n_Z}$ composed of non-repeated curves. To obtain such samples, we add a common smoothing step to the previous three resampling schemes.

To describe the smoothing step, first recall that each curve in $Y_n$ is in practice observed at a discretized and finite set of domain points, and that the sets may differ from one curve to another. For this reason, the estimation of $Y_n$ at a common set of $m$ equidistant domain points may be required. Let $(y_i(s_1), \ldots, y_i(s_m))$ be the observed or estimated $m$-dimensional equidistant discretized version of $y_i$, $\Sigma_{Y_n}$ be the covariance matrix of the discretized form of $Y_n$ and $\gamma$ be a smoothing parameter. Consider a zero-mean Gaussian process whose discretized form has $\gamma \Sigma_{Y_n}$ as covariance matrix. Let $(\zeta(s_1), \ldots, \zeta(s_m))$ be a discretized realization of this Gaussian process. Consider any of the previous three resampling procedures and assume that at the $j$th trial, $j = 1, \ldots, n_Z$, the $i$th curve in $Y_n$ has been sampled. Then, the discretized form of the $j$th curve in $Z_{n_Z}$ is given by $(z_j(s_1), \ldots, z_j(s_m)) = (y_i(s_1) + \zeta(s_1), \ldots, y_i(s_m) + \zeta(s_m))$, or, in functional form, by $z_j = y_i + \zeta$. Therefore, combining each resampling scheme with this smoothing step, we provide three different approximate ways to obtain $Z_{n_Z}$, and we refer to them as smo, tri and wei, respectively. Then, for fixed $\delta$, $r$ and desired FAP, the threshold $t$ for (7) is selected as the maximum value of $t$ such that the right-hand side of (8) does not exceed the desired FAP. Let $t^{*}$ be the selected threshold, which is then used in (7) to compute $g(y_i, Y_n)$, $i = 1, \ldots, n$. If $g(y_i, Y_n) = 1$, $y_i$ is detected as an outlier. To summarize, we provide three KFSD-based outlier detection procedures and we refer to them as KFSDsmo, KFSDtri and KFSDwei depending on how $Z_{n_Z}$ is obtained (smo, tri and wei, respectively; recall that $Y_{n_Y} = Y_n$). As competitors of the proposed procedures, we consider the methods mentioned in Sect. 1, which we now describe.

Sun and Genton (2011) proposed a depth-based functional boxplot and an associated outlier detection rule based on the ranking of the sample curves that MBD provides. The ranking is used to define a sample central region, that is, the smallest band containing at least half of the deepest curves. The non-outlying region is defined by inflating the central region by 1.5 times. Curves that do not [...]
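The single-sample threshold selection described in Sect. 3 (smoothed resampling to obtain $Z_{n_Z}$, followed by the largest $t$ for which the right-hand side of (8) stays within the desired FAP) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy data, the Riemann-sum $L^2$ norm and the restriction of candidate thresholds to the observed depths of the resampled curves are all choices made for illustration only.

```python
import numpy as np

def kfsd(X, Y, sigma, ds):
    """Sample KFSD of rows of X relative to Y via (4), Gaussian kernel (5), L2 norm."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2) * ds
        return np.exp(-d2 / sigma ** 2)
    k_xy, k_yy, n = gram(X, Y), gram(Y, Y), Y.shape[0]
    out = np.empty(X.shape[0])
    for a, kx in enumerate(k_xy):
        sq = 2.0 - 2.0 * kx                    # Gaussian kernel: kappa(z, z) = 1
        keep = sq > 1e-12                      # skip the y_i equal to x
        num = 1.0 + k_yy[np.ix_(keep, keep)] - kx[keep][:, None] - kx[keep][None, :]
        den = np.sqrt(np.outer(sq[keep], sq[keep]))
        out[a] = 1.0 - np.sqrt(max((num / den).sum(), 0.0)) / n
    return out

def smoothed_resample(Y, n_z, gamma, rng, weights=None):
    """Draw n_z curves with replacement and add Gaussian noise with covariance gamma * Sigma_Yn."""
    n = Y.shape[0]
    idx = rng.choice(n, size=n_z, replace=True, p=weights)
    centered = Y - Y.mean(axis=0)
    # If w ~ N(0, I_n), sqrt(gamma / (n - 1)) * w @ centered has covariance
    # gamma * Sigma_Yn, with Sigma_Yn the sample covariance of the discretized curves.
    noise = np.sqrt(gamma / (n - 1)) * rng.standard_normal((n_z, n)) @ centered
    return Y[idx] + noise

def fap_bound(t, depths_z, r, delta):
    """Right-hand side of (8) for a threshold t."""
    n_z = depths_z.size
    return (np.mean(depths_z <= t) + np.sqrt(np.log(1.0 / delta) / (2.0 * n_z))) / (1.0 - r)

def select_threshold(depths_z, r, delta, fap):
    """Largest observed depth whose bound (8) does not exceed the desired FAP, or None.

    Assumption: only the observed depths of the resampled curves are tried as thresholds.
    """
    ok = [t for t in np.sort(depths_z) if fap_bound(t, depths_z, r, delta) <= fap]
    return ok[-1] if ok else None

# Hypothetical toy sample of n = 50 discretized curves on 51 equidistant points.
rng = np.random.default_rng(1)
s = np.linspace(0.0, 1.0, 51)
ds = s[1] - s[0]
Y = 4.0 * s + rng.normal(0.0, 0.25, size=(50, s.size))
d = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2) * ds)
sigma = np.percentile(d[np.triu_indices(50, k=1)], 50)

Z = smoothed_resample(Y, n_z=300, gamma=0.05, rng=rng)   # "smo" scheme, n_Z = 6n
depths_z = kfsd(Z, Y, sigma, ds)
t = select_threshold(depths_z, r=0.05, delta=0.05, fap=0.10)
flagged = kfsd(Y, Y, sigma, ds) <= t                     # detector (7)
```

In this sketch the `weights` argument is how the other two schemes would enter: probabilities proportional to the KFSD values for wei, or zero probability for the trimmed curves for tri (in both cases normalized to sum to one).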
$$y(s) = u_1 \sin s + \exp\left( \frac{0.69\, s}{2\pi} \right) u_4 \cos s, \tag{11}$$

where $u_4$ is an observation from a continuous uniform random variable between 0.1 and 0.15. Like MM3, MM6 allows outliers that are normal in the first part of the domain and become outlying with an exponential pattern. In Fig. 3 we report a simulated data set with at least one outlier for each mixture model.

Fig. 3 Examples of contaminated functional data sets generated by MM1, MM2, MM3, MM4, MM5 and MM6. Solid curves are normal curves and dashed curves are outliers

The details of the simulation study are the following: for each mixture model, we generated 100 data sets, each one composed of 50 curves. As mentioned above, for single samples Theorem 1 cannot be directly applied, and therefore KFSDsmo, KFSDtri and KFSDwei represent practical alternatives. Two values of the contamination probability $\alpha$ were considered: 0.02 and 0.05. All curves were generated using a discretized and finite set of 51 equidistant points in the domain of each mixture model ($[0, 1]$ for MM1, MM2 and MM3; $[0, 2\pi]$ for MM4, MM5 and MM6) and the discretized versions of the functional depths were used.

The specifications of the methods and the functional depths that we consider in the study are described next:

1. FBP when used with FMD, HMD, RTD, IDD, MBD, FSD and KFSD: regarding FBP, as reported in Sect. 3, the central region is built considering the 50 % deepest curves and the non-outlying region by inflating the central region by 1.5 times. Regarding the depths, for HMD, we follow the recommendations in Febrero et al. (2008), that is, $\mathbb{H}$ is the $L^2$ space, $\kappa(x, y) = \frac{2}{\sqrt{2\pi}} \exp\left( -\frac{\|x - y\|^2}{2h^2} \right)$ and $h$ is equal to the 15 % percentile of the empirical distribution of $\{\|y_i - y_j\|;\ y_i, y_j \in Y_n\}$. For RTD and IDD, we work with 50 projections in random Gaussian directions. For MBD, we consider bands defined by two curves. For FSD and KFSD, we assume that the curves lie in the $L^2$ space. Moreover, in KFSD we set $\sigma$ equal to a moderately local percentile (50 %) of the empirical distribution of $\{\|y_i - y_j\|;\ y_i, y_j \in Y_n\}$.
2. Btri and Bwei when used with FMD, HMD, RTD, IDD, MBD, FSD and KFSD: $\gamma = 0.05$, $B = 200$, $\alpha_T = \alpha$. Regarding the depths, we use the specifications reported for FBP.
3. FBG: as reported in Sect. 3, the central region is built considering the 50 % deepest bivariate robust [...]
outlying region by inflating by 2.58 times the central region.
4. FHD: $\beta = \alpha$.
5. KFSDsmo, KFSDtri and KFSDwei: $n_Y = n = 50$ (since $Y_{n_Y} = Y_n$), $\gamma = 0.05$, $\alpha_T = \alpha$ (only for KFSDtri), $n_Z = 6n$, $\delta = 0.05$, $r = \alpha$, desired FAP = 0.10. Moreover, as introduced in Sect. 2, for these methods we consider 9 percentiles to set $\sigma$ in KFSD. The way in which we propose to choose the most suitable percentile for outlier detection is presented below.

In supervised classification, the availability of training curves with known class memberships makes it possible to define some natural procedures to set $\sigma$ for KFSD, such as cross-validation. However, in an outlier detection problem, it is common to have no information on whether curves are normal or outliers. Therefore, training procedures are not immediately available.

We propose to overcome this drawback by obtaining a "training sample of peripheral curves", and then choosing the percentile that best ranks these peripheral curves as the final percentile for KFSD in KFSDsmo, KFSDtri and KFSDwei. We now describe this procedure, which is based on $J$ replications. Let $Y_n$ be the functional data set on which outlier detection has to be done and let $Y_{(n)} = \{y_{(1)}, \ldots, y_{(n)}\}$ be the depth-based ordered version of $Y_n$, where $y_{(1)}$ and $y_{(n)}$ are the curves with minimum and maximum depth, respectively. The steps to obtain a set of peripheral curves are the following:

I. Let $\{p_1, \ldots, p_K\}$ be the set of percentiles in use (in our case, as explained in Sect. 2, $p_k = (10k)$ %, $k \in \{1, \ldots, K = 9\}$), and choose randomly a percentile from the set. For the $j$th replication, $j \in \{1, \ldots, J\}$, denote the selected percentile as $p^{j}$. We use $J = 20$ in the rest of the paper.
II. Using $p^{j}$, compute $KFSD_{p^{j}}(y_i, Y_n)$, $i = 1, \ldots, n$, where the notation $KFSD_{p^{j}}(\cdot, \cdot)$ is used to describe which percentile is used. For the $j$th replication, denote the KFSD-based ordered curves as $y_{(1),j}, \ldots, y_{(n),j}$.
III. Take $y_{(1),j}, \ldots, y_{(l_j),j}$, where $l_j \sim \mathrm{Bin}(n, 1/n)$. Apply the smoothing step described in Sect. 3 to these curves. For the smoothing step, we use $\Sigma_{Y_n}$ and $\gamma = 0.05$. For the $j$th replication, denote the peripheral and smoothed curves as $y^{*}_{(1),j}, \ldots, y^{*}_{(l_j),j}$.
IV. Repeat steps I–III $J$ times to obtain a collection of $L = \sum_{j=1}^{J} l_j$ peripheral curves, say $Y_L$ (for an example, see Fig. 4).

Fig. 4 Example of a training sample of peripheral curves for a contaminated data set generated by MM1 with $\alpha = 0.05$. The solid and shaded curves are the original curves (both normal and outliers). The dashed curves are the peripheral curves to use as training sample

Next, $Y_L$ acts as training sample according to the following steps: for each $y^{*}_{(i),j} \in Y_L$ ($i \le l_j$) and $p_k \in \{p_1, \ldots, p_K\}$, compute $KFSD_{p_k}(y^{*}_{(i),j}, Y_{(-i),j})$, where $Y_{(-i),j} = Y_n \setminus \{y_{(i),j}\}$. At the end, an $L \times K$ matrix is obtained, say $D_{LK} = \{d_{lk}\}_{l = 1, \ldots, L;\ k = 1, \ldots, K}$, whose $k$th column is composed of the KFSD values of the $L$ training peripheral curves when the $k$th percentile is employed in KFSD. Next, let $r_{lk}$ be the rank of $d_{lk}$ in the vector $(KFSD_{p_k}(y_1, Y_n), \ldots, KFSD_{p_k}(y_n, Y_n), d_{lk})$, e.g., $r_{lk}$ is equal to 1 or $n + 1$ if $d_{lk}$ is the minimum or the maximum value in the vector, respectively. Let $R_{LK}$ be the result of this transformation of $D_{LK}$, and sum the elements of each column, obtaining a $K$-dimensional vector, say $R_K$. Since the goal is to assign ranks as low as possible to the peripheral curves, choose the percentile associated with the minimum value of $R_K$. When a tie is observed, we break it randomly.

The comparison among methods is performed in terms of both correct and false outlier detection percentages, which are reported in Tables 2, 3, 4, 5, 6 and 7. To ease the reading of the tables, for each model and $\alpha$, we report in bold the five best correct outlier detection percentages (c); in the presence of a tie, the method with the lower false outlier detection percentage (f) is preferred. For each model, if a method is among the five best ones for both contamination probabilities $\alpha$, we report its label in bold.

The results in Tables 2, 3, 4, 5, 6 and 7 show that:

1. KFSDtri and KFSDwei are always among the five best methods. KFSDsmo is among the five best methods 10 times out of 12, and when its performance is not among the five best, it is not far from the fifth method (MM2, $\alpha = 0.05$: 95.18 % against 96.79 %; MM3, $\alpha = 0.05$: 73.79 % against 78.63 %). The rest of the methods are among the five best procedures at most four times out of 12 (FBP + HMD and Btri + HMD).
2. Regarding MM5 and MM6, our procedures are clearly the best options in terms of correct detection (c), in the following order: KFSDwei, KFSDtri and KFSDsmo. In general, this pattern is observed throughout the simulation study. Note that for MM6 and $\alpha = 0.02$ we observe the best relative performances of KFSDsmo, KFSDtri and KFSDwei, i.e., 91.58, 93.68 and 96.84 %, respectively, against 71.58 % of the fourth best method (Bwei + KFSD), that is, we observe differences of at least 20 percentage points.
3. About MM3, KFSDwei is clearly the best method in terms of correct detection, however at the price of a greater false detection (f). This is in general the main weak point of KFSDsmo, KFSDtri and KFSDwei. As for correct detection, we observe an overall pattern in our methods in false detection, but in the opposite direction, which indicates a trade-off between c and f. Relatively high false detection percentages are however to be expected in KFSDsmo, KFSDtri and KFSDwei, since these methods are based on the definition of a desired false alarm probability, which is equal to 10 % in this study. Concerning MM2, we observe similar results to MM3, but in this case the performances of the best methods in terms of correct detection (KFSDsmo, KFSDtri, KFSDwei, FBP-based methods and Btri when used with local depths) are closer to each other.

Finally, there are only two cases in which a competitor outperforms all our methods, and it is FBG under MM1 and both $\alpha$. However, this procedure does not show a behavior as stable as that of KFSDsmo, KFSDtri and KFSDwei. Indeed, FBG shows poor performances under other models, e.g., MM2.

Table 2 MM1, $\alpha = \{0.02, 0.05\}$. Correct (c) and false (f) outlier detection percentages of FBP, Btri, Bwei, FBG, FHD, KFSDsmo, KFSDtri and KFSDwei

              α = 0.02          α = 0.05
              c        f        c        f
FBP + FMD     44.34    1.23     43.86    0.73
FBP + HMD     74.53    0.94     72.81    0.61
FBP + RTD     61.32    0.57     63.16    0.31
FBP + IDD     55.66    0.61     61.84    0.34
FBP + MBD     49.06    1.33     50.44    0.69
FBP + FSD     62.26    0.67     61.84    0.40
FBP + KFSD    66.04    0.86     74.12    0.44
Btri + FMD     0.00    0.98      0.00    1.82
Btri + HMD    66.98    1.45     57.89    1.47
Btri + RTD    10.38    1.78     14.91    1.76
Btri + IDD    10.38    1.55     11.84    1.74
Btri + MBD     0.00    0.51      0.00    1.49
Btri + FSD     2.83    0.76      5.26    1.17
Btri + KFSD   70.75    1.43     58.77    1.40
Bwei + FMD     0.00    1.29      0.00    1.49
Bwei + HMD    71.70    1.02     47.37    0.65
Bwei + RTD    13.21    2.04     13.60    1.78
Bwei + IDD    17.92    1.82     10.53    1.55
Bwei + MBD     0.00    1.08      0.00    1.40
Bwei + FSD     2.83    1.39      3.95    1.07
Bwei + KFSD   61.32    0.88     55.26    0.48
FBG          100.00    2.27     97.81    2.37
FHD           48.11    1.00     73.68    2.77
KFSDsmo       89.62    4.50     85.09    2.58
KFSDtri       89.62    4.92     92.11    4.40
KFSDwei       97.17    9.44     96.93    6.54

Table 3 MM2, $\alpha = \{0.02, 0.05\}$. Correct (c) and false (f) outlier detection percentages of FBP, Btri, Bwei, FBG, FHD, KFSDsmo, KFSDtri and KFSDwei

              α = 0.02          α = 0.05
              c        f        c        f
FBP + FMD     99.09    1.08     96.39    0.84
FBP + HMD     96.36    0.96     96.39    0.88
FBP + RTD     99.09    0.61     94.78    0.25
FBP + IDD     99.09    0.70     95.18    0.38
FBP + MBD     99.09    1.06     96.39    0.82
FBP + FSD     99.09    0.57     94.78    0.36
FBP + KFSD    98.18    0.63     93.98    0.36
Btri + FMD     0.00    1.06      0.00    1.96
Btri + HMD    95.45    1.51     96.79    1.68
Btri + RTD     1.82    1.92      6.83    2.61
Btri + IDD     5.45    1.60      7.63    1.94
Btri + MBD     0.00    0.98      0.40    2.10
Btri + FSD     4.55    1.06      5.22    1.62
Btri + KFSD   97.27    1.60     95.18    1.52
Bwei + FMD     0.00    1.27      0.00    1.52
Bwei + HMD    95.45    1.02     86.35    0.36
Bwei + RTD     5.45    2.21      8.43    2.84
Bwei + IDD     7.27    1.49      9.64    2.36
Bwei + MBD     0.00    1.27      0.40    1.49
Bwei + FSD     8.18    1.39      4.02    1.37
Bwei + KFSD   95.45    0.96     79.52    0.51
FBG            8.18    3.07      4.42    2.95
FHD            7.27    1.88     12.45    5.66
KFSDsmo      100.00    3.91     95.18    2.76
KFSDtri      100.00    5.19     97.99    4.84
KFSDwei      100.00    9.20     99.60    6.48

In summary, the above results and remarks show that the proposed KFSD-based procedures are the best methods in detecting outliers for the considered models. Moreover,
KFSDtri seems the most reasonable choice to balance the mentioned trade-off between c and f. In terms of correct detection, KFSDwei slightly outperforms KFSDtri, which however shows very good and stable performances when compared with the remaining methods. In terms of false detection, KFSDtri considerably improves on KFSDwei, especially under some models (e.g., see MM2).

Table 4 MM3, $\alpha = \{0.02, 0.05\}$. Correct (c) and false (f) outlier detection percentages of FBP, Btri, Bwei, FBG, FHD, KFSDsmo, KFSDtri and KFSDwei

              α = 0.02          α = 0.05
              c        f        c        f
FBP + FMD     65.69    0.92     49.19    0.97
FBP + HMD     89.22    0.57     85.89    0.63
FBP + RTD     86.27    0.45     76.61    0.34
FBP + IDD     79.41    0.51     70.56    0.38
FBP + MBD     74.51    0.88     59.27    0.84
FBP + FSD     79.41    0.51     73.79    0.42
FBP + KFSD    89.22    0.57     83.06    0.59
Btri + FMD     2.94    0.73      5.24    1.22
Btri + HMD    57.84    1.57     53.63    1.56
Btri + RTD    15.69    1.76     21.37    1.81
Btri + IDD    20.59    1.65     20.56    1.70
Btri + MBD     0.98    1.06      3.23    1.54
Btri + FSD    16.67    1.14     17.34    1.22
Btri + KFSD   57.84    1.63     49.19    1.52
Bwei + FMD     2.94    1.10      3.63    0.84
Bwei + HMD    60.78    1.25     42.74    0.76
Bwei + RTD    15.69    1.92     17.34    1.73
Bwei + IDD    23.53    1.33     14.52    1.22
Bwei + MBD     0.98    1.29      2.82    1.14
Bwei + FSD    15.69    1.16     12.10    0.84
Bwei + KFSD   56.86    1.12     41.53    0.67
FBG           86.27    2.65     78.63    1.73
FHD           49.02    1.02     65.73    2.88
KFSDsmo       89.22    3.90     73.79    2.95
KFSDtri       90.20    4.63     83.47    4.71
KFSDwei       97.06    8.96     90.32    6.50

Table 5 MM4, $\alpha = \{0.02, 0.05\}$. Correct (c) and false (f) outlier detection percentages of FBP, Btri, Bwei, FBG, FHD, KFSDsmo, KFSDtri and KFSDwei

              α = 0.02          α = 0.05
              c        f        c        f
FBP + FMD      1.02    0.00      0.00    0.00
FBP + HMD      6.12    0.00      1.60    0.02
FBP + RTD      0.00    0.00      0.00    0.00
FBP + IDD      0.00    0.00      0.00    0.00
FBP + MBD      0.00    0.00      0.00    0.00
FBP + FSD      0.00    0.00      0.00    0.00
FBP + KFSD     2.04    0.00      0.80    0.00
Btri + FMD    60.20    0.16     47.60    0.11
Btri + HMD    41.84    0.04     18.80    0.17
Btri + RTD    54.08    1.16     34.80    0.82
Btri + IDD    55.10    1.02     37.20    0.59
Btri + MBD    64.29    0.14     46.40    0.13
Btri + FSD    68.37    0.14     45.60    0.08
Btri + KFSD   58.16    0.20     28.00    0.13
Bwei + FMD    51.02    0.12     23.60    0.00
Bwei + HMD    38.78    0.06     10.80    0.02
Bwei + RTD    37.76    0.49     25.20    0.15
Bwei + IDD    43.88    0.67     28.00    0.42
Bwei + MBD    56.12    0.10     25.20    0.02
Bwei + FSD    63.27    0.06     29.20    0.00
Bwei + KFSD   58.16    0.12     21.20    0.00
FBG            9.18    0.53      6.80    1.09
FHD           51.02    1.02     37.60    4.34
KFSDsmo       87.76    2.16     50.00    1.24
KFSDtri       91.84    3.00     64.80    2.91
KFSDwei       95.92    5.08     62.00    3.35

In Fig. 5 we report a series of boxplots summarizing which percentiles have been selected in the training steps for KFSDsmo, KFSDtri and KFSDwei, and the following general remarks can be made. First, MM6 is the mixture model for which the lowest percentiles have been selected, and it is also a scenario in which our methods considerably outperform their competitors. The need for a more local approach for MM6 data may explain both facts. Second, lower and more local percentiles have been chosen for mixture models with nonlinear mean functions (MM4, MM5 and MM6) than for mixture models with linear mean functions (MM1, MM2 and MM3). Finally, the percentiles selected by means of the proposed training procedure seem to vary among data sets. However, except for MM3 and $\alpha = 0.02$, for at least half of the data sets a percentile not greater than the median has been chosen, which implies an approach that is at least moderately local.

5 Real data study: nitrogen oxides (NOx) data

Besides simulated data, we consider a real data set which consists of nitrogen oxides (NOx) emission level daily curves measured every hour close to an industrial area in Poblenou (Barcelona) and is available in the R package fda.usc (Febrero and Oviedo de la Fuente 2012). Outlier detection on this data set was first performed by Febrero et al. (2008), where the authors proposed Btri and Bwei.
Table 6 MM5, α = {0.02, 0.05}. Correct (c) and false (f) outlier detection
percentages of FBP, Btri, Bwei, FBG, FHD, KFSDsmo, KFSDtri and KFSDwei

               α = 0.02          α = 0.05
Method         c        f        c        f
FBP + FMD      55.56    0.00     54.00    0.00
FBP + HMD      66.67    0.00     68.40    0.04
FBP + RTD      57.58    0.00     54.40    0.00
FBP + IDD      52.53    0.00     56.00    0.00
FBP + MBD      55.56    0.00     55.20    0.00
FBP + FSD      55.56    0.00     55.60    0.00
FBP + KFSD     60.61    0.00     59.20    0.00
Btri + FMD     3.03     0.18     2.80     0.44
Btri + HMD     97.98    0.12     92.40    0.11
Btri + RTD     16.16    1.06     20.00    1.03
Btri + IDD     18.18    1.06     16.00    1.07
Btri + MBD     2.02     0.16     3.20     0.32
Btri + FSD     29.29    0.18     27.20    0.23
Btri + KFSD    93.94    0.24     92.40    0.21
Bwei + FMD     3.03     0.29     2.40     0.23
Bwei + HMD     93.94    0.08     73.60    0.00
Bwei + RTD     15.15    1.06     17.60    1.12
Bwei + IDD     25.25    0.98     20.00    0.99
Bwei + MBD     2.02     0.20     3.60     0.21
Bwei + FSD     29.29    0.14     21.60    0.13
Bwei + KFSD    83.84    0.08     72.00    0.04
FBG            0.00     1.02     0.40     0.04
FHD            4.04     1.96     12.80    5.64
KFSDsmo        98.99    1.82     94.00    0.44
KFSDtri        98.99    2.61     98.00    2.11
KFSDwei        100.00   4.61     98.40    2.11

Table 7 MM6, α = {0.02, 0.05}. Correct (c) and false (f) outlier detection
percentages of FBP, Btri, Bwei, FBG, FHD, KFSDsmo, KFSDtri and KFSDwei

               α = 0.02          α = 0.05
Method         c        f        c        f
FBP + FMD      48.42    0.00     44.19    0.00
FBP + HMD      60.00    0.18     62.92    0.00
FBP + RTD      55.79    0.00     54.68    0.00
FBP + IDD      46.32    0.00     40.07    0.00
FBP + MBD      48.42    0.00     45.69    0.00
FBP + FSD      52.63    0.00     52.43    0.00
FBP + KFSD     57.89    0.00     56.93    0.00
Btri + FMD     29.47    0.22     33.71    0.32
Btri + HMD     71.58    0.24     45.69    0.15
Btri + RTD     35.79    0.82     31.09    0.51
Btri + IDD     38.95    0.37     35.96    0.74
Btri + MBD     29.47    0.24     31.09    0.32
Btri + FSD     52.63    0.20     43.82    0.19
Btri + KFSD    71.58    0.22     50.56    0.21
Bwei + FMD     23.16    0.24     19.48    0.08
Bwei + HMD     68.42    0.12     35.96    0.00
Bwei + RTD     38.95    0.69     24.34    0.51
Bwei + IDD     33.68    0.59     25.09    0.40
Bwei + MBD     24.21    0.18     19.85    0.13
Bwei + FSD     47.37    0.16     27.72    0.08
Bwei + KFSD    66.32    0.12     44.19    0.06
FBG            17.89    0.02     14.98    0.06
FHD            52.63    1.02     61.80    2.85
KFSDsmo        91.58    2.08     71.16    0.95
KFSDtri        93.68    2.69     82.02    2.49
KFSDwei        96.84    4.69     83.15    2.75

We extend their study by considering more methods and depths.

NOx are one of the most important pollutants, and it is important to identify
outlying trajectories because these curves may compromise any statistical
analysis or be of special interest for further analysis and for implementing
environmental policy countermeasures. The NOx levels that we consider were
measured in µg/m³ every hour of every day for the period
23/02/2005–26/06/2005. All 24 measurements are available for only 115 days of
the period, and these are the days that compose the final NOx data set.
Moreover, following Febrero et al. (2008), since the NOx data set includes
working as well as nonworking days, it seems more appropriate to consider a
first sample of 76 working day curves (from now on, W) and a second sample of
39 nonworking day curves (from now on, NW). Both W and NW are shown in Fig. 6,
where it is possible to appreciate at least two facts that justify the split of
the original data set. First, the W curves have in general higher values than
NW curves, which can be explained by the greater activity of motor vehicles and
industries in a city like Barcelona during working days. Second, both data sets
contain curves with peaks, but for W curves the peaks occur roughly around
7-8 a.m. and on many days, whereas for NW curves the peaks occur later and on
few days, which again can be explained by the differences between Barcelona's
economic activity on working and nonworking days.

At first glance, each data set may contain outliers, especially partial
outliers in the form of abnormal peaks, and therefore a local depth approach by
means of KFSDsmo, KFSDtri and KFSDwei appears to be a good strategy to detect
outliers. Besides them, we do outlier detection with all the methods used in
Sect. 4. For all the procedures we use the same specifications as in Sect. 4,
and we assume α = 0.05. For each method, we report the labels of the
curves detected as outliers in Table 8 and we highlight these curves in Fig. 7.

[Fig. 5 Boxplots of the percentiles selected in the training steps, one group
per mixture model and α (MM1 to MM6; α = 0.02 and 0.05); axis label:
percentiles]

[Fig. 6 NOx data: working (top) and nonworking (bottom) day curves]

Table 8 NOx data, Working and Nonworking data sets. Curves detected as outliers

Method         Working                           Nonworking
FBP + RTD      37                                20
FBP + IDD      –                                 5, 7, 20
FBP + MBD      –                                 –
FBP + FSD      37                                –
FBP + KFSD     12, 16, 37                        5, 7, 20, 21
Btri + FMD     16, 37                            7
Btri + HMD     14, 16, 37                        7, 20
Btri + RTD     16                                7, 20
Btri + IDD     16, 37                            7, 20
Btri + MBD     16, 37                            7
Btri + FSD     14, 16, 37                        –
Btri + KFSD    12, 14, 16, 37                    7, 20
Bwei + FMD     16                                7, 20
Bwei + HMD     16, 37                            7, 20
Bwei + RTD     16                                –
Bwei + IDD     16, 37                            20
Bwei + MBD     16                                7
Bwei + FSD     16, 37                            –
Bwei + KFSD    16, 37                            7, 20
FBG            16, 37                            –
FHD            12, 14, 16, 37                    7, 20
KFSDsmo        14, 16, 37                        7, 20, 21
KFSDtri        12, 14, 16, 37                    7, 20, 21
KFSDwei        11, 12, 13, 14, 15, 16, 37, 38    7, 20, 21

Concerning W, most of the methods detect as an outlier day 37, the Friday at
the beginning of the long weekend due to Labor Day in 2005, whose curve shows a
partial outlying behavior before noon and at the end of the day. Another day
detected as an outlier by many methods is day 16, another Friday before a long
weekend (Easter holidays in 2005), whose curve has the highest morning peak. In
addition to curves 16 and 37, KFSDsmo detects as an outlier curve 14, as nine
other methods do, recognizing a seemingly outlying pattern in the early hours
of the day. Additionally, KFSDtri also includes day 12 among the outliers,
which may be atypical because of its behavior in the early afternoon. Note that
both days 12 and 14 are in the week before the above-mentioned Easter holidays.
Finally,
KFSDwei detects as outliers the greatest number of curves. This last result may
appear exaggerated, but all the curves that are outliers according to KFSDwei
seem to have some partial deviations from the majority of curves. For example,
day 13, whose curve is considered normal by the rest of the procedures, shows a
peak at the end of the day. Similar peaks can also be observed in other curves
detected as outliers by other methods (e.g., days 16 and 37), which means that
a masking effect may be occurring to day 13's detriment, and only KFSDwei
points out this possibly outlying feature of the curve. Regarding the training
step for KFSD to set σ, it selects the 70% percentile. Observing the first
graph of Fig. 6, it can be noticed that some curves have a likely outlying
behavior, and this may be the reason why a weakly local approach for KFSD may
be adequate enough.

[Fig. 7 Curves detected as outliers in Table 8: working (top) and nonworking
(bottom) days. Legend (working days): 11: WE, 09/03/2005; 12: FR, 11/03/2005;
13: TU, 15/03/2005; 14: WE, 16/03/2005; 15: TH, 17/03/2005; 16: FR, 18/03/2005;
37: FR, 29/04/2005; 38: MO, 02/05/2005]

In the case of NW, some methods detect no curves as outliers (e.g., all the
FSD-based methods), only three FBP-based methods flag day 5 as an outlier,
whereas days 7, 20 and 21 are detected as outliers by our methods as well as
others. Note that day 7 is the Saturday before Easter and days 20 and 21 are
the eve of Labor Day and Labor Day itself. Days 7 and 20, which have two peaks,
at the beginning and end of the day, are also flagged by twelve and eight other
methods, respectively, while day 21, which shows a single peak in the first
hours of the day, is considered atypical by only two other methods, which
happen to be local (FBP + HMD and FBP + KFSD). This last result may be
connected with what has been observed at the KFSD training step for selecting
the percentile, i.e., the selection of the 30% percentile. Therefore, KFSDsmo,
KFSDtri and KFSDwei work with a strongly local percentile, and their results
partially resemble the ones of the previously mentioned local techniques.

6 Conclusions

This paper proposes to tackle outlier detection in functional samples using the
kernelized functional spatial depth as a tool. In Theorem 1 we presented a
probabilistic result that allows us to set a KFSD-threshold to identify
outliers, but in practice it is necessary to observe two samples to apply
Theorem 1. To overcome this practical limitation, we proposed KFSDsmo, KFSDtri
and KFSDwei, which are methods that can be applied when a single functional
sample is available and are based on both a probabilistic approach and smoothed
resampling techniques.

We also proposed a new procedure to set the bandwidth σ of KFSD that is based
on obtaining training samples by means of smoothed resampling techniques. The
general idea behind this procedure can be applied to other functional depths or
methods with parameters that need to be set.

We investigated the performances of KFSDsmo, KFSDtri and KFSDwei by means of a
simulation study. We focused on challenging scenarios with low magnitude, shape
and partial outliers instead of high magnitude outliers. The results support
our proposals. Throughout the simulation study, KFSDsmo, KFSDtri and KFSDwei
attained the largest correct detection performances in most of the analyzed
setups, but in some cases they paid a price in terms of false detection.
However, KFSDsmo, KFSDtri and KFSDwei work with a given desired false alarm
probability, and therefore higher
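Then, if $E[X]$ exists, for any $s > 0$

\[
\Pr\left( X - E[X] \geq s \right) \leq \exp\left( -\frac{2 s^2}{\sum_{j=1}^{n} c_j^2} \right).
\]

In order to apply Lemma 1 to our problem, define

\[
X(z_1, \ldots, z_{n_Z}) = \frac{1}{n_Z} \sum_{i=1}^{n_Z} g(z_i, Y_{n_Y} \mid Y_{n_Y}), \tag{13}
\]

whose expected value is given by

\[
E[X] = E_{z_i \mid Y_{n_Y}} \left[ \frac{1}{n_Z} \sum_{i=1}^{n_Z} g(z_i, Y_{n_Y} \mid Y_{n_Y}) \right]
     = E_{z_1 \mid Y_{n_Y}} \left[ g(z_1, Y_{n_Y} \mid Y_{n_Y}) \right]. \tag{14}
\]

Then, since $E_{(z_1 \sim Y_{nor}) \mid Y_{n_Y}}[g(z_1, Y_{n_Y})] = E_{x \mid Y_{n_Y}}[g(x, Y_{n_Y})]$, for $\alpha > 0$,

\[
E_{x \mid Y_{n_Y}}[g(x, Y_{n_Y})] \leq \frac{1}{1 - \alpha} \, E_{(z_1 \sim Y_{mix}) \mid Y_{n_Y}}[g(z_1, Y_{n_Y})]. \tag{16}
\]

Consequently, combining (15) and (16), and for $r \geq \alpha$, we obtain

\[
\Pr\left( E_{x \mid Y_{n_Y}}[g(x, Y_{n_Y})] \leq \frac{1}{1 - r} \left[ \frac{1}{n_Z} \sum_{i=1}^{n_Z} g(z_i, Y_{n_Y}) + \sqrt{\frac{\ln(1/\delta)}{2 n_Z}} \right] \right) \geq 1 - \delta,
\]

which completes the proof. □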
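
To make the final inequality concrete, the following minimal Python sketch
evaluates its right-hand side, that is, the upper bound that holds with
probability at least 1 − δ. It assumes that the values g(z_i, Y_nY) have
already been computed and lie in [0, 1]; the function name
`expectation_upper_bound` and its arguments are hypothetical and are not taken
from the paper.

```python
import math


def expectation_upper_bound(g_values, r, delta):
    """Evaluate (1 / (1 - r)) * (mean(g) + sqrt(ln(1 / delta) / (2 * n_Z))).

    g_values: the n_Z values g(z_i, Y_nY), assumed to lie in [0, 1].
    r: constant with alpha <= r < 1.
    delta: the bound holds with probability at least 1 - delta.
    """
    n_z = len(g_values)
    if n_z == 0:
        raise ValueError("g_values must not be empty")
    if not 0.0 <= r < 1.0:
        raise ValueError("r must lie in [0, 1)")
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")

    empirical_mean = sum(g_values) / n_z
    # Deviation term of the bound; it shrinks at rate 1 / sqrt(n_Z).
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n_z))
    return (empirical_mean + slack) / (1.0 - r)


if __name__ == "__main__":
    # Illustrative input only: 200 indicator-like values, 10 of them equal to 1.
    g = [0.0] * 190 + [1.0] * 10
    print(expectation_upper_bound(g, r=0.05, delta=0.05))
```

The deviation term depends only on n_Z and δ, so a larger sample tightens the
bound, while a larger r loosens it through the factor 1/(1 − r).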
size curves in heterogeneous aquifers. Stoch Environ Res Risk Assess 28:1835–1851
Ramsay JO, Silverman BW (2005) Functional data analysis. Springer, New York
Ruiz-Medina MD, Espejo RM (2012) Spatial autoregressive functional plug-in prediction of ocean surface temperature. Stoch Environ Res Risk Assess 26:335–344
Sguera C, Galeano P, Lillo R (2014) Spatial depth-based classification for functional data. Test 23:725–750
Silverman BW (1986) Density estimation for statistics and data analysis. Chapman and Hall, London
Sun Y, Genton MG (2011) Functional boxplots. J Comput Graph Stat 20:316–334
Tukey JW (1975) Mathematics and the picturing of data. Proc Int Congr Math 2:523–531