
Article in Communications in Statistics - Simulation and Computation · November 2013
DOI: 10.1080/03610918.2013.861628



ROBUST PRINCIPAL COMPONENT FUNCTIONAL LOGISTIC REGRESSION

MELODY DENHERE 1 AND NEDRET BILLOR

Department of Mathematics and Statistics,


Auburn University, 221 Parker Hall
Auburn AL 36849 USA

Abstract. In this paper, we discuss the estimation of the parameter function for a functional
logistic regression model in the presence of outliers. We consider ways to make the parameter
estimator resistant to outliers, in addition to minimizing multicollinearity and reducing the
high dimensionality inherent in functional data. To achieve this, the functional covariates
and functional parameter of the model are approximated in a finite dimensional space generated by
an appropriate basis. This approach reduces the functional model to a standard multiple logistic
model with highly collinear covariates and potential high dimensionality issues. The proposed
estimator tackles these issues and also minimizes the effect of functional outliers. Results from a
simulation study and a real world example are presented to illustrate the performance of the
proposed estimator.

Key Words: functional logistic regression, robust methods, functional data, outliers.

1. Introduction

In recent years, a substantial amount of attention has been drawn to functional data
analysis, resulting in the development and generalization of many statistical techniques to this type
of data. Much work has been devoted to this field, with publications on functional regression models
from James (16), Cardot et al. (4), Müller and Stadtmüller (19), Escabias et al. (5), and
Ferraty and Vieu (8), to name just a few. Ramsay and Silverman (22) present functional data ideas
and statistical methods in their book, which gave impetus to the functional data analysis community.
Interest has come from a broad spectrum of fields such as biometrics, genetics, e-commerce and

E-mail address: mbd0002@auburn.edu.

1 To whom correspondence should be addressed.
computer science. Statistical tools, models and methods whose strength lies in recognizing this
structural aspect of the data have been discussed, ranging from functional linear regression and
functional ANOVA to functional principal component analysis and functional outlier detection. In
particular, functional regression methods have prompted a re-examination of some of the ways used
to analyse longitudinal data.
Different approaches have been developed for estimating the functional parameters
of the functional logistic model. Escabias et al. (5) discuss two approaches to parameter
estimation that employ principal component estimation. James (16) and Müller and Stadtmüller (19)
discuss the generalized functional linear model and consider estimation methods for its parameters.
Goldsmith et al. (12) and Ogden and Reiss (20) discuss penalized estimation techniques
for the generalized functional linear model that penalize roughness. These estimation techniques,
however, are not resistant to outliers, so robust estimation methods are needed.
We extend the principal component expansion ideas of Escabias et al. (5) so that the effect of
outlying curves on the estimation of the functional parameter is minimized.
We focus on functional logistic regression in particular because several interesting problems
in functional data analysis (FDA) can benefit from this model. In addition to modeling
functional covariates with binary responses, functional logistic regression is also useful for
classification. An increasing amount of research focuses on functional logistic regression
and its application to functional data. Ratcliffe et al. (24), Reiss et al. (25), Escabias et al. (6),
Leng and Müller (18) and Tian (27) are examples of work that apply
the functional logistic model in a variety of areas. Inherent in most functional data, and in the
estimation techniques discussed in this work, is the fact that we are dealing with highly correlated
data, which poses problems in parameter estimation and can result in estimates that are far
from accurate. The presence of functional observations that deviate from the overall pattern of
the data creates additional inefficiencies in many procedures. In this paper, we present a robust
estimation approach for functional data with a binary response and functional covariates that
addresses these inadequacies. To our knowledge, no previous work addresses robust estimation
for the functional logistic regression model, although robust methods have been proposed for the
functional linear regression model and functional principal component analysis.
Boente and Fraiman (3), Gervini (10; 11), Bali et al. (2) and Sawant et al. (26) are examples of
such work. This paper differs from these in addressing robust methods for the functional logistic
regression model, for which little discussion exists. However, functional outliers are inevitable,
and it is therefore important that robust techniques for this model be developed.
The first step taken in this paper to develop robust estimators is to reduce the
functional logistic model to a multiple logistic one by approximating the functional covariates as
linear combinations of an appropriate basis, as discussed in Ramsay and Silverman (22). This
representation of the functional data results in collinear and (potentially) high dimensional
data, so an initial focus is the elimination of these issues. Some approaches in the literature
eliminate them by way of principal component estimation or ridge estimation. However,
in the presence of atypical curves, these estimates are unstable and inaccurate. Febrero et al.
(7) define a functional outlier as a curve generated by a stochastic process with
a different distribution from that of the remaining curves, which are assumed to be identically
distributed. This definition is deliberately broad: it covers curves that are far away
from most of the curves, curves that have a different pattern from the rest, and even curves
that are atypical only in some sub-interval of the period of interest. This makes it necessary to
develop robust estimators, for which we consider robust estimation techniques using principal
component estimation.
This article is organized as follows. In Section 2, we provide the formulation of functional
logistic regression based on a multivariate approach and the details of the robust approach that we
propose for parameter estimation in the functional logistic model. In Section 3, a detailed
simulation study is provided which compares the proposed robust approach with some of the available
methods for this model. We also apply our method to the Canadian weather data to illustrate its
robustness properties. The last section concludes with an overall discussion.

2. Robust Principal Component Estimation

2.1. Functional Logistic Regression Model.

We consider functional covariates Xi(t), i = 1, 2, ..., n, where t ∈ T and T is the support
of the covariates, and a random sample of a binary response variable Y with Yi ∈ {0, 1},
i = 1, 2, ..., n. Then the random variable Y is such that,

Yi ∼ Bernoulli(πi )

where πi is the probability of a positive response given Xi(t), which is given as,

πi = P(Y = 1 | Xi(t) : t ∈ T)
   = exp{β0 + ∫_T Xi(t)β(t) dt} / [1 + exp{β0 + ∫_T Xi(t)β(t) dt}],   i = 1, ..., n,

with β0 being a real parameter; β(t) a smooth function of t, of which both are unknown parameters.
The logit transformation is,

li = log{πi / (1 − πi)}
   = β0 + ∫_T Xi(t)β(t) dt,   i = 1, ..., n.   (2.1)

In practice, we do not actually observe the functional covariates Xi(t); instead we have discrete
observations made at a finite set of time points, i.e. Xi(tik), i = 1, 2, ..., n; k = 0, 1, ..., ni.
It could therefore be suggested that the functional logistic model be considered as a summation
over the observed time points. However, this raises problems, including the need to fit a
potentially high dimensional matrix of coefficients, and possible inconsistencies in handling
observations measured at different time points. In essence, estimation of the parameters cannot
be achieved by the usual method of maximum likelihood, which calls for a different approach for
these functional data.

2.2. Estimation of X(t) and β(t).

We consider the functional covariates, Xi(t), and the functional parameter, β(t), as belonging to
finite-dimensional spaces generated by (not necessarily the same) bases. We consider
Xi(t) ∈ L²(T), the space of square integrable functions, with the inner product,

⟨f, g⟩ = ∫_T f(t)g(t) dt,   ∀ f, g ∈ L²(T),
such that

Xi(t) = Σ_{j=1}^{KX} cij φj(t),   (2.2)

where φj(t), j = 1, ..., KX, is an appropriate basis, selected to reflect the characteristics of the
data. It is important to note that the truncation lag, KX, is a parameter selected based on the
features and characteristics of the data. It determines the dimension of the expansion; the larger
KX is, the better the fit to the data. However, a large KX can lead to overfitting, thereby
capturing noise or variation in the data that should be ignored. A smaller KX, on the other hand,
whilst desirable, should not be so small that important features of the data are overlooked.
The selection of the basis system is an important aspect of functional data analysis, in that the
features evident in the observed data should be adequately captured by the chosen basis system.
A good basis selection potentially results in a smaller KX, which means less computational time in
estimating the functional covariates, and ensures that the {cij} serve as interesting descriptors
of the given data. In general, Fourier basis functions are used to model periodic data and the
B-spline basis is used for non-periodic data. Other basis choices such as wavelets, trigonometric
functions or even polynomial functions can also be used where appropriate. Thus, with the
assumption that Xi(t) belongs to the space of square integrable functions, we are able to
reconstruct the functional form of the functional covariates from the observed discrete points
using two different approaches.
In the event that the functional covariate is observed with error, the observation for the ith
subject at the kth replication is

Xik = Xi(tik) + εik,   k = 0, 1, ..., ni.

In this case, where the functional covariate is observed with some noise, we use a least squares
approximation to obtain the functional form of the covariates by estimating the basis
coefficients {cij} from the discrete observations. Alternatively, if the functional covariate is
observed without error, then the ith subject at the kth replication is

Xik = Xi(tik),   k = 0, 1, ..., ni.

In this case, an interpolation method such as the natural cubic spline can be used to obtain the
functional form of the predictors. In either case, we are able to approximate the functional form
of the sample curves by smoothing or interpolating.
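To make the smoothing step concrete, here is a minimal least squares sketch in Python (the paper's own computations use R's fda package); the Fourier basis, function names and example curve below are illustrative choices, not taken from the paper.

```python
import numpy as np

def fourier_basis(t, K, period=10.0):
    """Evaluate K Fourier basis functions at the time points t.
    Columns: 1, sin(2*pi*t/P), cos(2*pi*t/P), sin(4*pi*t/P), ..."""
    t = np.asarray(t, dtype=float)
    B = np.ones((t.size, K))
    for j in range(1, K):
        r = (j + 1) // 2                       # harmonic number
        arg = 2.0 * np.pi * r * t / period
        B[:, j] = np.sin(arg) if j % 2 == 1 else np.cos(arg)
    return B

def smooth_curve(t_obs, x_obs, K=5):
    """Least squares estimate of the basis coefficients {cij}
    of one curve from its noisy discrete observations."""
    B = fourier_basis(t_obs, K)
    coef, *_ = np.linalg.lstsq(B, x_obs, rcond=None)
    return coef

# Example: a smooth curve observed with noise at 21 points on [0, 10]
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 21)
x_true = 2.0 + np.sin(2 * np.pi * t / 10)
x_obs = x_true + rng.normal(0, 0.1, t.size)
c = smooth_curve(t, x_obs, K=5)
x_hat = fourier_basis(t, 5) @ c                # reconstructed functional form
```

With a basis that matches the curve's structure, the fitted curve lies close to the noise-free truth even with few coefficients.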
We also define the functional parameter,

β(t) = Σ_{k=1}^{Kb} bk ϕk(t),   (2.3)

where ϕk(t), k = 1, ..., Kb, is a basis function, and KX ≥ Kb.


Since estimates for the basis coefficients {cij} can be found either by smoothing or interpolation,
using the basis expansions of the covariates and the parameter function as defined in (2.2) and
(2.3) in the regression model (2.1), the functional model becomes a standard multiple one,

li = β0 + ∫_T Xi(t)β(t) dt
   = β0 + Σ_{j=1}^{KX} Σ_{k=1}^{Kb} cij ψjk bk,

where ψjk = ∫_T φj(t)ϕk(t) dt, j = 1, ..., KX, k = 1, ..., Kb; cij is the basis coefficient; β0 is an
unknown real parameter; and bk is the unknown basis coefficient used to estimate the parameter
function β(t). In matrix form, this can be written as,

L = β0 1 + Cψb,   (2.4)

where L = (l1, ..., ln)′, C = {cij}n×KX, ψ = {ψjk}KX×Kb, 1 = (1, ..., 1)′ and b = (b1, ..., bKb)′.
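Numerically, the cross-product matrix ψ and the design matrix A = Cψ can be approximated by quadrature on a fine grid. A small Python sketch under those assumptions (the basis functions, names and coefficient values are hypothetical):

```python
import numpy as np

def cross_product_matrix(phi_funcs, varphi_funcs, T=(0.0, 10.0), n_grid=2001):
    """Approximate psi[j, k] = integral over T of phi_j(t) * varphi_k(t) dt
    with trapezoidal weights on a uniform grid."""
    t = np.linspace(T[0], T[1], n_grid)
    dt = t[1] - t[0]
    w = np.full(n_grid, dt)
    w[0] = w[-1] = dt / 2.0                    # trapezoid end-point weights
    Phi = np.column_stack([f(t) for f in phi_funcs])      # n_grid x K_X
    Vphi = np.column_stack([g(t) for g in varphi_funcs])  # n_grid x K_b
    return Phi.T @ (Vphi * w[:, None])                    # K_X x K_b

# Example with the same small Fourier basis for covariates and parameter
phi = [lambda t: np.ones_like(t),
       lambda t: np.sin(2 * np.pi * t / 10),
       lambda t: np.cos(2 * np.pi * t / 10)]
psi = cross_product_matrix(phi, phi)
C = np.array([[1.0, 0.5, -0.2],                # basis coefficients of two curves
              [0.8, -0.3, 0.1]])
A = C @ psi                                    # design matrix of the multiple model
```

For an orthogonal basis such as this one, ψ is diagonal; with non-orthogonal bases (e.g. B-splines against a Fourier parameter basis) the off-diagonal entries are generally nonzero.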
We note that the estimate of the parameter function obtained via maximum likelihood is not very
accurate in the presence of highly correlated data. In fact, due to the formulation of the design
matrix in (2.4), there is high correlation among its columns. Thus, multicollinearity must be
eliminated in order to obtain more reliable estimates of the parameter function. One such approach,
discussed by Escabias et al. (5), is principal component estimation. Another widely adopted
approach in the standard logistic model is penalized maximum likelihood estimation, which includes
the ridge estimator of Le Cessie and van Houwelingen (17). We develop the principal component
estimation technique further to cater for cases where there are functional outliers in the data.
Because principal component estimation makes use of the eigendecomposition of the covariance matrix
of the design matrix, Cψ, the presence of outliers will greatly influence the PCs. This sensitivity
to outliers results in the first few PCs being attracted towards the outliers, so this approach
may miss the main modes of variability of the remaining observations.

2.3. Robust Principal Component Estimation.

Our proposed approach uses robust PC estimation techniques that eliminate multicollinearity
and reduces the effect of functional outliers, resulting in a more accurate estimator in the presence
of outliers. We use robust PCA methods on the covariate matrix to obtain robust PCs which are
used as the covariate matrix in the standard multiple logistic model. Robust Principal Component
Analysis, ROBPCA, Hubert et al. (15), is one such approach which uses projection pursuit and
robust covariance estimation based on the Minimum Covariance Determinant (MCD) method which
is based on seeking an h-subset of observations whose classical covariance matrix has the smallest
determinant. The three basic steps in ROBPCA are given in the following algorithm:

Input: Data matrix Cn×K where n is the number of observations and K represents the
initial number of variables.
Output: Robust PC scores Zn×p where p < K is the number of eigenvectors retained
(1) A singular value decomposition (SVD) of the data is performed so as to project the
observations onto the space spanned by themselves. This step is especially useful when K ≥ n,
as it yields a huge dimension reduction.
(2) A measure of outlyingness is defined for each data point. This is achieved by projecting
all the points onto many univariate directions, v, and then determining the standardized
distance of each projected point to the center of the data. The h (< n) least outlying data
points are determined and retained, where outlyingness is defined as,

Out(ci) = max_v |ci′v − μ̂r| / σ̂r,   i = 1, ..., n,

where μ̂r and σ̂r are the univariate MCD-based location and scale estimates for the projected
data points, ci′v, respectively. The h data points are projected onto the subspace spanned by
the first p eigenvectors of the sample covariance matrix of the h-subset.

(3) The covariance matrix of the mean-centered data matrix, Cn×K, obtained in the second
step is robustly estimated using the MCD estimator, and PCA is applied to it.
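The outlyingness step can be sketched in Python as below. This is a simplified stand-in rather than the actual ROBPCA implementation: random projection directions are used, and the median and MAD replace the univariate MCD location and scale estimates (the paper's computations use R's rrcov package).

```python
import numpy as np

def outlyingness(C, n_dirs=500, seed=0):
    """Projection-based outlyingness of each row of C: max over random
    directions v of |c_i'v - location| / scale. Median and MAD are used
    here as simple robust location/scale estimates."""
    rng = np.random.default_rng(seed)
    n, K = C.shape
    V = rng.normal(size=(n_dirs, K))
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit directions
    proj = C @ V.T                                  # n x n_dirs projections
    med = np.median(proj, axis=0)
    mad = np.median(np.abs(proj - med), axis=0) + 1e-12
    return np.max(np.abs(proj - med) / mad, axis=1)

def robust_pc_scores(C, h=None, p=2):
    """Keep the h least outlying rows, apply PCA to their covariance
    matrix, and score all observations on the retained eigenvectors."""
    n, K = C.shape
    h = h or int(0.75 * n)
    keep = np.argsort(outlyingness(C))[:h]
    mu = C[keep].mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(C[keep], rowvar=False))
    V = eigvecs[:, np.argsort(eigvals)[::-1][:p]]   # top-p loadings
    return (C - mu) @ V                             # robust PC scores

# Example: 3 of 40 observations are shifted far from the bulk
rng = np.random.default_rng(1)
C = rng.normal(size=(40, 4))
C[:3] += 50.0
out = outlyingness(C)
Z = robust_pc_scores(C, p=2)
```

Because the PCs are computed from the least outlying subset, the shifted curves cannot pull the first eigenvectors towards themselves.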
We consider the functional logistic regression model as defined before in (2.1), where the
functional covariate is defined as in (2.2), resulting in the standard multiple logistic regression
(2.4). We let Z(r) = {ξij}n×KX be the matrix of robust PCs of the design matrix, such that

Z(r) = AV(r),

where A = Cψ is the design matrix and V(r) is a KX × KX matrix whose columns are the eigenvectors
associated with the eigenvalues of the robust covariance estimate, based on the MCD, of A. The
logit transformation of the functional logistic model becomes,

L(r)(p) = β0(p) 1 + Z(r)(p) γ(p),   (2.5)

where γ = V(r)′b and p is the number of retained PCs in the model.
The design matrix is now free of collinear columns and, because of the robust approach
taken to compute the PCs, the effect of outlying curves is minimized. Therefore, the estimate of
the functional parameter is given by,

β̂(t) = b̂′ϕ(t),

where b̂ = V(r) γ̂.
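Putting the pieces together, a sketch of the robust PC logistic fit and the back-transformation b̂ = V(r)γ̂ might look as follows; the Newton-Raphson fit and all names are illustrative, and V_r stands for any matrix whose columns are (robust) eigenvectors of the design matrix's covariance:

```python
import numpy as np

def robust_pc_logistic(A, y, V_r, p=3, n_iter=50):
    """Fit the standard logistic model on p PC scores Z = A V_r[:, :p],
    then back-transform gamma-hat to the basis coefficients b-hat."""
    Z = A @ V_r[:, :p]                              # PC scores
    Xd = np.column_stack([np.ones(len(y)), Z])
    beta = np.zeros(p + 1)
    for _ in range(n_iter):                         # Newton-Raphson for the MLE
        pr = 1.0 / (1.0 + np.exp(-Xd @ beta))
        W = pr * (1.0 - pr)
        H = (Xd * W[:, None]).T @ Xd + 1e-8 * np.eye(p + 1)
        beta += np.linalg.solve(H, Xd.T @ (y - pr))
    beta0_hat, gamma_hat = beta[0], beta[1:]
    b_hat = V_r[:, :p] @ gamma_hat                  # coefficients of the beta(t) estimate
    return beta0_hat, b_hat

# Example with classical eigenvectors standing in for the robust ones
rng = np.random.default_rng(3)
A = rng.normal(size=(60, 4))
V_r = np.linalg.eigh(np.cov(A, rowvar=False))[1][:, ::-1]
y = (rng.random(60) < 1.0 / (1.0 + np.exp(-A[:, 0]))).astype(float)
beta0_hat, b_hat = robust_pc_logistic(A, y, V_r, p=3)
```

Regressing on PC scores and mapping γ̂ back through the eigenvectors is what removes the collinearity while still yielding an estimate in the original basis-coefficient space.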


There are different criteria for deciding which PCs should be included in the model. The
natural order is to include PCs based on explained variability. There are other criteria available,
as discussed by Hocking (13) and Müller and Stadtmüller (19), which take into consideration the
predictive ability of the PCs. In the simulation study carried out in the next section, the simple
natural order of explained variability is used to decide which PCs to include in the model.
Another issue in model selection is the decision concerning the number of PCs to include in the
model. Some of the measures that can be used include the integrated mean squared error of the
beta parameter function (IMSEB). This is defined as,

IMSEB(p) = (1/T) ∫_T (β(t) − β̂(p)(t))² dt,

where β̂(p)(t) is the estimated parameter function for the logistic model with p PCs.

Another available measure is the mean squared error of the beta parameters (MSEB), which is
defined as,

MSEB(p) = [1/(Kb + 1)] ( (β0 − β̂0(p))² + Σ_{k=1}^{Kb} (bk − b̂k(p))² ),

where Kb is the number of basis functions, β̂0(p) is the estimated intercept in the model with p
PCs and b̂k(p) is the estimated parameter in the standard multiple logistic model with p PCs. The
optimal model is selected as that model whose IMSEB or MSEB is smallest. In the simulation
study, we use the MSEB to determine p, the number of PCs to include in the model. The PCs are
then added based on the explained variability of each PC, starting with the largest variability until
the optimal number of PCs is attained.
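Given the true and estimated coefficients, the MSEB is straightforward to compute; a small sketch with hypothetical coefficient values:

```python
import numpy as np

def mseb(beta0_true, beta0_hat, b_true, b_hat):
    """MSEB: average squared error of the intercept and the Kb basis
    coefficients of the parameter function."""
    b_true = np.asarray(b_true)
    b_hat = np.asarray(b_hat)
    Kb = b_true.size
    return ((beta0_true - beta0_hat) ** 2
            + np.sum((b_true - b_hat) ** 2)) / (Kb + 1)

# Hypothetical true vs estimated coefficients
b_true = np.array([1.0, -0.5, 0.25])
b_hat = np.array([0.9, -0.4, 0.30])
val = mseb(0.0, 0.1, b_true, b_hat)
```

Note that MSEB measures error in the coefficient space, whereas IMSEB measures it in the function space; in a simulation, where the true b is known, MSEB is the cheaper of the two.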
However, in the case of real data, neither of these measures can be computed, so more practical
approaches are required. Escabias et al. (5) suggest using the estimated variance of the estimated
parameters, which is defined as,

var̂(β̂(p)) = V(p) (Z(p)′ W(p) Z(p))⁻¹ V(p)′,

where W(p) = diag(π̂(p)(1 − π̂(p))), Z(p) is the matrix of p robust PC scores from the design matrix
and V(p) is the matrix whose columns are the eigenvectors associated with the eigenvalues of the
robust covariance estimate, based on the MCD, of the design matrix. The optimum number of
PCs is selected by plotting the estimated variance against the number of PCs, p, and choosing the p
just before a significant increase in the estimated variance. Where several such points exist, the
smallest p is selected. The percent of variance explained (PVE) is an alternative criterion for
determining the number of PCs to retain in the model; alternatively, PCs with eigenvalues greater
than 1.0 can be retained. In both instances, the idea is to ensure that much of the variability in
the model is retained.
Another method is cross validation (CV). Cross validation involves partitioning the
data into two sets: the first, known as the training set, is used to determine a predictive model,
whilst the second, known as the test set, is used to validate the predictive model. The
leave-one-out cross validation method leaves out one observation and fits the model with the
remaining n − 1 observations. A prediction is then made for the left-out observation using this
model, and this procedure is repeated for all the observations. For the logistic model, this is
defined as,

CV(p) = (1/n) Σ_{i=1}^{n} (yi − π̂i,−i(p))²,

where π̂i,−i denotes the predicted response with observation i excluded from the predictive model.
The optimal number of PCs is that which minimizes CV. The Information Criterion
(IC) method is another alternative. An IC can be viewed as a compromise between the goodness
of fit and the complexity of the model. The Akaike Information Criterion (AIC) and Bayesian
Information Criterion (BIC) are two widely used criteria: to select the optimal number of PCs,
the IC is computed for varying values of p and the optimal p is chosen at the minimum.
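The leave-one-out CV score can be sketched as follows, with a basic Newton-Raphson logistic fit standing in for the model-fitting step (the data and function names are illustrative):

```python
import numpy as np

def fit_logistic(X, y, n_iter=50):
    """Logistic regression (intercept plus columns of X) by Newton-Raphson."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        H = (Xd * (p * (1 - p))[:, None]).T @ Xd + 1e-8 * np.eye(Xd.shape[1])
        beta += np.linalg.solve(H, Xd.T @ (y - p))
    return beta

def loo_cv(X, y):
    """Leave-one-out CV: (1/n) * sum of squared (y_i - pi-hat with i left out)."""
    n = len(y)
    err = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        beta = fit_logistic(X[mask], y[mask])
        eta = beta[0] + X[i] @ beta[1:]
        err += (y[i] - 1.0 / (1.0 + np.exp(-eta))) ** 2
    return err / n

# Example on simulated PC scores
rng = np.random.default_rng(2)
Xs = rng.normal(size=(30, 2))
ys = (rng.random(30) < 1.0 / (1.0 + np.exp(-0.5 * Xs[:, 0]))).astype(float)
cv_score = loo_cv(Xs, ys)
```

In the PC-selection setting, `loo_cv` would be evaluated on the first p columns of the score matrix for each candidate p, and the p with smallest score retained.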

3. Numerical Examples

In this section, we study the performance of our proposed estimation approach by way of a
simulation study as well as applying the methodology to the Canadian Weather data. We show
the improved accuracy in the parameter function estimation in the case where outliers are present.

3.1. Simulation Study.

The following steps were carried out in this simulation study to investigate how the proposed
estimation technique performs relative to other existing approaches for estimating the parameter
function of the functional logistic regression model.
The first step was to generate the data, as follows:

(a) Generate n = 50 sample curves for the functional logistic model

We generate 50 sample functional observations of a known stochastic process X(·), considered
over the interval [0, 10] with 21 equally spaced time points. We define this process as,

Xi(t) = ai1 + ai2 t + Wi(t),

Wi(t) = Σ_{r=1}^{10} [ bi1 sin(2πrt/10) + bi2 cos(2πrt/10) ],
Figure 1. 50 sample paths generated from the stochastic process X(t) and differentiated for each class of the response

where ai1 ∼ U[1, 4] or ai1 ∼ U[2, 4], ai2 ∼ N[1, 0.2] or ai2 ∼ N[1, 0.6], and bi1, bi2 ∼ N[0, 1/r²].
This model is illustrated in Figure 1. Since the response is binary, the sample curves are
differentiated between the two classes. To obtain the functional form of these sample curves, we
consider them to belong to the finite-dimensional space generated by some basis, where φ(t) is a
cubic B-spline basis defined on equally spaced knots on the [0, 10] interval.
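The data-generating process in step (a) can be sketched as below; drawing the b coefficients independently for each harmonic r is our reading of the N[0, 1/r²] specification, and the function name and seed handling are ours:

```python
import numpy as np

def generate_curves(n=50, n_pts=21, seed=0, a1_range=(1.0, 4.0)):
    """Sample paths X_i(t) = a_i1 + a_i2*t + W_i(t) on [0, 10]."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 10, n_pts)
    X = np.empty((n, n_pts))
    for i in range(n):
        a1 = rng.uniform(*a1_range)
        a2 = rng.normal(1.0, 0.2)
        W = np.zeros(n_pts)
        for r in range(1, 11):
            b1, b2 = rng.normal(0.0, 1.0 / r**2, size=2)   # b ~ N[0, 1/r^2]
            W += (b1 * np.sin(2 * np.pi * r * t / 10)
                  + b2 * np.cos(2 * np.pi * r * t / 10))
        X[i] = a1 + a2 * t + W
    return t, X

t, X = generate_curves()
```

Changing `a1_range` (and the scale of a2) is how the two response classes are given differentiated sample paths.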
We used the generalized cross-validation (GCV) method to determine the number of basis
functions to use as well as to determine the smoothing parameter. The criterion is expressed as,

GCV(λ) = ( n / (n − df(λ)) ) ( SSE / (n − df(λ)) ),

where df(λ) is the equivalent degrees of freedom; SSE is the residual sum of squares; and λ is
the smoothing parameter that governs the trade-off between fit to the data and smoothness. The
minimization of GCV with respect to λ is achieved by a grid search over a set of values of λ.
(b) The natural cubic spline interpolation of the parameter function sin(t + π/4) is selected. The
basis coefficients b = (b1, ..., bKb)′ of the parameter function are known and are used to assess
the estimation techniques presented in this paper.
(c) The probabilities of a positive response (Y = 1) given Xi(t) are,

πi = exp{li} / (1 + exp{li}),

where li is as defined in (2.1), β0 is fixed at 0 and i = 1, ..., n. The n values of the response are
obtained by simulating observations from a Bernoulli distribution with probabilities πi.

The second step involves contamination of the simulated data. We adopted the contamination
process discussed by Fraiman and Muniz (9):

Model 0: No Contamination. X(t) = a1 + a2 t + W(t), the generated data as discussed before.
Model 1: Asymmetric Contamination. Z(t) = X(t) + cM, where c is 1 with probability q
and 0 with probability 1 − q, q ∈ {0%, 5%, 10%, 15%, 20%}; M is the contamination
size constant, taking the value 25; and X(t) is as defined in Model 0.
Model 2: Symmetric Contamination. Z(t) = X(t) + cσM, where X(t), c and M are as
defined before and σ is a sequence of random variables, independent of c, that takes the
values 1 and −1 each with probability 0.5.
Model 3: Partial Contamination. Z(t) = X(t) + cσM if t > T and Z(t) = X(t) if t < T,
where T is a random number generated from a uniform distribution on [0, 10].
Model 4: Peak Contamination. Z(t) = X(t) + cσM if T ≤ t ≤ T + l and Z(t) = X(t)
if t ∉ [T, T + l], where l = 2 and T is a random number from a uniform distribution on
[0, 10 − l].
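The contamination schemes of Models 1-4 can be collected into a single function; the per-curve indicator c ~ Bernoulli(q), the constant M and the window length l follow the definitions above, while the function name and seed handling are ours:

```python
import numpy as np

def contaminate(X, t, model=1, q=0.05, M=25.0, l=2.0, seed=0):
    """Apply contamination Model 1-4 to the rows (curves) of X."""
    rng = np.random.default_rng(seed)
    Z = X.copy()
    for i in range(X.shape[0]):
        if rng.random() >= q:                  # c = 0: curve left untouched
            continue
        sigma = rng.choice([-1.0, 1.0])        # sign used by the symmetric models
        if model == 1:                         # asymmetric: shift whole curve up
            Z[i] += M
        elif model == 2:                       # symmetric: shift whole curve
            Z[i] += sigma * M
        elif model == 3:                       # partial: contaminate t > T
            T = rng.uniform(0.0, 10.0)
            Z[i, t > T] += sigma * M
        elif model == 4:                       # peak: contaminate T <= t <= T + l
            T = rng.uniform(0.0, 10.0 - l)
            Z[i, (t >= T) & (t <= T + l)] += sigma * M
    return Z

# Example: peak contamination of flat curves makes the shifts easy to see
t = np.linspace(0, 10, 21)
X = np.zeros((50, 21))
Z = contaminate(X, t, model=4, q=0.10, M=25.0)
```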

The effects of these different types of contamination are shown in Figure 2. The logistic model
in (2.4) was fit with the coefficient matrix as the covariate. The Hosmer-Lemeshow (14) goodness
of fit test was also carried out, indicating that the model is valid for the generated data. In the
final step, we obtained the approximate estimates of the parameter function using the proposed
robust approach (RPCA). These estimates are compared with the maximum likelihood
estimate (MLE) and the classical principal component estimate (CPCA) discussed by Escabias et al.
(5). The MSEB was used to determine how many PCs to retain in the model, and smaller values of
MSEB also indicate better estimation when these three estimation techniques are compared. All
simulations were implemented using R (21). The packages fda (23) and rrcov (28) were particularly
useful in obtaining the functional form of the data and performing principal component estimation,
respectively.

Figure 2. Sampling curves for Models 1-4 at q = 5%
The simulations were replicated 200 times; Tables 1 and 2 summarize the effect that
contamination has on the median MSEB of the models. The median was used to reduce the influence
of extreme replications, especially in the case of the maximum likelihood estimator. In most
of the models, especially Models 3 and 4, the robust approach yields better results at the varying
contamination levels. Higher contamination levels (i.e. q = 15%, 20%) were also attempted, with
similar conclusions; however, high contamination levels make it difficult for the logistic
regression model to distinguish between the contaminated and the uncontaminated data. The
estimates from the ML approach were unstable and inefficient, as expected, and their median MSEB
values are excluded from the tables.
Figure 3. Overlay comparison of the effect of outlier curves on classical PCR and robust PCR when the first 4 PCs are included in the model at 5% contamination: (a) Model 0, (b) Model 1, (c) Model 2, (d) Model 3, (e) Model 4

Figure 3 shows the median estimates of the parameter function for the two PC estimation
techniques. In this instance, 4 PCs were retained in the regression models and the median β(t)
estimates are compared for the robust and classical PCA approaches; 4 was the typical (median)
number of PCs retained for the optimal model for both methods and for the
Table 1. Median MSEB (standard error) for the estimation of the functional parameter based on the optimum model for Model 1 and Model 2

                 Asymmetric              Symmetric
Cont. (%)     CPCA       RPCA        CPCA       RPCA
 0            0.1903     0.1935      0.1903     0.1935
             (0.1525)   (0.1537)    (0.1525)   (0.1537)
 5            0.1970     0.1731      0.1995     0.1615
             (0.1479)   (0.1437)    (0.1363)   (0.1372)
10            0.1802     0.1780      0.1844     0.1728
             (0.1316)   (0.1240)    (0.1248)   (0.1254)
15            0.1795     0.1862      0.1798     0.1768
             (0.1215)   (0.1210)    (0.1181)   (0.1198)
20            0.1854     0.1824      0.1815     0.1769
             (0.1181)   (0.1189)    (0.1133)   (0.1164)

Table 2. Median MSEB (standard error) for the estimation of the functional parameter based on the optimum model for Model 3 and Model 4

                  Partial                 Peak
Cont. (%)     CPCA       RPCA        CPCA       RPCA
 0            0.1790     0.1839      0.1790     0.1839
             (0.1543)   (0.2566)    (0.1543)   (0.2566)
 5            0.2929     0.2152      0.2866     0.2263
             (0.1444)   (0.1395)    (0.1477)   (0.1387)
10            0.2833     0.2416      0.3059     0.2554
             (0.1463)   (0.1358)    (0.1417)   (0.1372)
15            0.2806     0.2504      0.2879     0.2594
             (0.1458)   (0.1414)    (0.1437)   (0.1427)
20            0.2779     0.2787      0.2878     0.3070
             (0.1452)   (0.1477)    (0.1422)   (0.1386)

different contamination models. The effects of contamination are evident: when there are no
functional outliers (Model 0), the β(t) estimates are essentially the same, as shown in Figure 3(a).
However, the classical PCA estimates deteriorate at 5% contamination, especially under the partial
and peak contamination models. In this overlay comparison of the robust method against the
non-robust approach, the robust estimate remains closer to the true simulated curve than the
non-robust PC estimate. Therefore, we obtain better parameter estimates by using the proposed
method when there are outliers present in the data.

3.2. Canadian Weather Data.

In their paper on modeling environmental data by functional principal component estimation,
Escabias et al. (6) used the Canadian weather data from Ramsay and Silverman (22) to predict
the risk of drought (Y = 1 for a station with no drought risk; Y = 0 for a station with drought
risk) based on the monthly average temperatures recorded over 12 months. The annual precipitation
for each area was used to determine whether the area was at risk of drought: an area is said to
have drought risk if its total annual precipitation is lower than the 25th percentile of the total
annual precipitations across the entire country.
There are 23 samples representing the weather stations, each with 12 mean monthly temperatures
recorded. Figure 4 shows the sample curves for the 23 weather stations with an indication of the
drought risk for each station. In this dataset, n1 = 9 stations have drought risk and the remaining
n2 = 14 do not. Due to the sinusoidal nature of the sample curves, the Fourier basis was used to
approximate the temperature function for each of the weather stations. The cross-validation method
was used to determine the order of the expansion for the functional covariate: 11 basis functions
are used, with smoothing parameter λ = 0.0009765625.

Figure 4. The mean monthly temperature for 23 Canadian weather stations used
to predict the risk of drought

We introduce an outlying sample curve by reducing all the temperatures for the Churchill station
by 10 degrees Celsius and slightly changing the pattern of the temperature curve by stretching it
by a factor of 0.675. Figure 5 illustrates the effect of that shift and stretch on the original data.
The AIC was used to determine how many PCs to retain in the principal component-based methods.

Figure 5. Churchill weather station's temperatures are altered: (a) before, (b) after

Table 3 gives the details of the retained PCs for the classical principal component
analysis (CPCA) approach as well as our proposed robust principal component analysis (RPCA).
In both cases, the first three PCs were retained, and both models have an AIC of 8. There is little
difference in the percent of variation explained (PVE) by the inclusion of these PCs in the logistic
model, with every model explaining over 99% of the variance.
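As a sketch of the AIC-based retention step, the code below fits nested logistic models on the first m PC scores and keeps the m minimizing AIC. The plain Newton-Raphson fit and the nested-model search are our simplifications; the paper's exact selection procedure may differ.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson (IRLS).
    Returns coefficients (with intercept) and the log-likelihood."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        W = p * (1.0 - p)
        # small ridge term keeps the Hessian invertible
        H = X1.T @ (X1 * W[:, None]) + 1e-6 * np.eye(X1.shape[1])
        beta = beta + np.linalg.solve(H, X1.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X1 @ beta))
    eps = 1e-12
    loglik = float(np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
    return beta, loglik

def choose_pcs_by_aic(scores, y, max_pcs):
    """Retain the first m PC scores (m = 1..max_pcs) minimizing
    AIC = 2(m + 1) - 2*loglik (the +1 counts the intercept)."""
    best_m, best_aic = 1, np.inf
    for m in range(1, max_pcs + 1):
        _, loglik = fit_logistic(scores[:, :m], y)
        aic = 2 * (m + 1) - 2 * loglik
        if aic < best_aic:
            best_m, best_aic = m, aic
    return best_m, best_aic
```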

Table 3. The model details for each estimation technique

                  Original Sample             Contaminated Sample
        Retained PCs   AIC   PVE      Retained PCs   AIC   PVE
CPCA    1,2,3          8     99.62%   1,2,3          8     99.63%
RPCA    1,2,3          8     99.70%   1,2,3          8     99.72%

Due to multicollinearity issues, the maximum likelihood approach was not ideal in estimating
the parameter function β(t) of the functional logistic regression model for predicting the risk of
drought. Figure 6 shows the parameter estimate using maximum likelihood estimation.

Figure 6. Parameter estimation without using principal component estimation

Figure 7 shows the function parameter estimates of β(t) using the two different approaches of
principal component estimation. In the absence of any outlying curves, the non-robust and robust
principal component estimates of the parameter function are nearly identical. The effect of
contamination on the estimation of β(t) is quite noticeable for the non-robust approach. The
parameter estimate is distinctly different, and therefore one would have a different interpretation
when calculating the odds of drought for certain seasons or time intervals. On the other hand, the
presence of this outlying sample curve has minimal effect on the proposed robust approach.

Figure 7. Parameter estimation when Churchill weather station’s temperatures
are shifted, and its effect on the principal component estimation methods
((a) before; (b) after)

Goodness of fit measures were conducted for the three different approaches as summarized in
Table (4). The measures provided are the area under the ROC curve (AUC) as well as the goodness
of fit statistic (Z) and its p-value. For the goodness-of-fit test, the Le Cessie-van Houwelingen normal
test statistic for the unweighted sum of squared errors is used. This is defined as

    T̂r = Σ_{i=1}^{n} r̂si² / var(r̂si²),

where r̂si = Σ_{j=1}^{n} wij {(yj − π̂j)/√(π̂j(1 − π̂j))} and the wij's are weights. All three
approaches provide good fits
(p-value > 0.05). The robust PCA approach has the highest area under the ROC curve, especially
in the presence of the outlying sample curve: under contamination the robust approach attains an
AUC of 0.7125 whilst the non-robust approach attains 0.6755. Hosmer and Lemeshow (14) regard
discrimination at this level as acceptable, an indication that the model based on the robust PCs
better predicts the risk of drought.

Table 4. Goodness of fit measures

                          Original Sample               Contaminated Sample
Method           AUC      Z        p-value     AUC      Z        p-value
MLE              0.6476   -0.8869  0.3751      0.6672   -0.8152  0.4149
Classical PCA    0.6508   -0.6428  0.5204      0.6481   -1.0026  0.3161
Robust PCA       0.6587   -0.5862  0.5773      0.7125   -0.5259  0.5989
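The AUC values in Table 4 can be computed from fitted probabilities with the rank (Mann-Whitney) identity; a minimal sketch, with toy scores of our own:

```python
import numpy as np

def auc(scores, y):
    """Area under the ROC curve via the rank (Mann-Whitney) identity:
    the fraction of (positive, negative) pairs ranked correctly,
    counting ties as one half."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(y)
    pos, neg = scores[y == 1], scores[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # → 1.0
```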

4. Conclusion

The objective of this paper is to suggest a robust estimation technique for the functional logistic
regression model. The estimation of the functional parameter in this model cannot be achieved by
the regular method of maximum likelihood. Therefore, we approximate the functional observations
and define the parameter function in a finite-dimensional space generated by a known basis. This
reduces the functional model to a multiple logistic regression model with highly collinear covariates.
The presence of multicollinearity and outliers causes the estimators from this model to be unstable
and therefore unreliable.
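The dimension-reduction step just described can be sketched as follows. Classical SVD-based scores are shown for brevity; the robust variant would substitute a robust PCA (e.g. ROBPCA (15)) for the decomposition.

```python
import numpy as np

def pc_scores(X, n_pcs):
    """Classical principal component scores of the n x p matrix of basis
    coefficients. A robust PCA (e.g. ROBPCA) would replace this SVD step
    in the robust variant; the classical version is shown here."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_pcs]
    return Xc @ loadings.T, loadings

# The retained scores then replace the collinear basis coefficients as
# covariates in an ordinary multiple logistic regression.
```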
Robust estimation methods for the functional logistic model are therefore an important tool in
estimation of the parameter function derived in this manner. In this paper, we have proposed an
approach that makes use of robust principal component estimation, which reduces dimensionality
and improves the estimation of the parameter function in the presence of multicollinearity
and outliers. The simulation study has shown that, in the presence of outliers, this approach
yields better estimates of the parameter function, and subsequently better interpretation of the
model. We also illustrated the improved performance of the proposed method by analysing a real
data set.

References

[1] L.S. Aucott, P.H. Garthwaite, and J. Currall. Regression methods for high dimensional multi-
collinear data. Communications in Statistics: Simulation and Computation, 29(4):1021 – 1037,
2000.
[2] J. L. Bali, G. Boente, D. E. Tyler, and J.-L. Wang. Robust functional principal components:
A projection-pursuit approach. The Annals of Statistics, 39:2852 – 2882, 2011.
[3] G. Boente and R. Fraiman. Discussion of robust principal components for functional data by
Locantore et al. Test, 8:28 – 35, 1999.
[4] H. Cardot, F. Ferraty, and P. Sarda. Functional linear model. Statistics and Probability Letters,
45:11 – 22, 1999.
[5] M. Escabias, A.M. Aguilera, and M.J. Valderrama. Principal component estimation of func-
tional logistic regression: Discussion of two different approaches. Journal of Nonparametric
Statistics, 16(3 – 4):365 – 384, 2004.
[6] M. Escabias, A.M. Aguilera, and M.J. Valderrama. Modeling environment data by functional
principal component logistic regression. Environmetrics, 16:95 – 107, 2005.
[7] M. Febrero, P. Galeano, and W. González-Manteiga. Outlier detection in functional data by
depth measures, with application to identify abnormal NOx levels. Environmetrics, 19:331 –
345, 2007.
[8] F. Ferraty and P. Vieu. Nonparametric Functional Data Analysis: Theory and Practice.
Springer, 2006.
[9] R. Fraiman and G. Muniz. Trimmed means for functional data. Test, 10:419 – 440, 2001.
[10] D. Gervini. Robust functional estimation using the median and spherical principal components.
Biometrika, 95:587 – 600, 2008.
[11] D. Gervini. Detecting and handling outlying trajectories in irregularly sampled functional
datasets. The Annals of Applied Statistics, 3:1758 – 1775, 2009.
[12] J. Goldsmith, J. Bobb, C.M. Crainiceanu, B. Caffo, and D. Reich. Penalized functional regres-
sion. Journal of Computational and Graphical Statistics, 20:830 – 851, 2011.
[13] R.R. Hocking. The analysis and selection of variables in linear regression. Biometrics, 32:1 –
49, 1976.
[14] D.W. Hosmer and S. Lemeshow. Applied Logistic Regression. Wiley, second edition, 2000.

21
[15] M. Hubert, P.J. Rousseeuw, and K.V. Branden. ROBPCA: a new approach to robust principal
component analysis. Technometrics, 47(1):64 – 79, 2005.
[16] G.M. James. Generalized linear models with functional predictors. Journal of the Royal
Statistical Society, Series B, 64(3):411 – 432, 2002.
[17] S. Le Cessie and J.C. van Houwelingen. Ridge estimators in logistic regression. Journal of the
Royal Statistical Society, Series C, 41(1):191 – 201, 1992.
[18] X. Leng and H.G. Müller. Classification using functional data analysis for temporal gene
expression data. Bioinformatics, 22:68 – 76, 2006.
[19] H.G. Müller and U. Stadtmüller. Generalized functional linear models. The Annals of
Statistics, 33(2):774 – 805, 2005.
[20] R.T. Ogden and P.T. Reiss. Functional generalized linear models with images as predictors.
Biometrics, 66:61 – 69, 2010.
[21] R Development Core Team. R: A Language and Environment for Statistical Com-
puting. R Foundation for Statistical Computing, Vienna, Austria, 2011. URL
http://www.R-project.org. ISBN 3-900051-07-0.
[22] J.O. Ramsay and B.W. Silverman. Functional Data Analysis. Springer, second edition, 2005.
[23] J.O. Ramsay, H. Wickham, S. Graves, and G. Hooker. Functional Data Analysis, 2012. URL
http://www.cran.r-project.org/web/packages/fda.
[24] S.J. Ratcliffe, G.Z. Heller, and L.R. Leader. Functional data analysis with application to
periodically stimulated foetal heart rate data II: Functional logistic regression. Statistics in
Medicine, 21:1115 – 1127, 2002.
[25] P.T. Reiss, R.T. Ogden, J.J. Mann, and R.V. Parsey. Functional logistic regression with
PET imaging data: A voxel-level clinical diagnostic tool. Journal of Cerebral Blood Flow and
Metabolism, 25(S635), 2005.
[26] P. Sawant, N. Billor, and H. Shin. Functional outlier detection with robust functional principal
component analysis. Computational Statistics, 27(1):83 – 102, 2012.
[27] S.T. Tian. Functional data analysis in brain imaging studies. Frontiers in Psychology, 1(35),
2010.
[28] V. Todorov. Robust Location and Scatter Estimation and Robust Multivariate Analysis with
High Breakdown Point, 2012. URL http://www.cran.r-project.org/web/packages/rrcov.
