Functional Data Analysis with R
Emerging technologies generate data sets of increased size and complexity that require new or updated statistical inferential methods and scalable, reproducible software. These data sets often involve measurements of a continuous underlying process, and benefit from a functional data perspective. Functional Data Analysis with R presents many ideas for handling functional data, including dimension reduction techniques, smoothing, functional regression, structured decompositions of curves, and clustering. The idea is for the reader to be able to immediately reproduce the results in the book, implement these methods, and potentially design new methods and software that may be inspired by these approaches.
Features:
• Functional regression models receive a modern treatment that allows extensions to many practical scenarios
and development of state-of-the-art software.
• The connection between functional regression, penalized smoothing, and mixed effects models is used as the
cornerstone for inference.
• Multilevel, longitudinal, and structured functional data are discussed with emphasis on emerging functional
data structures.
• Methods for clustering functional data before and after smoothing are discussed.
• Multiple new functional data sets with dense and sparse sampling designs from various application areas are
presented, including the NHANES linked accelerometry and mortality data, COVID-19 mortality data, CD4
counts data, and the CONTENT child growth study.
• Step-by-step software implementations are included, along with a supplementary website (www.FunctionalDataAnalysis.com) featuring software, data, and tutorials.
• More than 100 plots for visualization of functional data are presented.
Functional Data Analysis with R is primarily aimed at undergraduate, master’s, and PhD students, as well as data
scientists and researchers working on functional data analysis. The book can be read at different levels and combines state-of-the-art software, methods, and inference. It can be used for self-learning, teaching, and research, and
will particularly appeal to anyone who is interested in practical methods for hands-on, problem-forward functional
data analysis. The reader should have some basic coding experience, but expertise in R is not required.
Ciprian M. Crainiceanu is Professor of Biostatistics at Johns Hopkins University working on wearable and implantable technology (WIT), signal processing, and clinical neuroimaging. He has extensive experience in mixed
effects modeling, semiparametric regression, and functional data analysis with application to data generated by
emerging technologies.
Jeff Goldsmith is Associate Dean for Data Science and Associate Professor of Biostatistics at the Columbia University Mailman School of Public Health. His work in functional data analysis includes methodological and computational advances with applications in reaching kinematics, wearable devices, and neuroimaging.
Andrew Leroux is an Assistant Professor of Biostatistics and Informatics at the University of Colorado. His
interests include the development of methodology in functional data analysis, particularly related to wearable
technologies and intensive longitudinal data.
Erjia Cui is an Assistant Professor of Biostatistics at the University of Minnesota. His research interests include
developing functional data analysis methods and semiparametric regression models with reproducible software,
with applications in wearable devices, mobile health, and imaging.
© 2024 Ciprian M. Crainiceanu, Jeff Goldsmith, Andrew Leroux, and Erjia Cui
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have
attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders
if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please
write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission
from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC please contact mpkbookspermissions@tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.
Names: Crainiceanu, Ciprian, author. | Goldsmith, Jeff, author. | Leroux, Andrew, author. | Cui, Erjia,
author.
Title: Functional data analysis with R / Ciprian Crainiceanu, Jeff Goldsmith, Andrew Leroux, and Erjia
Cui.
Description: First edition. | Boca Raton : CRC Press, 2024. |
Series: CRC monographs on statistics and applied probability | Includes bibliographical references and
index. | Summary: “Functional Data Analysis with R is primarily aimed at undergraduate, masters, and
PhD students, as well as data scientists and researchers working on functional data analysis. The book
can be read at different levels and combines state-of-the-art software, methods, and inference. It can be
used for self-learning, teaching, and research, and will particularly appeal to anyone who is interested in
practical methods for hands-on, problem-forward functional data analysis. The reader should have some
basic coding experience, but expertise in R is not required”-- Provided by publisher.
Identifiers: LCCN 2023041843 (print) | LCCN 2023041844 (ebook) | ISBN 9781032244716 (hbk) | ISBN
9781032244723 (pbk) | ISBN 9781003278726 (ebk)
Subjects: LCSH: Multivariate analysis. | Statistical functionals. | Functional analysis. | R (Computer
program language)
Classification: LCC QA278 .C73 2024 (print) | LCC QA278 (ebook) | DDC
519.5/35--dc23/eng/20231221
LC record available at https://lccn.loc.gov/2023041843
LC ebook record available at https://lccn.loc.gov/2023041844
DOI: 10.1201/9781003278726
Typeset in CMR10
by KnowledgeWorks Global Ltd.
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
To Bianca, Julia, and Adina,
may your life be as beautiful as you made mine.
Ciprian
To Tushar, mom, and dad, thank you for all you do to keep me
centered and sane.
To Sarina and Nikhil, you’re all a parent could ever ask for.
Never stop shining your light on the world.
Andrew
Preface xi
1 Basic Concepts 1
1.1 Introduction 1
1.2 Examples 3
1.2.1 NHANES 2011–2014 Accelerometry Data 3
1.2.2 COVID-19 US Mortality Data 7
1.2.3 CD4 Counts Data 12
1.2.4 The CONTENT Child Growth Study 15
1.3 Notation and Methodological Challenges 19
1.4 R Data Structures for Functional Observations 20
1.5 Notation 23
Bibliography 291
Index 313
Preface
Around the year 2000, several major areas of statistics were witnessing rapid changes:
functional data analysis, semiparametric regression, mixed effects models, and software
development. While none of these areas was new, they were all becoming more mature, and
their complementary ideas were setting the stage for new and rapid advancements. These
developments were the result of the work of thousands of statisticians, whose collective
achievements cannot be fully recognized in one monograph. We will try to describe some
of the watershed moments that directly influenced our work and this book. We will also
identify and contextualize our contributions to functional data analysis.
The Functional Data Analysis (FDA) book of Ramsay and Silverman [244, 245] was first
published in 1997 and, without a doubt, defined the field. It considered functions as the basic
unit of observation, and introduced new data structures, new methods, and new definitions.
This amplified the interest in FDA, especially with the emergence of new, larger, and more
complex data sets in the early 2000s. Around the same time, and largely independent of
the FDA literature, nonparametric modeling was subject to massive structural changes.
Starting in the early 1970s, the seminal papers of Grace Wahba and collaborators [54,
150, 303] were setting the stage for smoothing spline regression. Likely influenced by these
ideas, in 1986, Finbarr O’Sullivan [221] published the first paper on penalized splines (B-splines with a smaller number of knots and a penalty on the roughness of the regression
function). In 1996, Marx and Eilers [71] published a seminal paper on P-splines (similar to
O’Sullivan’s approach, but using a different penalty structure) and followed it up in 2002 by
showing that ideas can be extended to Generalized Additive Models (GAM) [72]. In 1999,
Brumback, Ruppert, and Wand [26] pointed out that regression models incorporating splines
with coefficient penalties can be viewed as particular cases of Generalized Linear Mixed
Models (GLMM). This idea was expanded upon in a series of papers that led to the highly
influential Semiparametric Regression book by Ruppert, Wand, and Carroll [258], which
was published in 2003. The book showed that semiparametric models could incorporate
additional covariates, random effects, and nonparametric smoothing components in a unified
mixed effects inferential framework. It also demonstrated how to implement these models
in existing mixed effects software. Simon Wood and his collaborators, in a series of papers
that culminated with the 2006 Generalized Additive Models book [315], set the current
standards for methods and software integration for GAM. The substantially updated 2017
second edition of this book [319] is now a classic reference for GAM.
In the early 2000s, the connection between functional data analysis, semiparametric regression, and mixed effects models was not yet apparent, though some early cross-pollination
work was starting to appear. In 1999, Marx and Eilers [192] introduced the idea of P-splines
for signal regression, which is closely related to the Functional Linear Model with a scalar
outcome and functional predictors described by Ramsay and Silverman; see also extensions
in the early 2000s [72, 171, 193]. In 2007, Reiss and Ogden [252] introduced a version of the
method proposed by Marx and Eilers [192] using a different penalty structure, described
methods for functional principal component regression (FPCR) and functional partial least squares
(FPLS), and noted the connection with the mixed effects model representation of penalized
splines described in [258]. In spite of these crucial advancements, by 2008 there was still no reliable FDA software for implementing these methods. In 2008, Wood gave a Royal Statistical Society (RSS) talk (https://rb.gy/o1zg5), where he showed how to use mgcv to
fit scalar-on-function regression (SoFR) models using “linear functional terms.” This talk
clarified the conceptual and practical connections between functional and semiparametric
regression; see pages 17–20 of his presentation. In a personal note, Wood mentioned that
his work was influenced by that of Eilers, Marx, Reiss, and Ogden, though he points to
Wahba’s 1990 book [304] and Tikhonov, 1963 [294] as his primary sources of inspiration. In
his words: “[Grace Wahba’s equation] (8.1.4), from Tikhonov, 1963, is essentially the signal
regression problem. It just took me a long time to think up the summation convention idea
that mgcv uses to implement this.” In 2011, Wood published the idea of penalized spline
estimation for the functional coefficient in the SoFR context; see Section 5.2 in his paper,
where methods are extended to incorporate non-Gaussian errors with multiple penalties.
Our methods and philosophy were also informed by many sources, including the now
classical references discussed above. However, we were heavily influenced by the mixed effects representation of semiparametric models introduced by Ruppert, Wand, and Carroll
[258]. Also, we were interested in the practical implementation and scalability of a variety of
FDA models beyond the SoFR model. The 2010 paper by Crainiceanu and Goldsmith [48]
and the 2011 paper led by Goldsmith and Bobb [102] outlined the philosophy and practice
underlying much of the functional regression chapters of this book: (1) where necessary,
project observed functions on a functional principal component basis to account for noise,
irregular observation grids, and/or missing data; (2) use rich-basis spline expansions for
functional coefficients and induce smoothing using penalties on the spline coefficients; (3)
identify the mixed effects models that correspond to the specific functional regression; and
(4) use existing mixed effects model software (in their case WinBUGS [187] and nlme [230],
respectively) to fit the model and conduct inference. Regardless of the underlying software
platform, one of our main contributions was to recognize the deep connections between
functional regression, penalized spline smoothing, and mixed effects inference. This allowed
extensions that incorporated multiple scalar covariates, random effects, and multiple functional observations with or without noise, with dense or sparse sampling patterns, and
complete or missing data. Over time, the inferential approach was extended to scalar-on-
function regression (SoFR), function-on-scalar regression (FoSR), and function-on-function
regression (FoFR). We have also contributed to increasing awareness of new data structures
and the need for validated and supported inferential software.
Around 2010–2011, Philip Reiss and Crainiceanu initiated a project to assemble existing
R functions for FDA. It was started as the package refund [105] for “REgression with
FUNctional Data,” though it never provided any refund, it was not only about regression,
and was not particularly easy to find on Google. However, it did bring together a group of
statisticians who were passionate about developing FDA software for a wide audience. We
would like to thank all of these contributors for their dedication and vision. The refund
package is currently maintained by Julia Wrobel.
Fabian Scheipl, Sonja Greven, and collaborators have led a series of transformative papers [128, 262, 263] that started to appear in 2015 and expanded functional regression in
many new directions. The 2015 paper by Ivanescu, Staicu, Scheipl, and Greven [128] showed
how to conduct function-on-function regression (FoFR) using the philosophy outlined by
Goldsmith, Bobb, and Crainiceanu [48, 102]. The paper made the connection to the “linear functional terms” implementation in mgcv, which merged previously disparate lines of
work in FDA. This series of papers led to substantial revisions of the refund package and
the addition of the powerful function pffr(), which provides a functional interface based
on the mgcv package. The function pfr(), initially developed by Goldsmith, was updated to
the same standard. Scheipl’s contributions to refund were transformative and set a new bar
for FDA software. Finally, the ideas came together and showed how functional regression
can be modeled semiparametrically using splines, smoothness can be induced via specific
penalties on parameters, and penalized models can be treated as mixed effects models, which
can be fit using modern software. This body of work provides much of the infrastructure of
Chapters 4, 5, and 6 of this book.
To address the larger and increasingly complex data applications, new methods were
required for Functional Principal Components Analysis (FPCA). To the best of our knowledge, in 2010 there was no working software for smoothing covariance matrices for functional
data with more than 300 observations per function. Luo Xiao was one of the main contributors who introduced the FAst Covariance Estimation (FACE), a principled method for
nonparametric smoothing of covariance operators for high and ultra-high dimensional functional data. Methods use “sandwich estimators” of covariance matrices that are guaranteed
to be symmetric and positive definite and were deployed in the refund::fpca.face()
function [331]. Xiao’s subsequent work on sparse and multivariate sparse FPCA
was deployed as the standalone functions face::face.sparse() [328, 329] and
mfaces::mface.sparse() [172, 173]. During the writing of this book, it became apparent that FPCA-like methods were also needed for non-Gaussian functional
data. Andrew Leroux and Wrobel led a paper on fast generalized FPCA (fastGFPCA) [167]
using local mixed effects models and deployed the accompanying fastGFPCA package [324].
These developments are highlighted in Chapters 2 and 3 of this book.
Much less work has been dedicated to survival analysis with functional predictors and, especially, to extending the semiparametric regression ideas to this context. In 2015, Jonathan
Gellar introduced the Penalized Functional Cox Regression [94], where the effect of the functional predictor on the log-hazard was modeled using penalized splines. However, methods
were not immediately deployed in mgcv because this option only became available in 2016
[322]. In subsequent publications, Leroux [164, 166] and Erjia Cui [55, 56] made clear the
connection to the “linear functional terms” in mgcv and substantially enlarged the range of
applications of survival analysis with functional predictors. This work provides the infrastructure for Chapter 7 of this book.
In 2009, Chongzhi Di, Crainiceanu, and collaborators introduced the concept of Multilevel Functional Principal Component Analysis (MFPCA) for functional data observed at multiple
visits (e.g., electroencephalograms at every 30 seconds during sleep at two visits several
years apart). They developed and deployed the refund::mfpca.sc() function. A much improved version of the software was deployed recently in the refund::mfpca.face() function
based on a paper led by Cui and Ruonan Li [58]. Much work has been dedicated to extending ideas to structured functional data [272, 273], led by Haochang Shou, longitudinal
functional data [109], led by Greven, and ultra-high dimensional data [345, 346], led by
Vadim Zipunnikov. Many others have provided contributions, including Ana-Maria Staicu,
Goldsmith, and Lei Huang. Fast methods for fixed effects inference in this context were
developed, among others, by Staicu [223] and Cui [57]. These methods required specialized
software to deal with the size and complexity of new data sets. This work forms the basis
of Chapter 8 of this book.
As we were writing this book we realized just how many open problems still remain.
Some of these problems have been addressed along the way; some are still left open. In the
end, we have tried to provide a set of coherent analytic tools based on statistically principled approaches. The core set of ideas is to model functional coefficients parametrically or nonparametrically using splines, penalize the spline coefficients, and conduct inference in the
resulting mixed effects model. The book is accompanied by detailed software and a website
http://www.FunctionalDataAnalysis.com that will continue to be updated.
We hope that you enjoy reading this book as much as we enjoyed writing it.
1 Basic Concepts
Our goal is to create the most useful book for the widest possible audience without theoretical, methodological, or computational compromise.
Our approach to statistics is to identify important scientific problems and meaningfully contribute to solving them through timely engagement with data. The development
of general-purpose methodology is motivated by this process, and must be accompanied
by computational tools that facilitate reproducibility and transparency. This “problem forward” approach is critical as technological advances rapidly increase the precision and volume of traditional measurements, produce completely new types of measurements, and open
new areas of scientific research.
Our experience in public health and medical research provides numerous examples of new
technologies that reshape scientific questions. For example, heart rate and blood pressure
used to be measured once a year during an annual medical exam. Wearable devices can now measure them continuously, including during the night, for weeks or months at a time.
The resulting data provide insights into blood pressure, hypertension, and health outcomes
and open completely new areas of research. New types of measurements are continuously
emerging, including physical activity measured by accelerometers, brain imaging, ecological
momentary assessments (EMA) via smart phone apps, daily infection and mortality during
the COVID-19 pandemic, or CD4 counts from the time of sero-conversion. These examples
and many others involve measurements of a continuous underlying process, and benefit from
a functional data perspective.
1.1 Introduction
Functional Data Analysis (FDA) provides a conceptual framework for analyzing functions
instead of or in addition to scalar measurements. For example, physical activity is a continuous process over the course of the day and can be observed for each individual; FDA
considers the complete physical activity trajectory in the analysis instead of reducing it to a
single scalar summary, such as the total daily activity. In this book we denote the observed
functions by Wi : S → R, where S is an interval (e.g., [0, 1] in R or [0, 1]^M in R^M), i is the
basic experimental unit (e.g., study participant), and Wi (s) is the functional observation
for unit i at s ∈ S. In general, the domain S does not need to be an interval, but for the
purposes of this book we will work under this assumption.
We often assume that Wi(s) = Xi(s) + εi(s), where Xi : S → R is the true functional process and the εi(s) are independent noise variables. We will see various generalizations of this definition, but for illustration purposes we use this notation; a small simulation sketch following the list below illustrates this measurement model. We briefly summarize the properties of functional data that can be used to better target the associated analytic methods:
• Continuity is the property of the observed functions, Wi(s), and true functional processes, Xi(s), which allows them to be sampled at a higher or lower resolution within S.
• Ordering is the property of the functional domain, S, which can be ordered and has a
distance.
• Self-consistency is the property of the observed functions, Wi (s), and true functional
processes, Xi(s), to be on the same scale and have the same interpretation for all experimental units, i, and functional arguments, s.
• Smoothness is the property of the true functional process, Xi (s), which is not expected
to change substantially for small changes in the functional argument, s.
• Colocalization is the property of the functional argument, s, which has the same interpretation for all observed functions, Wi(s), and true functional processes, Xi(s).
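To make the measurement model concrete, the following minimal simulation sketch generates a few true curves Xi(s), adds independent noise to obtain Wi(s), and plots the result. The specific choices of Xi(s) and the noise level are illustrative only and are not taken from the book.
#Minimal simulation of the measurement model W_i(s) = X_i(s) + e_i(s)
#(illustrative choices of X_i and noise level, not from the book)
set.seed(2024)
n <- 10
s <- seq(0, 1, length.out = 100)
#True smooth curves X_i(s): random combinations of sine and cosine
X <- t(sapply(1:n, function(i) sin(2 * pi * s) + rnorm(1, sd = 0.5) * cos(2 * pi * s)))
#Observed curves W_i(s): true curves plus independent noise e_i(s)
W <- X + matrix(rnorm(n * length(s), sd = 0.2), nrow = n)
matplot(s, t(W), type = "l", lty = 1, xlab = "s", ylab = "W(s)")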
These properties differentiate functional from multivariate data. As the functional argument, s ∈ S, is often time or space, FDA can be used for modeling temporal and/or
spatial processes. However, there is a fundamental difference between FDA and spatio-
temporal processes. Indeed, FDA assumes that the observed functions, Wi (s), and true
functional processes, Xi (s), depend on and are indexed by the experimental unit i. This
means that there are many repetitions of the time series or spatial processes, which is not
the case for time series or spatial analysis.
The FDA framework serves to guide methods development, interpretation, and exploratory analysis. We emphasize that the concept of continuously observed functions differs
from the practical reality that functions are observed over discrete grids that can be dense or
sparse, regularly spaced or irregular, and common or unique across functional observations.
Put differently, in practice, functional data are multivariate data with specific properties.
Tools for understanding functional data must bridge the conceptual and practical to produce
useful insights that reflect the data-generating and observation processes.
FDA has a long and rich tradition. Its beginnings can be traced at least to a paper
by C.R. Rao [247], who proposed to use Principal Component Analysis (PCA), a multivariate method, to analyze growth curves. Several monographs on FDA already exist,
including [86, 153, 242, 245]. In addition, several survey papers provide insights into current
developments [154, 205, 250, 299, 307]. This book is designed to complement the existing
literature by focusing on methods that (1) combine parametric, nonparametric, and mixed
effects components; (2) provide statistically principled approaches for estimation and inference; (3) allow users to seamlessly add or remove model components; (4) are associated
with high-quality, fast, and easy-to-modify R software; and (5) are intuitive and friendly to
scientific applications.
This book provides an introduction to FDA with R [240]. Two packages will be used
throughout the book: (1) refund [105], which contains a large number of FDA models
and many of the data sets used for illustration in this book; and (2) mgcv [317, 319], a
powerful inferential software developed for semiparametric inference. We will show how
this software, originally developed for semiparametric regression, can be adapted to FDA.
This is a crucial contribution of the book, which is built around the idea of providing
tools that can be readily used in practice. The book is accompanied by the web page
http://www.FunctionalDataAnalysis.com, which contains vignettes and R software for each
chapter of this book. All vignettes use the refund and mgcv packages, which are available
from CRAN and can be loaded into R [240] as follows.
library(refund)
library(mgcv)
General-purpose, stable, and fast software is the key to increasing the popularity of FDA
methods. The book will present the current version of the software, while acknowledging
that software is changing much faster than methodology. Thus, the book will change slowly,
while the web page http://www.FunctionalDataAnalysis.com and accompanying vignettes
will be adapted to the latest developments.
1.2 Examples
We now introduce several examples that illustrate the ubiquity and complexity of functional
data in modern research, and that will be revisited throughout the book. These examples
highlight various types of functional data sampling, including dense, regularly-spaced grids
that are common across participants, and sparse, irregular observations for each participant.
FIGURE 1.1: Physical activity data measured in MIMS for three study participants in the
NHANES 2011–2014 summarized at every minute of the day. Each study participant is
shown in one column and each row corresponds to a day of the week. The x-axis in each
panel is time in one-minute increments from midnight to midnight.
Friday had less than 95% of “good data” and were therefore excluded. The x-axis for each
panel is time in one-minute increments from midnight (beginning of the day) to midnight
(end of the day). The y-axis is MIMS, a measure of physical activity intensity.
Some features of the data become apparent during visual inspection of Figure 1.1. First,
activity during the night (0–6 AM) is reduced for the first two study participants, but
not for the third. Indeed, study participant SEQN 82410 has clearly more activity during
the night than during the day (note the consistent dip in activity between 12 PM and 6
PM). Second, there is substantial heterogeneity of the data from one minute to another
both within and between days. Third, data are positive and exhibit substantial skewness.
Fourth, the patterns of activity of study participant SEQN 75111 on Saturday and Sunday
are quite different from their pattern of activity on the other days of the week. Fifth, there
seems to be some day-to-day within-individual consistency of observations.
Having multiple days of minute-level physical activity for the same individual increases
the complexity and size of the data. A potential solution is to take averages at the same
time of the day within study participants. This is equivalent to averaging the curves in
Figure 1.1 by column at the same time of the day. This reduces the data to one function per
study participant, but ignores the visit-to-visit variability around the person-specific mean.
To illustrate the population-level data structure, Figure 1.2 displays the smooth means
of several groups within NHANES. Data were smoothed for visualization purposes; technical details on smoothing are discussed in Section 2.3. The left panel displays the average
physical activity data for individuals who died (blue line) and survived (red line). Mortality indicators were based on the NHANES mortality release file that included events up to
December 31, 2019. Mortality information was available for 8,713 of the 12,610 study participants. There were 832 deceased individuals and 7,881 who were still alive on December 31,
2019. The plot indicates that individuals who did not die had, on average, higher physical
activity throughout the day, with larger differences between 8 AM and 11 PM. This result
is consistent with the published literature on the association between physical activity and
mortality; see, for example, [64, 65, 136, 170, 259, 275, 292].
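The smoothing used for these group means is only for visualization; below is a minimal sketch of this type of penalized spline smoothing with mgcv on simulated data. This is not the authors' NHANES code: the curve shape and noise level are made up for illustration.
#Sketch: smooth a noisy minute-level mean curve with a penalized spline (simulated data)
library(mgcv)
set.seed(1)
minute <- 1:1440
raw_mean <- 10 + 8 * sin(2 * pi * (minute - 360) / 1440) + rnorm(1440, sd = 1)
fit <- gam(raw_mean ~ s(minute, k = 30)) #penalized spline smoother
plot(minute, raw_mean, pch = ".", xlab = "Minute of the day", ylab = "MIMS")
lines(minute, predict(fit), lwd = 2, col = "darkred")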
FIGURE 1.2: Average physical activity data (expressed in MIMS) in NHANES 2011–2014 as a function of the minute of the day in different groups. Left panel: deceased (blue line) and alive individuals (red line) as of December 31, 2019. Right panel: females (dashed lines) and males (solid lines) within age groups [18, 35] (red), (35, 50] (orange), (50, 65] (light blue), and (65, 80] (dark blue).
The right panel in Figure 1.2 displays the smooth average curves for groups stratified by age and gender. For illustration purposes, four age groups (in years) were used and identified by a different color: [18, 35] (red), (35, 50] (orange), (50, 65] (light blue), and (65, 80] (dark blue). Within each age group, data for females is shown as dashed lines and for males as solid lines. In all subgroups, physical activity averages are lower at night, increase sharply in the morning, and remain high during the day. The averages for the (50, 65] and (65, 80] age groups exhibit a steady decrease during the day. This pattern is not apparent in the younger
age groups. These findings are consistent with the activity patterns described in [265, 327].
In addition, for every age group, the average activity during the day is higher for females
compared to males. During the night, females have the same or slightly lower activity than
males. These results contradict the widely cited literature [296] which indicated that “Males
are more physically active than females.” However, they are consistent with [327], which
found that women are more active than men, especially among older individuals.
Rich, complex data as displayed in Figures 1.1 and 1.2 suggest multiple scientific problems, including (1) quantifying the association between physical activity patterns and health outcomes (e.g., prevalent diabetes or stroke) with or without adjustment for other covariates (e.g., age, gender, body mass index); (2) identifying which specific components of physical activity data are most predictive of future health outcomes (e.g., incident mortality or cardiovascular events); (3) visualizing the directions of variation in the data; (4)
investigating whether clusters exist and if they are scientifically meaningful; (5) evaluating
transformations of the data that may provide complementary information; (6) developing
prediction methods for missing observations (e.g., one hour of missing data for a person);
(7) quantifying whether the timing or fragmentation of physical activity provides additional
information above and beyond summary statistics (e.g., mean, standard deviation over the
day); (8) studying how much data are needed to identify a particular study participant; (9)
predicting the activity for the rest of the day given data up to a particular time and day
(e.g., 12 PM on Sunday); (10) determining what levels of data aggregation (e.g., minute,
hour, day) may be most useful for specific scientific questions; and (11) proposing data
generating mechanisms that could produce data similar to the observed data.
The daily physical activity curves have all the properties that define functional data:
continuity, ordering, self-consistency, smoothness, and colocalization. The measured process
is continuous, as physical activity is continuous. While MIMS were summarized at the
minute level, data aggregation could have been done at a finer (e.g., ten-, or one-second
intervals) or coarser (e.g., one- or two-hour intervals) scale. The functional data have the
ordering property, because the functional argument is time during the day, which is both
ordered and has a well-defined distance. The data and the measured process have the self-
consistency property because all observations are expressed in MIMS at the minute level.
The true functional process can be assumed to have the smoothness property, as one does not
expect physical activity to change substantially over short periods of time (e.g., one second).
The functional argument has the colocalization property, as the time when physical activity
is measured (e.g., 12:00 PM) has the same interpretation for every study participant and
day of measurement.
The observed data can be denoted as a function Wim : S → R+ , where Wim (s) is the
MIMS measurement at minute s ∈ S = {1, . . . , 1440} and day m = 1, . . . , Mi , where Mi is
the number of days with high-quality physical activity data for study participant i. Data complexity could be reduced by taking the average $\bar{W}_i(s) = \frac{1}{M_i}\sum_{m=1}^{M_i} W_{im}(s)$ at every minute s, or the average over days and minutes $\bar{W}_i = \frac{1}{M_i |S|}\sum_{s=1}^{|S|}\sum_{m=1}^{M_i} W_{im}(s)$, where |S|
denotes the number of elements in the domain S. Such reductions in complexity improve
interpretability and make analyses easier, though some information may be lost. Deciding
at what level to summarize the data without losing crucial information is an important goal
of FDA.
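A minimal sketch of the day-averaging step is shown below on simulated data; the object names and values are placeholders, not the NHANES variables. It computes the participant-specific mean curve by averaging, within each participant, the rows of a days-by-minutes activity matrix.
#Sketch: average multiple days of minute-level activity into one curve per participant
set.seed(1)
n_id <- 3; n_days <- 5; n_min <- 1440
#Hypothetical matrix: one row per participant-day, one column per minute
W <- matrix(rpois(n_id * n_days * n_min, lambda = 10), nrow = n_id * n_days)
id <- rep(1:n_id, each = n_days)
#Participant-specific mean curves: sum rows within participant, divide by number of days
W_bar <- rowsum(W, group = id) / as.vector(table(id))
dim(W_bar) #n_id participants by 1440 minutes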
Here we have identified the domain of the functions Wim (·) as S = {1, . . . , 1440}, which
is a finite set in R and does not satisfy the basic requirement that S is an interval. This
could be a major limitation as basic concepts such as continuity or smoothness of the
functions cannot be defined on the sampling domain S = {1, . . . , 1440}. This is due to
practical limitations of sampling that can only be done at a finite number of points. Here the
theoretical domain is [0, 1440] minutes, or [0, 24] hours, or [0, 1] days, depending on how we
normalize the domain. Recall that the functions have the continuity property, which assumes
that the function could be measured anywhere within this theoretical domain. While not
formally correct, we will refer to both of these domains as S to simplify the exposition;
whenever necessary we will indicate more precisely when we refer to the theoretical (e.g.,
S = [0, 1440]) or sampling (e.g., S = {1, . . . , 1440}) domain. This slight abuse of notation
will be used throughout the book and clarifications will be added, as needed.
#Load refund
library(refund)
#Load the COVID-19 data
data(COVID19)
Among other variables, this data set contains the US weekly number of all-cause deaths,
weekly number of deaths due to COVID-19 (as assessed on the death certificate), and
population size in the 50 US states plus Puerto Rico and District of Columbia as of July
1, 2020. Figure 1.3 displays the total weekly number of deaths in the US between the week
ending on January 14, 2017 and the week ending on December 12, 2020 for a total of
205 weeks. The original data source is the National Center for Health Statistics (NCHS)
and the data set link is called National and State Estimates of Excess Deaths. It can be
accessed from https://bit.ly/3wjMQBY. The file can be downloaded directly from https://bit.ly/3pMAAaA. The data stored in the COVID19 data set in the refund package contains an analytic version of these data as the variable US_weekly_mort.
In Figure 1.3, each dot corresponds to one week and the number of deaths is expressed in
thousands. For example, there were 61,114 deaths in the US in the week ending on January
14, 2017. Here we are interested in excess mortality in the first 52 weeks of 2020 compared
to the first 52 weeks of 2019. The first week of 2020 is the one ending on January 4, 2020
and the 52nd week is the one ending on December 26, 2020. There were 3,348,951 total
deaths in the US in the first 52 weeks of 2020 (red shaded area in Figure 1.3) and 2,852,747
deaths in the first 52 weeks of 2019 (blue shaded area in Figure 1.3). Thus, there were
496,204 more deaths in the US in the first 52 weeks of 2020 than in the first 52 weeks
of 2019. This is called the (raw) excess mortality in the first 52 weeks of the year. Here
we use this intuitive definition (number of deaths in 2020 minus the number of deaths in
2019), though slightly different definitions can be used. Indeed, note that the population
size increases from 2019 to 2020 and some additional deaths can be due to the increase in population. For example, the US population was 330,024,493 on December 26, 2020 and 329,147,064 on December 26, 2019, an increase of 877,429. Using the mortality rate in 2019 of 0.0087 (number of deaths divided by the total population), the expected increase in the number of deaths due to the larger population would be 7,634. Thus, the number of deaths associated with the natural increase in population is about 1.5% of the total excess all-cause deaths in 2020 compared to 2019.
FIGURE 1.3: Total weekly number of deaths in the US between January 14, 2017 and December 12, 2020. The COVID-19 epidemic is thought to have started in the US sometime between January and March 2020.
Figure 1.3 displays a higher mortality peak at the end of 2017 and beginning of 2018,
which is likely due to a severe flu season. The CDC estimates that in the 2017–2018 flu
season in the US there were “an estimated 35.5 million people getting sick with influenza,
16.5 million people going to a health care provider for their illness, 490,600 hospitalizations,
and 34,200 deaths from influenza” (https://bit.ly/3H8fa1b).
As indicated in Figure 1.3, the excess mortality can be calculated for every week from the
beginning of 2020. The blue dots in Figure 1.4 display this weekly excess all-cause mortality
as a function of time from January 2020. Excess mortality is positive in every week with an
average of 9,542 excess deaths per week for a total of 496,204 excess deaths in the first 52
weeks. Excess mortality is not a constant function over the year. For example, there were an
average of 1,066 all-cause excess deaths per week between January 1, 2020 and March 14,
2020. In contrast, there were an average of 14,948 all-cause excess deaths per week between
March 28, 2020 and June 23, 2020.
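The excess mortality computation itself is a simple difference of matched weeks. The sketch below uses simulated weekly counts as stand-ins for the US_weekly_mort variable; the counts and the week alignment are made up for illustration only.
#Sketch: weekly excess mortality as the difference between matched weeks (simulated counts)
set.seed(1)
deaths_2019 <- rpois(52, lambda = 55000) #hypothetical weekly all-cause deaths, 2019
deaths_2020 <- rpois(52, lambda = 64500) #hypothetical weekly all-cause deaths, 2020
weekly_excess <- deaths_2020 - deaths_2019
c(total_excess = sum(weekly_excess), average_weekly_excess = mean(weekly_excess))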
One of the most watched indicators of the severity of the pandemic in the US was
the number of deaths attributed to COVID-19. The data is made available by the US
Centers for Disease Control and Prevention (CDC) and can be downloaded directly from https://bit.ly/3iE2xjo. The data stored in the COVID19 data set in the refund package contains an analytic version of these data as the variable US_weekly_mort_CV19.
FIGURE 1.4: Total weekly number of deaths attributed to COVID-19 and excess mortality in the US. The x-axis is time expressed in weeks from the first week in 2020. Red dots correspond to the weekly number of deaths attributed to COVID-19. Blue dots indicate the difference in the total number of deaths between a particular week in 2020 and the corresponding week in 2019.
The red
dots in Figure 1.4 represent the weekly mortality attributed to COVID-19 according to the
death certificate. Visually, COVID-19 and all-cause excess mortality have a similar pattern
during the year with some important differences: (1) all-cause excess mortality is larger
than COVID-19 mortality every week; (2) the main association does not seem to be delayed
(lagged) in either direction; and (3) the difference between all-cause excess and COVID-19
mortality as a proportion of COVID-19 mortality is highest in the summer.
Figure 1.4 indicates that there were more excess deaths than COVID-19 attributed
deaths in each week of 2020. In fact, the total US all-cause excess deaths in the first 52 weeks
of 2020 was 496,204 compared to 365,122 deaths attributed to COVID-19. The difference is
131,082 deaths, or 35.9% more excess deaths than COVID-19 attributed deaths. So, what
are some potential sources for this discrepancy? In some cases, viral infection did occur
and caused death, though the primary cause of death was recorded as something else (e.g.,
cardiac or pulmonary failure). This could happen if death occurred after the infection had
already passed, infection was present and not detected, or infection was present but not
adjudicated as the primary cause of death. In other cases, viral infection did not occur, but
the person died due to mental or physical health stresses, isolation, or deferred health care.
There could also be other reasons that are not immediately apparent.
FIGURE 1.5: Each line represents the cumulative weekly all-cause excess mortality per million for each US state plus Puerto Rico and District of Columbia. Five states are emphasized: New Jersey (green), Louisiana (red), Maryland (blue), Texas (salmon), and California (plum).
In addition to data aggregated at the US national level, the COVID19 data contains similar data for each state as well as for Puerto Rico and the District of Columbia. The all-cause weekly excess mortality data for each state is stored as the variable States_excess_mortality in the COVID19 data set.
Figure 1.5 displays the total cumulative all-cause excess mortality per million in every
state in the US, Puerto Rico and District of Columbia. For each state, the weekly excess
mortality was obtained as described for the US in Figures 1.3 and 1.4. For every week, the
cumulative excess mortality was calculated by adding the excess mortality for every week
up to and including the current week. To make data comparable across states, cumulative
excess mortality was then divided by the estimated population of the state or territory on
July 1, 2020 and multiplied by 1,000,000. Every line represents a state or territory with the
trajectory for five states being emphasized: New Jersey (green), Louisiana (red), Maryland
(blue), Texas (salmon), and California (plum). For example, New Jersey had 1,916 excess
all-cause deaths per one million residents by April 30, 2020. This corresponds to a total of
17,019 excess all-cause deaths by April 30, 2020 because the population of New Jersey was
8,882,371 as of July 1, 2020 (the reference date for the population size).
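A minimal sketch of this standardization is shown below; the weekly excess counts are simulated, while the population value is the July 1, 2020 estimate for New Jersey quoted above.
#Sketch: cumulative all-cause excess deaths per million for one state
set.seed(2)
state_weekly_excess <- rpois(52, lambda = 400) - rpois(52, lambda = 250) #simulated weekly excess
state_pop <- 8882371 #New Jersey population on July 1, 2020
cum_excess_per_million <- cumsum(state_weekly_excess) / state_pop * 1e6
plot(1:52, cum_excess_per_million, type = "l",
     xlab = "Week of 2020", ylab = "Cumulative excess deaths per million")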
The trajectories for individual states exhibit substantial heterogeneity. For example,
New Jersey had the largest number of excess deaths per million in the US. Most of these
excess deaths were accumulated in the April–June period, with fewer between June and
November, and another increase in December. In contrast, California had a much lower
cumulative excess number of deaths per million, with a roughly constant increase during
2020. Maryland had about a third of New Jersey's excess deaths per million at the end of June and about half by the end of December.
FIGURE 1.6: Each line represents the cumulative COVID-19 mortality for each US state plus Puerto Rico and District of Columbia in 2020. Cumulative means that numbers are added as weeks go by. Five states are emphasized: New Jersey (green), Louisiana (red), Maryland (blue), Texas (salmon), and California (plum).
We now investigate the number of weekly deaths attributed to COVID-19 for each state in the US, which is stored as the variable States_CV19_mortality in the COVID19 data set.
Figure 1.6 is similar to Figure 1.5, but it displays the cumulative number of deaths attributed
to COVID-19 for each state per million residents. Each line corresponds to a state and a
few states are emphasized using the same color scheme as in Figure 1.5. The y-axis was kept
the same as in Figure 1.5 to illustrate that, in general, the number of cumulative COVID-19
deaths tends to be lower than the excess all-cause mortality. However, the main patterns
exhibit substantial similarities.
There are many scientific and methodological problems that arise from such a data set.
Here are a few examples: (1) quantifying the all-cause and COVID-19 mortality at the state
level as a function of time; (2) identifying whether the observed trajectories are affected
by geography, population characteristics, weather, mitigating interventions, or intervention
compliance; (3) investigating whether the strength of the association between reported
COVID-19 and all-cause excess mortality varies with time; (4) identifying which states are
the largest contributors to the observed excess mortality in the January–March period; (5)
quantifying the main directions of variation and clusters of state-specific mortality patterns;
(6) evaluating the distribution of the difference between all-cause excess and COVID-19
deaths as a function of state and time; (7) predicting the number of COVID-19 deaths
and infections based on the excess number of deaths; (8) evaluating dynamic prediction
models for mortality trajectories; (9) comparing different data transformations for analysis,
visualization, and communication of results; and (10) using data from countries with good
health statistics systems to estimate the burden of COVID-19 in other countries using
all-cause excess mortality.
In the COVID-19 example it is not immediately clear that data could be viewed as
functional. However, the partitioning of the data by state suggests that such an approach
could be useful, at least for visualization purposes. Note that data in Figures 1.5 and 1.6
are curves evaluated at every week of 2020. Thus, the measured process is continuous, as
observations could have been taken at a much finer (e.g., days or hours) or coarser (e.g.,
every month) time scale. Data are ordered by calendar time and are self-consistent because
the number or proportion of deaths has the same interpretation for each state and every
week. Moreover, one can assume that the true number of deaths is a smooth process as the
number of deaths is not expected to change substantially for small changes in time (e.g.,
one hour). Data are also colocalized, as calendar time has the same interpretation for each
state and territory.
The observed data can be denoted as functions Wim : S → R+ , where Wim (s) is the
number or cumulative number of deaths in state i per one million residents at time s ∈
S = {1, . . . , 52}. Here m ∈ {1, 2} denotes all-cause excess mortality (m = 1) and COVID-
19 attributed mortality (m = 2), respectively. Because each m refers to different types of
measurements on the same unit (in this case, US state), this type of data is referred to
as “multivariate” functional data. Observations can be modeled as scalars by focusing, for
example, on Wim (s) at one s at a time or on the average of Wim (s) over s for one m. FDA
focuses on analyzing the entire function or combination of functions, extracting information
using fewer assumptions, and suggesting functional summaries that may not be immediately
evident. Most importantly, FDA provides techniques for data visualization and exploratory
data analysis (EDA) in the original or a transformed data space.
Just as in the case of NHANES physical activity data, the domain of the functions Wim (·)
is S = {1, . . . , 52} expressed in weeks, which is a finite set that is not an interval. This is due
to practical limitations of sampling that can only be done at a finite number of points. Here
the theoretical domain is [0, 52] weeks, or [0, 12] months, or [0, 1] years, depending on how
we normalize the domain. Recall that the functions have the continuity property, which
assumes that the function could be measured anywhere within this theoretical domain.
While not formally correct, we will refer to both of these domains as S to simplify the
exposition.
#Load refund
library(refund)
#Load the CD4 data
data(cd4)
This data set contains the CD4 cell counts for 366 HIV-infected individuals from the Multicenter AIDS Cohort Study (MACS) [66, 144]. We would like to thank Professor Peter Diggle
for making this important de-identified data publicly available on his website and for giving
us permission to use it in this book. We would also like to thank the participants in this MACS sub-study. Figure 1.7 displays the log CD4 count for up to 18 months before and 42 months after seroconversion. Each line represents the log CD4 count for one study participant as a function of month, where month zero corresponds to seroconversion.
FIGURE 1.7: Each line represents the log CD4 count as a function of month, where month zero corresponds to seroconversion. Five study participants are identified using colors: green, red, blue, salmon, and plum.
There are a total of 1,888 data points, with between 1 and 11 (median 5) observations
per study participant. Five study participants are highlighted using colors: green, red, blue,
salmon, and plum. Some of the characteristics of these data include (1) there are few observations per curve; (2) the time of observations is not synchronized across individuals; and (3)
there is substantial visit-to-visit variation in log CD4 counts before and after seroconversion.
Figure 1.8 displays the same data as Figure 1.7 together with the raw (cyan dots) and smooth (dark red line) estimators of the mean. The raw mean is the average of the log CD4
counts of study participants who had a visit at that time. The raw mean exhibits substantial
variation and has a missing observation at time t = 0. The smooth mean estimator captures
the general shape of the raw estimator, but provides a more interpretable summary. For
example, the smooth estimator is relatively constant before seroconversion, declines rapidly
in the first 10–15 months after seroconversion, and continues to decline, but much more slowly after month 15. These characteristics are not immediately apparent in the raw mean or in the person-specific log CD4 trajectories displayed in Figure 1.7.
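A minimal sketch of the raw and smooth mean estimators is shown below. This is not the authors' code: it assumes that cd4 is a participants-by-months matrix with NA for months without a visit and with columns corresponding to months −18 to 42, which should be verified with str(cd4) before use.
#Sketch: point-wise raw mean and penalized spline smooth mean of log CD4 counts
library(refund)
library(mgcv)
data(cd4)
month <- -18:42 #assumed column order: months relative to seroconversion
log_cd4 <- log(cd4 + 1) #+1 guards against zero counts
raw_mean <- colMeans(log_cd4, na.rm = TRUE)
#Pool all (month, log CD4) pairs and fit a penalized spline; NA rows are dropped
df <- data.frame(y = as.vector(t(log_cd4)), s = rep(month, nrow(cd4)))
fit <- gam(y ~ s(s, k = 15), data = df)
smooth_mean <- predict(fit, newdata = data.frame(s = month))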
FIGURE 1.8: Each gray line represents the log CD4 count as a function of month, where month zero corresponds to seroconversion. The point-wise raw mean is shown as cyan dots. The smooth estimator of the mean is shown as a dark red line.
There are many scientific and methodological problems suggested by the CD4 data set. Here we identify a few: (1) estimating the time-varying mean, standard deviation, and quantiles of the log CD4 counts as a function of time; (2) producing confidence intervals for
these time-varying population parameters; (3) identifying whether there are specific subgroups that have different patterns over time; (4) designing analytic methods that work
with sparse data (few observations per curve that are not synchronized across individuals);
(5) predicting log CD4 observations for each individual at months when measurements were
not taken; (6) predicting the future observations for one individual given observations up to
a certain point (e.g., 10 months after seroconversion); (7) constructing confidence intervals
for these predictions; (8) quantifying the month-to-month measurement error (fluctuations
along the long-term trend); (9) studying whether the month-to-month measurement error
depends on person-specific characteristics, including average log CD4 count; and (10) designing realistic simulation studies that mimic the observed data structure to evaluate the
performance of analytic methods.
Data displayed in Figures 1.7 and 1.8 are observed at discrete time points and with
substantial visit-to-visit variability. We leave it as an exercise to argue that the CD4 data
has the characteristics of functional data: continuity, ordering, self-consistency, smoothness,
and colocalization.
The observed data has the structure {sij , Wi (sij )}, where Wi (sij ) is the log CD4 count
at time sij ∈ S = {−18, −17, . . . , 42}. Here i = 1, . . . , n is study participant, j = 1, . . . , pi
is the observation number, and pi is the number of observations for study participant
i. In statistics, this data structure is often encountered in longitudinal studies and is
typically modeled using linear mixed effects (LME) models [66, 87, 161, 196]. LMEs use
a pre-specified, typically parsimonious, structure of random effects (e.g., random intercepts and slopes) to capture the person-specific curves. Functional data analysis complements LMEs by considering more complex and/or data-dependent designs of random effects
[134, 254, 255, 283, 328, 334, 336]. It is worth noting that this data structure and problem
are equivalent to the matrix completion problem [29, 30, 214, 312]. Statistical approaches
can handle different levels of measurement error in the matrix entries, and produce both
point estimators and the associated uncertainty for each matrix entry.
In this example, one could think about the sampling domain as being S =
{−18, −17, . . . , 42} expressed in months. This is a finite set that is not an interval. The
theoretical domain is [−18, 42] in months from seroconversion, though the interval could
be normalized to [0, 1]. The difference from the NHANES and COVID-19 data sets is that
observations are not available at every point in S = {−18, −17, . . . , 42} for each individual.
Indeed, the minimum number of observations per individual is 1 and the maximum is 11,
with a median number of observations of 5, which is 100 × 5/(42 + 19) = 8.2% of the number
of possible time points between −18 and 42. This type of data is referred to in statistics as
“sparse functional data.” In strict mathematical terms this is a misnomer, as the sampling
domain S = {−18, −17, . . . , 42} is itself mathematically sparse in R. Here we will use the
definition that sparse functional data are observed functions Wi (sij ) where j = 1, . . . , pi , pi
is small (at most 20) at sampling points sij that are not identical across study participants.
Note that this is a property of the observed data Wi (sij ) and not of the true underlying
process, Xi(s), which could be observed/sampled at any point in [−18, 42]. While this definition is imprecise, it should be intuitive enough for the intents and purposes of this book.
We acknowledge that there may be other definitions and also that there is a continuum of
scientific examples between “dense, equally spaced functional data” and “sparse, unequally
spaced functional data.”
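A toy example of this sparse, long-format data structure {sij, Wi(sij)} is sketched below; the values are simulated and purely illustrative. Each study participant contributes a small number of observations at participant-specific months.
#Sketch: sparse functional data stored in long format (one row per observation)
set.seed(3)
n <- 4
p_i <- sample(2:6, n, replace = TRUE) #few observations per curve
sparse_df <- data.frame(
  id = rep(1:n, times = p_i),
  s = unlist(lapply(p_i, function(p) sort(sample(-18:42, p)))), #visit months
  W = rnorm(sum(p_i), mean = 6.5, sd = 0.6) #illustrative log CD4-like values
)
head(sparse_df)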
FIGURE 1.9: Longitudinal observations of z-score for length (zlen, first column) and z-score
for weight (zwei, second column) shown for males (first row) and females (second row) as a
function of day from birth. Data for two boys (shown as light and dark shades of red) and
two girls (shown as light and dark shades of blue) are highlighted. The same shade of color
identifies the same individual.
Moreover, not all planned visits were completed, which gave the data a quasi-sparse structure, as observations are not temporally synchronized across children.
We would like to thank Dr. William Checkley for making this important de-identified
data publicly available, and the members of the communities of Pampas de San Juan
de Miraflores and Nuevo Paraíso who participated in this study. The data can be loaded
directly using the refund R package as follows.
#Load refund
library(refund)
#Load the CONTENT data
data(content)
Figure 1.9 provides an illustration of the z-score for length (zlen) and z-score for weight
(zwei) variables collected in the CONTENT study. Data are also available on the origi-
nal scale, though for illustration purposes here we display these normalized measures. For
example, the zlen measurement is obtained by subtracting the mean and dividing by the
standard deviation of height for children of a given age, as provided by the World Health
Organization (WHO) growth charts.
Even though the study was designed to collect data up to age 2 (24 months), for visu-
alization purposes, observations are displayed only through day 600, as data become very
FIGURE 1.10: Histogram of the number of days from birth in the CONTENT study. There
are a total of 4,405 observations for 197 children.
sparse thereafter. Data for every individual are shown as a light gray line and four different
panels display the zlen (first column) and zwei (second column) variables as a function of
day from birth separately for males (first row) and females (second row). Data for two boys
are highlighted in the first row of panels in red. The lighter and darker shades of red are used
to identify the same individual in the two panels. A similar strategy is used to highlight
two girls using lighter and darker shades of blue. Note, for example, that both girls who
are highlighted start at about the same length and weight z-score, but their trajectories
are substantially different. The z-scores increase for both height and weight for the first girl
(data shown in darker blue) and decrease for the second girl (data shown in light blue).
Moreover, after day 250 the second girl seems to reverse the downward trend in the z-score
for weight, though that does not happen with her z-score for height, which continues to
decrease.
These data were analyzed in [127, 169] to dynamically predict the growth patterns of
children at any time point given the data up to that particular time. Figure 1.10 displays
the histogram of the number of days from birth in the CONTENT study. There are a total
of 4,405 observations for 197 children, out of which 2,006 (45.5% of total) are in the first 100
days and 3,299 (74.9% of total) are in the first 200 days from birth. Observations become
sparser after that, which can also be observed in Figure 1.9.
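The histogram in Figure 1.10 can be reproduced along the following lines. This is a minimal sketch that assumes the day from birth of each visit is stored in the column agedays of the content data frame; the actual variable name should be checked with names(content).
#Sketch: histogram of the visit days in the CONTENT study
#Assumes the visit day is stored in content$agedays
library(refund)
data(content)
hist(content$agedays, breaks = 50, xlab = "Day from birth",
     main = "Number of observations by day from birth")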
There are several problems suggested by the CONTENT growth study including (1)
estimating the marginal mean, standard deviation and quantiles of anthropometric mea-
surements as a function of time; (2) producing pointwise and joint confidence intervals for
these time-varying parameters; (3) identifying whether there are particular subgroups or in-
dividuals that have distinct patterns or individual observations; (4) conducting estimation
and inference on the individual growth trajectories; (5) quantifying the contemporaneous
and lagged correlations between various anthropometric measures; (6) estimating anthropo-
metric measures when observations were missing; (7) predicting future observations for one
individual given observations up to a certain point (e.g., 6 months after birth); (8) quan-
tifying the month-to-month measurement error and studying whether it is differential among
children; (9) developing methods that are designed for multivariate sparse data (few obser-
vations per curve) with the amount of sparsity varying along the observation domain; (10)
identifying outlying observations or patterns of growth that could be used as early warn-
ings of growth stunting; (11) developing methods for studying the longitudinal association
between multivariate growth outcomes and time-dependent exposures, such as infections;
and (12) designing realistic simulation scenarios that mimic the observed data structure to
evaluate the performance of analytic methods.
Data displayed in Figure 1.9 are observed at discrete time points and with substantial
visit-to-visit and participant-to-participant variability. These data have all the characteris-
tics of functional data: continuity, ordering, self-consistency, smoothness, and colocalization.
Indeed, data are continuous because growth curves could be sampled at any time point at
both higher and lower resolutions. The choice for the particular sampling resolution was a
balance between available resources and knowledge about the growth process of humans.
Data are also ordered as observations are sampled in time. That is, we know that a measure-
ment at week 3 was taken before a measurement at month 5 and we know exactly how far
apart the two measurements were taken. Also, the observed and true functional processes
have the self-consistency property as they are expressed in the same units of measurement.
For example, height is always measured in centimeters or is transformed into normalized
measures, such as zlen. Data are also smooth, as the growth process is expected to be grad-
ual and not have large minute-to-minute or even day-to-day fluctuations. Even potential
growth spurts are smooth processes characterized by faster growth but small day-to-day
variation. Observations are also colocalized, as the functional argument, time from birth,
has the same interpretation for all functions. For example, one month from birth means the
same thing for each baby.
The observed functional data in CONTENT has the structure {sij , Wim (sij )}, where
Wim : S → R is the mth anthropometric measurement at time s ∈ S ⊂ [0, 24] (expressed in
months from birth) for study participant i. Here the time of the observations, sij , depends
on the study participant, i, and visit number, j, but not the anthropometric measure, m.
The reason is that if a visit was completed, all anthropometric measures were collected.
However, this may not be the case for all studies and observations may depend on m in
other studies. Each such variation on sampling requires special attention and methods de-
velopment. In this example it is difficult to enumerate the entire sampling domain because
it is too large and observations are not equally spaced. One way to obtain this space in R is
sketched below.
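This is a minimal sketch that again assumes the visit day is stored in content$agedays; it simply lists the distinct observed time points.
#Sketch: enumerate the observed sampling points (in days from birth)
#Assumes the visit day is stored in content$agedays
sampling_points <- sort(unique(content$agedays))
length(sampling_points)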
A similar notation, Wim (s), was used to describe the NHANES data structure in Sec-
tion 1.2.1. In NHANES m referred to the day number from initiating the accelerometry
study. However, in the CONTENT study, m refers to the type of anthropometric measure.
Thus, while in NHANES functions indexed by m measure the same thing every day (e.g.,
physical activity at 12 PM), in CONTENT each function measures something different (e.g.,
zlen and zwei at month 2). In FDA one typically refers to the NHANES structure as mul-
tilevel and to the CONTENT structure as multivariate functional data. Another difference
is that data are not equally spaced within individuals and are not synchronized across in-
dividuals. Thus, the CONTENT data has a multivariate (multiple types of measurement),
functional (has all the characteristics of functional data), sparse (few observations per curve
that are not synchronized across individuals), and unequally spaced (observations were not
taken at equal intervals within study participants) structure. The CONTENT data is highly
complex and contains additional time-invariant (e.g., sex) and time-varying observations
(e.g., bacterial infections).
Like the CD4 counts data presented in Section 1.2.3, the CONTENT data is at the
interface between traditional linear mixed effects (LME) models and functional data. While
both approaches can be used, this is an example where FDA approaches are more reasonable,
at least as an exploratory tool to understand the potential hidden complexity of individual
trajectories. In these situations, one also starts to question or even test the standard residual
dependence assumptions in traditional LMEs. In the end, we will show that every FDA model is a
form of LME, but this will require some finesse and substantial methodological development.
Second, the density and number of observations at the study participant level could
vary substantially. Indeed, there could be as few as two or three to as many as hundreds
of millions of observations per study participant. Moreover, observations can be equally or
unequally spaced within and between study participants as well as when aggregated across
study participants. Each of these scenarios raises its own specific set of challenges.
Third, the complexity of individual and population trajectories is a priori unknown. Ex-
tracting information is thus a balancing act between model assumptions and signal structure
often in the presence of substantial noise. As shown in the examples in this chapter, func-
tional data are seldom linear and often non-stationary.
Fourth, the covariance structure within experimental units (e.g., study participants) re-
quires a new set of assumptions that cannot be directly extended from traditional statistical
models. For example, the independence and exchangeability assumptions from longitudinal
data analysis are, at best, suspect in many high-resolution FDA applications. The auto-
regressive assumption is probably far too restrictive as well, because it implies stationarity
of residuals and an exponential decrease of correlation as a function of distance. Moreover,
as sampling points are getting closer together (higher resolution) the structure of correlation
may change substantially. The unstructured correlation assumption is more appropriate for
FDA, but it requires the estimation of a very large dimensional correlation matrix. This
can raise computational challenges for moderate to high-dimensional functions.
Fifth, observed data may be non-Gaussian with high skewness and thicker than normal
tails. While much is known about univariate modeling of such data, much more needs to
be done when the marginal distributions of functional data exhibit such behavior. Binary
or Poisson functional data raise their own specific sets of challenges.
To understand the richness of FDA, one could think of all problems in traditional data
analysis where some of the scalar observations are replaced with functional observations.
This requires new modeling and computational tools to accommodate the change of all
or some measurements from scalars to high-dimensional, highly structured multivariate
vectors, matrices or arrays. The goal of this book is to address these problems by providing
a class of self-contained, coherent analytic methods that are computationally friendly. To
achieve this goal, we need three important components: dimensionality reduction, penalized
smoothing, and unified regression modeling via mixed effects model inference. Chapter 2
will introduce these ideas and principles.
For illustration purposes, we display below the “wide format” data structure of
the NHANES physical activity data. This is stored in the variable MIMS of the data object
nhanes_fda_with_r. This NHANES data consists of a 12,610 × 1,440 matrix, with columns
containing MIMS measurements from 12:00 AM to 11:59 PM. Here we rounded the
MIMS values to two decimals for illustration purposes, so the actual data may vary
slightly upon closer inspection. This data structure is familiar to many statisticians, and
can be useful in the implementation of specific methods, such as Functional Principal Com-
ponent Analysis (FPCA).
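As a sketch of what this looks like in practice, assuming the object nhanes_fda_with_r has been loaded and its MIMS component is the matrix described above, the dimensions and the upper-left corner of the matrix can be inspected as follows.
#Sketch: inspect the wide-format MIMS matrix (assumes nhanes_fda_with_r is loaded)
dim(nhanes_fda_with_r$MIMS)                  #should be 12610 x 1440
round(nhanes_fda_with_r$MIMS[1:3, 1:5], 2)   #first rows and minutes, rounded to 2 decimals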
It is possible to use matrices for data that are somewhat less simple, although care is
required. When data can be observed over the same grid but are sparse for each subject,
a matrix with missing entries can be used. For the CD4 data, observations are recorded
at months before or after seroconversion. The observation grid is integers from −18 to 42,
but any specific participant is measured only at a subset of these values. Data like these
can be stored in a relatively sparse matrix, again with rows for study units and columns
for elements of the observation grid. Our data examples focus on equally spaced grids, but
this is not required for functional data in general or for the use of matrices to store these
observations.
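For example, assuming the CD4 counts are stored as a 366 × 61 matrix cd4 with one row per study participant and one column per month from −18 to 42, and with missing visits coded as NA (this is how the data are distributed with the refund package, but the exact format should be verified with ?cd4), the sparsity of the matrix can be checked directly.
#Sketch: wide-format CD4 matrix with NA for missing visits
library(refund)
data(cd4)
dim(cd4)                            #366 study participants x 61 months
sum(!is.na(cd4))                    #number of observed entries
round(100 * mean(!is.na(cd4)), 1)   #percent of entries observed (about 8.5%)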
For illustration purposes, we display the CD4 count data in the same “wide format” used
for NHANES. The structure is similar to that of NHANES data, where each row corresponds
to an individual and each column corresponds to a potential sampling point, in this case a
month from seroconversion. However, in the CD4 data example most observations are not
available, as indicated by the NA fields. Indeed, as we discussed, only 1,888 data points are
available out of the 366 × 61 = 22,326 entries of the matrix, or 8.5%. One look at the
data matrix, knowing that less than 10% of the entries are observed, immediately suggests
that the matrix and the data are “sparse.” Note, however, that this concept refers
to the percent of non-missing entries in a matrix and not to the mathematical concept
of sparsity. In most of the book, “sparsity” will refer to matrix sparsity and not to the
mathematical concept of sparsity of a set.
Storing the CD4 data in wide format is not a problem because the matrix is relatively small
and does not take up much memory. However, this format is not efficient and could
become extremely cumbersome as data matrices grow in the number of rows and/or
columns. The number of columns can increase very quickly when the observations are
irregular across subjects and the union of sampling points across study participants is very
large. In the extreme, but commonly encountered, case when no two observations are taken
at exactly the same time, the number of columns of the matrix would be equal to the total
number of observations for all individuals. Additionally, observation grid values are not
directly accessible, and must be stored as column names or in a separate vector.
Using the “long format” for sparse functional data can address some disadvantages that
are associated with the “wide format.” In particular, a data matrix or frame with columns
for study unit ID, observation grid point, and measurement value can be used for dense or
sparse data and for regular or irregular observation grids, and makes the observation grid
explicit. Below we show the CD4 counts data in “long format,” where all the missing data
are no longer included. The price to pay is that we add the column ID, which contains many
repetitions, while the column time also contains some repetitions to explicitly indicate the
month where the sample was taken.
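A minimal sketch of the wide-to-long conversion, under the same assumption that cd4 is a matrix with rows indexing study participants and columns indexing months from −18 to 42, is shown below.
#Sketch: convert the wide-format CD4 matrix to long format and drop missing entries
months <- -18:42
cd4_long <- data.frame(ID = rep(seq_len(nrow(cd4)), times = ncol(cd4)),
                       time = rep(months, each = nrow(cd4)),
                       CD4 = as.vector(cd4))
cd4_long <- cd4_long[!is.na(cd4_long$CD4), ]
head(cd4_long)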
The long format of the data is much more memory efficient when data are sparse,
though these advantages can disappear or become disadvantages when data become denser.
For example, when the observation grid is common across subjects and there are many
observations for each study participant, the ID and time columns require substantial addi-
tional memory without providing additional information. Long format data may also repeat
subject-level covariates for each element of the observation grid, which further exacerbates
memory requirements. Moreover, complexity and memory allocation can increase substan-
tially when multiple functional variables are observed on different observation grids. From
a practical perspective, different software implementations require different data structures,
which can be a reason for frustration. In general refund tends to use the wide format of
the data, whereas our implementation of FDA in mgcv often uses the long format.
Given these considerations, we will use both the wide and long formats and we will
discuss when and how we make the transition between these formats. We recognize the
increased popularity of the tidyverse for visualization and exploratory data analysis, which
prefers the long format of the data. Over the last several years, many R users have gravitated
toward data frames for data storage. This shift has been facilitated by (and arguably is
attributable to) the development of the tidyverse collection of packages, which implement
general-purpose tools for data manipulation, visualization, and analysis.
The tidyfun [261] R package was developed to address issues that arise in the storage,
manipulation, and visualization of functional data. Beginning from the conceptual perspec-
tive that a complete curve is the basic unit of analysis, tidyfun introduces a data type
("tf") that represents and operates on functional data in a way that is analogous to nu-
meric data. This allows functional data to easily sit alongside other (scalar or functional)
observations in a data frame in a way that is integrated with a tidyverse-centric approach to
manipulation, exploratory analysis, and visualization. Where possible, tidyfun conserves
memory by avoiding data duplication.
We will use both the tidyverse and the usualverse (completely made up word) and
we will point out the various approaches to handling the data. In the end, it is a personal
choice of what tools to use, as long as the main inferential engine works.
One can reasonably ask why a book of methods places such an emphasis on data structures.
The reason is that this is a book on “functional data analysis with R” and not a book
on “functional data analysis without R.” Thus, in addition to methods and inference we
emphasize the practical implementation of methods and the combination of data structures,
code, and methods that is amenable to software development.
1.5 Notation
Throughout the book we will attempt to use notation that is consistent across chapters.
This will not be easy or perfect, as functional data analysis can test the limits of reasonable
notation. Indeed, the Latin and Greek alphabet using lower- and uppercase, bold and regular
font were heavily tested by the data structures discussed in this book. To provide some order
ahead of starting the book in earnest we introduce the following notation.
• Xi (sj ), Xi (sij ), Xim (·): same as Wi (sj ), Wi (sij ), Wim (·), but for the underlying, unob-
served, functional process
In this chapter we introduce some of the key methodological concepts that will be used
extensively throughout the book. Each method is important in itself, but it is the specific
combination of these methods that provides a coherent infrastructure for FDA inference
and software development. Understanding the details of each approach is not essential for
the application of these methods. Readers who are less interested in a deep dive into these
methods and more interested in applying them can skip this chapter for now.
#SVD of matrix W
SVD_of_W <- svd(W)
#PCA of matrix W
PCA_of_W <- princomp(W)
This provides an explicit linear decomposition of the data in terms of the functions, {vk (sj ) :
j = 1, . . . , p}, which are the columns of V and form an orthonormal basis in Rp . These right
singular vectors are often referred to as the main directions of variation in the functional
space. Because vk are orthonormal, the coefficients of this decomposition can be obtained
as
\[
d_k u_{ik} = \sum_{j=1}^{p} W_i(s_j) v_k(s_j) \, .
\]
Thus, dk uik is the inner product between the ith row of W (the data for study participant
i) and the kth column of V (the kth principal direction of variation in functional space).
We will show that $\{d_k^2 : k = 1, \ldots, K\}$ quantify the variability of the observed data
explained by the vectors $\{v_k(s_j) : j = 1, \ldots, p\}$ for $k = 1, \ldots, K$. The total variance of the
original data is
\[
\frac{1}{np} \sum_{i=1}^{n} \sum_{j=1}^{p} \{W_{\mathrm{raw},i}(s_j) - \overline{W}_{\mathrm{raw}}(s_j)\}^2 = \sum_{i=1}^{n} \sum_{j=1}^{p} W_i^2(s_j) \, ,
\]
which is equal to $\mathrm{tr}(\mathbf{W}^t\mathbf{W}) = \mathrm{tr}(\mathbf{V}\boldsymbol{\Sigma}^t\mathbf{U}^t\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^t)$, where $\mathrm{tr}(\mathbf{A})$ denotes the trace of matrix
$\mathbf{A}$. As $\mathbf{U}^t\mathbf{U} = \mathbf{I}_n$, $\mathrm{tr}(\mathbf{W}^t\mathbf{W}) = \mathrm{tr}(\mathbf{V}\boldsymbol{\Sigma}^t\boldsymbol{\Sigma}\mathbf{V}^t) = \mathrm{tr}(\boldsymbol{\Sigma}^t\boldsymbol{\Sigma}\mathbf{V}^t\mathbf{V})$, where we used the property
that $\mathrm{tr}(\mathbf{A}\mathbf{B}) = \mathrm{tr}(\mathbf{B}\mathbf{A})$ for $\mathbf{A} = \mathbf{V}$ and $\mathbf{B} = \boldsymbol{\Sigma}^t\boldsymbol{\Sigma}\mathbf{V}^t$. As $\mathbf{V}^t\mathbf{V} = \mathbf{I}_p$ and $\mathrm{tr}(\boldsymbol{\Sigma}^t\boldsymbol{\Sigma}) = \sum_{k=1}^{K} d_k^2$, it
follows that
\[
\sum_{i=1}^{n} \sum_{j=1}^{p} W_i^2(s_j) = \sum_{k=1}^{K} d_k^2 \, , \quad (2.2)
\]
indicating that the total variance is equal to the sum of squares of the singular values.
In practice, for every $s \in S$, $W_i(s)$ is often approximated by $\sum_{k=1}^{K_0} d_k u_{ik} v_k(s)$, that is,
by the first $K_0$ right singular vectors, where $0 \leq K_0 \leq K$. We now quantify the variance
explained by these $K_0$ right singular vectors. Denote by $\mathbf{V} = [\mathbf{V}_{K_0} | \mathbf{V}_{-K_0}]$ the partition of
$\mathbf{V}$ into the $p \times K_0$ dimensional sub-matrix $\mathbf{V}_{K_0}$ and the $p \times (p - K_0)$ dimensional sub-matrix
$\mathbf{V}_{-K_0}$ containing the first $K_0$ and the last $(p - K_0)$ columns of $\mathbf{V}$, respectively. Similarly,
denote by $\boldsymbol{\Sigma}_{K_0}$ and $\boldsymbol{\Sigma}_{-K_0}$ the sub-matrices of $\boldsymbol{\Sigma}$ that correspond to the first $K_0$ and last
$(K - K_0)$ singular values, respectively. With this notation, $\mathbf{W} = \mathbf{U}\boldsymbol{\Sigma}_{K_0}\mathbf{V}_{K_0}^t + \mathbf{U}\boldsymbol{\Sigma}_{-K_0}\mathbf{V}_{-K_0}^t$
or, equivalently, $\mathbf{W} - \mathbf{U}\boldsymbol{\Sigma}_{K_0}\mathbf{V}_{K_0}^t = \mathbf{U}\boldsymbol{\Sigma}_{-K_0}\mathbf{V}_{-K_0}^t$. Using a similar argument to the one
for the decomposition of the total variation, we obtain $\mathrm{tr}(\mathbf{V}_{-K_0}\boldsymbol{\Sigma}_{-K_0}^t\mathbf{U}^t\mathbf{U}\boldsymbol{\Sigma}_{-K_0}\mathbf{V}_{-K_0}^t) = \sum_{k=K_0+1}^{K} d_k^2$. Therefore,
\[
\mathrm{tr}\{(\mathbf{W} - \mathbf{U}\boldsymbol{\Sigma}_{K_0}\mathbf{V}_{K_0}^t)^t(\mathbf{W} - \mathbf{U}\boldsymbol{\Sigma}_{K_0}\mathbf{V}_{K_0}^t)\} = \sum_{k=K_0+1}^{K} d_k^2 \, .
\]
Equivalently,
\[
\sum_{i=1}^{n} \sum_{j=1}^{p} \Big\{ W_i(s_j) - \sum_{k=1}^{K_0} d_k u_{ik} v_k(s_j) \Big\}^2 = \sum_{k=K_0+1}^{K} d_k^2 \, . \quad (2.3)
\]
Equations (2.2) and (2.3) indicate that the first $K_0$ right singular vectors of $\mathbf{W}$ explain
$\sum_{k=1}^{K_0} d_k^2$ of the total variance of the data, or a fraction equal to $\sum_{k=1}^{K_0} d_k^2 / \sum_{k=1}^{K} d_k^2$. In
many applications $d_k^2$ decrease quickly with $k$, indicating that only a few $v_k(\cdot)$ functions are
enough to capture the variability in the observed data.
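For a data matrix W stored in R (rows indexing study participants, columns indexing the sampling points, centered as described above), these fractions can be computed directly from the singular values. This is a generic sketch, not code tied to a specific data set.
#Sketch: individual and cumulative proportion of variance explained
SVD_of_W <- svd(W)
d2 <- SVD_of_W$d^2
prop_var <- d2 / sum(d2)           #individual proportion for each right singular vector
cum_prop_var <- cumsum(prop_var)   #cumulative proportion explained by the first K0 vectors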
It can also be shown that for every $K_0 = 1, \ldots, K$
\[
\sum_{i=1}^{n} \sum_{j=1}^{p} \Big\{ W_i(s_j) - \sum_{k \neq K_0} d_k u_{ik} v_k(s_j) \Big\}^2 = d_{K_0}^2 \, , \quad (2.4)
\]
where the sum over $k \neq K_0$ is over all $k = 1, \ldots, K$ except $K_0$. Thus, the $K_0$th right
singular vector explains $d_{K_0}^2$ of the total variance, or a fraction equal to $d_{K_0}^2 / \sum_{k=1}^{K} d_k^2$. The
proof is similar to the one for equation (2.3), but partitions the matrix $\mathbf{V}$ into a sub-matrix
that contains its $K_0$th column and a sub-matrix that contains all its other columns.
In summary, equation (2.1) can be rewritten for every $s \in S$ as
\[
W_i(s) = \sum_{k=1}^{K_0} d_k u_{ik} v_k(s) + \sum_{k=K_0+1}^{K} d_k u_{ik} v_k(s) \, , \quad (2.5)
\]
where $\sum_{k=1}^{K_0} d_k u_{ik} v_k(s)$ is the approximation of $W_i(s)$ and $\sum_{k=K_0+1}^{K} d_k u_{ik} v_k(s)$ is the
approximation error with variance equal to $\sum_{k=K_0+1}^{K} d_k^2$. The number $K_0$ is typically chosen
to explain a given fraction of the total variance of the data, but other criteria could be used.
We now provide the matrix equivalent of the approximation in equation (2.5). Recall
that $W_i(s_j)$ is the $(i, j)$th entry of the matrix $\mathbf{W}$. If $\mathbf{u}_k$ and $\mathbf{v}_k$ denote the $k$th left and right
singular vectors of $\mathbf{W}$, the $(i, j)$ entry of the matrix $\mathbf{u}_k\mathbf{v}_k^t$ is equal to $u_{ik} v_k(s_j)$. Therefore,
the matrix format of equation (2.5) is
\[
\mathbf{W} = \sum_{k=1}^{K_0} d_k \mathbf{u}_k \mathbf{v}_k^t + \sum_{k=K_0+1}^{K} d_k \mathbf{u}_k \mathbf{v}_k^t \, . \quad (2.6)
\]
The matrix $\sum_{k=1}^{K_0} d_k \mathbf{u}_k \mathbf{v}_k^t$ is called the rank $K_0$ approximation of $\mathbf{W}$.
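In R, the rank $K_0$ approximation can be computed directly from the output of svd(); the sketch below continues the generic example above and the value of K0 is illustrative.
#Sketch: rank K0 approximation of the matrix W
K0 <- 2
U_K0 <- SVD_of_W$u[, 1:K0, drop = FALSE]
V_K0 <- SVD_of_W$v[, 1:K0, drop = FALSE]
d_K0 <- SVD_of_W$d[1:K0]
W_K0 <- U_K0 %*% diag(d_K0, nrow = K0) %*% t(V_K0)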
Once $\mathbf{v}_1$ is known, the projection of the data matrix on $\mathbf{v}_1$ is $\mathbf{A}_1\mathbf{v}_1^t$ and the residual
variation in the data is $\mathbf{W} - \mathbf{A}_1\mathbf{v}_1^t$, where $\mathbf{A}_1\mathbf{v}_1^t$ is an $n \times p$ dimensional matrix. Because
the $\mathbf{v}_k$ are orthonormal, it can be shown that $\mathbf{A}_1 = \mathbf{W}\mathbf{v}_1$ and the unexplained variation is
$\mathbf{W} - \mathbf{W}\mathbf{v}_1\mathbf{v}_1^t = \sum_{k=2}^{K} d_k\mathbf{u}_k\mathbf{v}_k^t$. Iterating with $\mathbf{W} - \mathbf{W}\mathbf{v}_1\mathbf{v}_1^t$ instead of $\mathbf{W}$, we obtain that
the second eigenfunction, $\mathbf{v}_2$, maximizes the residual variance after accounting for $\mathbf{v}_1$. The
process is then iterated.
PCA and SVD are closely connected, as $\mathbf{W}^t\mathbf{W} = \mathbf{V}\boldsymbol{\Sigma}^t\boldsymbol{\Sigma}\mathbf{V}^t$. Thus, if the $d_k^2$ are ordered such
that $d_1^2 \geq \ldots \geq d_K^2 \geq 0$, the $k$th right singular vector of $\mathbf{W}$ is equal to the $k$th eigenvector
of $\mathbf{W}^t\mathbf{W}$ and corresponds to the $k$th eigenvalue $\lambda_k = d_k^2$. Similarly, $\mathbf{W}\mathbf{W}^t = \mathbf{U}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^t\mathbf{U}^t$,
indicating that the $k$th left singular vector of $\mathbf{W}$ is equal to the $k$th eigenvector of $\mathbf{W}\mathbf{W}^t$
and corresponds to the $k$th eigenvalue $\lambda_k = d_k^2$.
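This equivalence is easy to verify numerically; the generic sketch below compares the eigen decomposition of W^tW with the SVD of W (eigenvectors and right singular vectors agree up to sign).
#Sketch: numerical check of the PCA/SVD connection
SVD_of_W <- svd(W)
eig_WtW <- eigen(t(W) %*% W, symmetric = TRUE)
K <- length(SVD_of_W$d)
max(abs(SVD_of_W$d^2 - eig_WtW$values[1:K]))             #eigenvalues equal squared singular values
max(abs(abs(SVD_of_W$v) - abs(eig_WtW$vectors[, 1:K])))  #eigenvectors equal right singular vectors up to sign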
SVD and PCA have been developed for multivariate data and can be applied to func-
tional data. There are, however, some specific considerations that apply to FDA: (1)
the data Wi (s) are functions of s and are expressed in the same units for all s; (2)
the mean function, W (s), and the main directions of variation in the functional space,
vk = {vk (sj ), j = 1, . . . , p}, are functions of s ∈ S; (3) these functions inherit and abide
by the rules induced by the organization of the space in S (e.g., they do not change too
much for small variations in $s$); (4) the correlation structure between $W_i(s)$ and $W_i(s')$ may
depend on $(s, s')$; and (5) the data may be observed with noise, which may substantially
affect the calculation and interpretation of {vk (sj ), j = 1, . . . , p}. For these reasons, FDA
often uses smoothing assumptions on Wi (·), W (·) and vk (·). These smoothing assumptions
provide a different flavor to PCA and SVD and give rise to functional PCA (FPCA) and
SVD (FSVD). While FPCA is better known in FDA, FSVD is a powerful technique that
is indispensable for higher dimensional (large p) applications. A more in-depth look at
smoothing in FDA is provided in Section 2.3.
obtained by multiplying Wt with the corresponding column of UΣ−1 . This requires O(n2 p)
operations. As, in general, we are only interested in the first K0 columns of V, the total
number of operations is of the order O(n2 pK0 ). Moreover, the operations do not require
loading the entire data set in the computer memory. Indeed, Wt UΣ−1 can be done by
loading one 1 × n dimensional row of Wt at a time.
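A generic sketch of this computation is shown below for an n × p matrix W with n much smaller than p; the object names and the value of K0 are illustrative.
#Sketch: obtain the first K0 right singular vectors from the eigen decomposition of W W^t
A <- W %*% t(W)                                   #n x n matrix
eig_A <- eigen(A, symmetric = TRUE)
U <- eig_A$vectors                                #left singular vectors of W
d <- sqrt(pmax(eig_A$values, 0))                  #singular values of W
K0 <- 5
V_K0 <- t(W) %*% U[, 1:K0] %*% diag(1 / d[1:K0], nrow = K0)  #columns of V via W^t U Sigma^{-1}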
The essential idea of this computational trick is to replace the diagonalization of the
large p × p dimensional matrix Wt W with the diagonalization of the much smaller n ×
n dimensional matrix WWt . When n is also large, this trick does not work. A simple
solution to address this problem is to sub-sample the rows of the matrix W to a tractable
sample size, say 2000. Sub-sampling can be repeated and right singular vectors can be
averaged across sub-samples. Other solutions include incremental, or streaming, approaches
[133, 203, 219, 285] and the power method [67, 158].
The incremental, or streaming, approaches start with a number of rows of W that
can be handled computationally. Then covariance operators, eigenvectors, and eigenvalues
are updated as new rows are added to the matrix W. The power method starts with the
n × n dimensional matrix A = WWt and an n × 1 dimensional random normal vector u0 ,
which is normalized u0 ← u0 /||u0 ||. Here ||a|| = (at a)1/2 is the norm induced by the inner
product in $\mathbb{R}^n$. The power method consists of calculating the updates $\mathbf{u}_{r+1} \leftarrow \mathbf{A}\mathbf{u}_r$ and
$\mathbf{u}_{r+1} \leftarrow \mathbf{u}_{r+1}/||\mathbf{u}_{r+1}||$. Under mild conditions, this approach yields the first eigenvector of $\mathbf{A}$,
which is the first left singular vector $\mathbf{u}_1$ of $\mathbf{W}$; the first right singular vector can then be obtained
as $\mathbf{v}_1 = \mathbf{W}^t\mathbf{u}_1/d_1$. The corresponding component can be subtracted (deflation) and the method
iterated to obtain the subsequent singular vectors. The computational trick here is that diagonalization of matrices is replaced
by matrix multiplications, which are much more computationally efficient.
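A minimal implementation of the power method, as described above, is sketched below; the deflation step needed for subsequent singular vectors is omitted for brevity.
#Sketch: power method for the first left/right singular vectors of W
power_svd1 <- function(W, n_iter = 200) {
  A <- W %*% t(W)                          #n x n matrix
  u <- rnorm(nrow(W))
  u <- u / sqrt(sum(u^2))                  #normalized random starting vector
  for (r in seq_len(n_iter)) {             #iterate u <- A u / ||A u||
    u <- as.vector(A %*% u)
    u <- u / sqrt(sum(u^2))
  }
  d1 <- sqrt(sum((t(W) %*% u)^2))          #first singular value
  v1 <- as.vector(t(W) %*% u) / d1         #first right singular vector
  list(u1 = u, d1 = d1, v1 = v1)
}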
We have found that sub-sampling is a very powerful, easy-to-use method and we recommend
it as a first-line approach in cases when both n and p are very large.
FIGURE 2.1: Each line represents the cumulative excess mortality for each state and two
territories in the US. The mean cumulative excess mortality in the US per one million
residents is shown as a dark red line.
vector d, and the right singular vectors, $\mathbf{v}_k$, are stored as columns in the matrix V.
Table 2.1 presents the individual and cumulative percent variance explained by the first
five right singular vectors. The first two right singular vectors explain 84% and 11.9% of
the variance, respectively, for a total of 95.9%. The first five right singular vectors explain
FIGURE 2.2: Each line represents the centered cumulative excess mortality for each state
in the US. Centered means that the average at every time point is equal to zero. Five states
are emphasized: New Jersey (green), Louisiana (red), Maryland (blue), Texas (salmon), and
California (plum).
a cumulative 99.7%, indicating that dimension reduction is quite effective in this particular
example. Recall that the right singular vectors are the functional principal components.
The next step is to visualize the two right singular vectors, which together explain 95.9%
of the variability. These are the vectors V[,1] and V[,2] in R notation and v1 and v2 in
statistical notation. Figure 2.3 displays the first (light coral) and second (dark coral) right
singular vectors. The interpretation of the first right singular vector is that the mortality
data for a state that has a positive coefficient (score) tends to (1) be closer to the US mean
between January and April; (2) have a sharp increase above the US mean between April
and June; and (3) be larger with a constant difference from the US mean between July
and December. The mortality data for a state that has a positive coefficient on the second
right singular vector tends to (1) have an even sharper increase between April and June
TABLE 2.1
All-cause cumulative excess mortality in 50 US states plus Puerto Rico and
District of Columbia. Individual and cumulative percent variance explained
by the first five right singular vectors (principal components).

                            Right singular vectors
  Variance             1        2        3        4        5
  Individual (%)    84.0%    11.9%     2.9%     0.6%     0.3%
  Cumulative (%)    84.0%    95.9%    98.8%    99.4%    99.7%
FIGURE 2.3: First two right singular vectors (principal components) for all-cause weekly
excess US mortality data in 2020. First right singular vector: light coral. Second singular
vector: dark coral.
relative to the US average; and (2) exhibit a decreased difference from the US mean as time
progresses from July to December. Of course, things are more complex, as the mean and
right singular vectors can compensate for one another in specific times of the year.
Individual state mortality data can be reconstructed for all states simultaneously. A
$K_0 = 2$ rank reconstruction of the data can be obtained as
\[
\widehat{W}_i(s) = \overline{W}_{\mathrm{US}}(s) + d_1 u_{i1} v_1(s) + d_2 u_{i2} v_2(s) \, ,
\]
which for New Jersey becomes
\[
\widehat{W}_{\mathrm{NJ}}(s) = \overline{W}_{\mathrm{US}}(s) + 0.49\, d_1 v_1(s) + 0.25\, d_2 v_2(s) \, ,
\]
FIGURE 2.4: All-cause excess mortality (solid lines) and predictions based on rank 2 SVD
(dashed lines) for five states in the US: New Jersey (green), Louisiana (red), Maryland
(blue), Texas (salmon), and California (plum).
where the coefficients 0.49 and 0.25 correspond to ui1 and ui2 , the (i, 1) and (i, 2) entries of
the matrix U (U in R), where i corresponds to New Jersey. These values can be calculated
in R as
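#Extract the first two scores (u_i1 and u_i2) for New Jersey
#(assumes the matrix U and the vector states are in the workspace, as described in the text)
U[which(states == "New Jersey"), 1:2]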
where states is the vector containing the names of US states and territories. We have used
the notation W US (s) instead of W (s) and WNJ (s) instead of Wi (s) to improve the precision
of notation. Both coefficients for v1 (·) and v2 (·) are positive, indicating that for New Jersey
there was a strong increase in mortality between April and June, a much slower increase
between June and November and a further larger increase in December. Even though neither
of the two components contained information about the increase in mortality in December,
the effect was accounted for by the mean; see, for example, the increase in the November-
December period in the mean in Figure 2.1.
All the coefficients, also known as scores, are stored in the matrix U. It is customary to
display these scores using scatter plots. For example,
plot(U[,1], U[,2])
produces a plot similar to the one shown in Figure 2.5. Every point in this graph represents
a state and the same five states were emphasized: New Jersey (green), Louisiana (red),
Maryland (blue), Texas (salmon), and California (plum). Note that New Jersey is the point
FIGURE 2.5: Scores on the first versus second right singular vectors for all-cause weekly
excess mortality in the US. Each dot is a state, Puerto Rico, or Washington DC. Five states
are emphasized: New Jersey (green), Louisiana (red), Maryland (blue), Texas (salmon), and
California (plum).
with the largest score on the first right singular vector and the third largest score on the
second right singular vector. Louisiana has the third largest score on the first right singular
vector, which is consistent with being among the states with highest all-cause mortality.
In contrast to New Jersey, the score for Louisiana on the second right singular vector is
negative indicating that its cumulative mortality data continues to increase away from the
US mean between May and November; see Figure 2.2.
The Kosambi-Karhunen-Loève (KKL) [143, 157, 184] theorem provides the explicit decomposition
of the process $W(s)$. Because the $\phi_k(s)$ form an orthonormal basis, the Gaussian
Process can be expanded as
\[
W(s) = \sum_{k=1}^{\infty} \xi_k \phi_k(s) \, ,
\]
where $\xi_k = \int_0^1 W(s)\phi_k(s)\,ds$, which does not depend on $s$. It is easy to show that
$E(\xi_k) = 0$ as
\[
E(\xi_k) = E\Big\{\int_0^1 W(s)\phi_k(s)\,ds\Big\} = \int_0^1 E\{W(s)\}\phi_k(s)\,ds = 0 \, .
\]
We can also show that $\mathrm{Cov}(\xi_k, \xi_l) = E(\xi_k\xi_l) = 0$ for $k \neq l$ and $\mathrm{Var}(\xi_k) = \lambda_k$. The proof
is shown below
\[
\begin{aligned}
E(\xi_k\xi_l) &= E\Big\{ \int_0^1\!\!\int_0^1 W(s)W(t)\phi_k(t)\phi_l(s)\,dt\,ds \Big\} \\
&= \int_0^1\!\!\int_0^1 E\{W(s)W(t)\}\phi_k(t)\phi_l(s)\,dt\,ds \\
&= \int_0^1 \Big\{\int_0^1 K_W(s, t)\phi_k(t)\,dt\Big\}\phi_l(s)\,ds \\
&= \lambda_k \int_0^1 \phi_k(s)\phi_l(s)\,ds \\
&= \lambda_k \delta_{kl} \, ,
\end{aligned}
\qquad (2.8)
\]
where $\delta_{kl} = 0$ if $k \neq l$ and $1$ otherwise. The second equality holds because of the change
of order of integrals (expectations), the third equality holds because of the definition of
$K_W(s, t)$, the fourth equality holds because $\phi_k(s)$ is the eigenfunction of $K_W(\cdot, \cdot)$ corresponding
to the eigenvalue $\lambda_k$, and the fifth equality holds because of the orthonormality
of the $\phi_k(s)$ functions. These results hold for any square-integrable process on $[0, 1]$ and do not
require Gaussianity of the scores.
However, if the process is Gaussian, it can be shown that any finite collection
$\{\xi_{k_1}, \ldots, \xi_{k_l}\}$ is jointly Gaussian. Because the individual entries are uncorrelated and mean
zero, the scores are independent Gaussian random variables. One could reasonably ask,
why should one care about all these properties and whether this theory has any practical
implications. Below we identify some of the practical implications.
The expression “Gaussian Process” is quite intimidating, the definition is relatively tech-
nical, and it is not clear from the definition that such objects even exist. However, these
results show how to generate Gaussian Processes relatively easily. Indeed, the only ingredi-
ents we need are a set of orthonormal functions
√ φk (·) in L2 [0, 1]√
and a set of positive numbers
λ1 ≥ λ2 ≥ . . .. For example, if φ1 (s) = 2 sin(2πs), φ2 (s) = 2 cos(2πs), λ1 = 4, λ2 = 1,
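A minimal simulation sketch of this construction is shown below; it generates a few realizations of the process on an equally spaced grid using these two basis functions and variances.
#Sketch: simulate Gaussian Processes via the KKL construction
set.seed(1)
s <- seq(0, 1, length.out = 101)                   #sampling grid in [0, 1]
phi_1 <- sqrt(2) * sin(2 * pi * s)                 #first orthonormal function
phi_2 <- sqrt(2) * cos(2 * pi * s)                 #second orthonormal function
lambda <- c(4, 1)                                  #eigenvalues (score variances)
n <- 10                                            #number of simulated curves
xi <- cbind(rnorm(n, sd = sqrt(lambda[1])),        #independent N(0, lambda_k) scores
            rnorm(n, sd = sqrt(lambda[2])))
W_sim <- xi %*% rbind(phi_1, phi_2)                #each row is one realization W_i(s)
matplot(s, t(W_sim), type = "l", lty = 1, xlab = "s", ylab = "W(s)")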
\[
y_i = f(x_i) + \epsilon_i \, , \quad (2.9)
\]
where $\epsilon_i$ are independent identically distributed $N(0, \sigma^2_\epsilon)$ random variables. We denote by
$\mathbf{y} = (y_1, \ldots, y_n)^t$ and by $\mathbf{f} = \{f(x_1), \ldots, f(x_n)\}^t$. Here $f(\cdot)$ has either a specified parametric
form, such as the linear parametric function $f(x_i) = \beta_0 + \beta_1 x_i$, or an unspecified nonparametric
form with specific restrictions. Note that without any restrictions on $f(\cdot)$ the model
is not identifiable and therefore unusable. Many different restrictions have been proposed
for $f(\cdot)$ and almost all of them assume some degree of smoothness and/or continuity of the
function. Despite the large number of approaches, all practical solutions have the following
form $\widehat{\mathbf{f}} = \mathbf{S}\mathbf{y}$, where $\mathbf{y}$ is the $n \times 1$ dimensional vector with the $i$th entry equal to $y_i$
and $\mathbf{S}$ is an $n \times n$ dimensional symmetric smoother matrix. The residual sum of squares for
approximating $\mathbf{y}$ by $\widehat{\mathbf{f}} = \mathbf{S}\mathbf{y}$ is
\[
\mathrm{RSS} = ||\mathbf{y} - \widehat{\mathbf{f}}||^2 = ||\mathbf{y} - \mathbf{S}\mathbf{y}||^2 \, ,
\]
where $||\mathbf{a}||^2$ is the sum of squares of the entries of the vector $\mathbf{a}$. The minimum RSS is zero
and is obtained when $\widehat{\mathbf{f}} = \mathbf{y}$, when no restrictions are imposed on the function $f(\cdot)$. However,
this is less interesting and smoothing is concerned with minimizing a version of the RSS
when specific restrictions are imposed on $f(\cdot)$.
Here $B_1(x) = 1$ and $B_2(x) = x$ are basis functions. More flexible models are obtained by
adding polynomial and/or spline terms. For example, a quadratic truncated polynomial
regression spline has the form
\[
f(x_i) = \beta_1 + \beta_2 x_i + \beta_3 x_i^2 + \sum_{k=1}^{K} \beta_{k+3}(x_i - \kappa_k)_+^2 \, ,
\]
where $\kappa_1, \ldots, \kappa_K$ are knots and $a_+^2$ is equal to $a^2$ if $a > 0$ and $0$ otherwise. For didactic
purposes, we used the quadratic truncated polynomial regression spline, though any spline
basis can be used. The smoother matrix has the same form, $\mathbf{S} = \mathbf{X}(\mathbf{X}^t\mathbf{X})^{-1}\mathbf{X}^t$, though $\mathbf{X}$
has more columns
\[
\mathbf{X} =
\begin{bmatrix}
1 & x_1 & x_1^2 & (x_1 - \kappa_1)_+^2 & \ldots & (x_1 - \kappa_K)_+^2 \\
1 & x_2 & x_2^2 & (x_2 - \kappa_1)_+^2 & \ldots & (x_2 - \kappa_K)_+^2 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_n & x_n^2 & (x_n - \kappa_1)_+^2 & \ldots & (x_n - \kappa_K)_+^2
\end{bmatrix} .
\]
If $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_{K+3})^t$ and $\boldsymbol{\epsilon} = (\epsilon_1, \ldots, \epsilon_n)^t$, the model has the following matrix format
$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$. The BLUP for $\mathbf{y}$ is $\widehat{\mathbf{f}} = \mathbf{S}\mathbf{y} = \mathbf{X}\widehat{\boldsymbol{\beta}}$, where $\mathbf{S} = \mathbf{X}(\mathbf{X}^t\mathbf{X})^{-1}\mathbf{X}^t$ and $\widehat{\boldsymbol{\beta}} = (\mathbf{X}^t\mathbf{X})^{-1}\mathbf{X}^t\mathbf{y}$
is the solution to the minimization problem
\[
\min_{\boldsymbol{\beta}} \; ||\mathbf{y} - \mathbf{X}\boldsymbol{\beta}||^2 \, . \quad (2.10)
\]
Splines fit using minimization of the residual sum of squares (2.10) are referred to as regres-
sion splines. The type of basis (e.g., quadratic truncated polynomial) is used to define the
type of spline, whereas the term “regression spline” refers to the method used for fitting. In
Section 2.3.2 we will describe penalized splines, which use the same basis, but a different,
penalized minimization criterion.
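To make the construction concrete, the sketch below simulates data, builds the quadratic truncated polynomial design matrix, and fits the regression spline by ordinary least squares; all object names and the simulated example are illustrative.
#Sketch: quadratic truncated polynomial regression spline fit by least squares
set.seed(2)
n <- 200
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)                                  #simulated data
K <- 10
kappa <- quantile(x, probs = seq(0, 1, length.out = K + 2)[-c(1, K + 2)])  #interior knots
X <- cbind(1, x, x^2, sapply(kappa, function(k) pmax(x - k, 0)^2))         #design matrix
beta_hat <- solve(t(X) %*% X, t(X) %*% y)                                  #(X^t X)^{-1} X^t y
f_hat <- X %*% beta_hat                                                    #fitted regression spline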
\[
y_i = f(x_{i1}, x_{i2}) + \epsilon_i \, .
\]
The observed data consist of the $n \times 1$ dimensional vector $\mathbf{y} = (y_1, \ldots, y_n)^t$ and the
corresponding covariates $\mathbf{x}_1 = (x_{11}, x_{12}), \ldots, \mathbf{x}_n = (x_{n1}, x_{n2})$. The problem is to estimate
the bivariate function $f(\cdot, \cdot)$. Bivariate spline models are of the type
\[
f(\mathbf{x}) = \sum_{k=1}^{K} \beta_k B_k(\mathbf{x}) \, ,
\]
where $B_k(\cdot)$ is a basis in $\mathbb{R}^2$. Examples of such bases include tensor products of univariate
splines and thin-plate splines [258]. The bases for tensor products of univariate splines are of
the form $B_{k_1,k_2}(x_1, x_2) = B_{k_1,1}(x_1)B_{k_2,2}(x_2)$ for $k_1 = 1, \ldots, K_1$ and $k_2 = 1, \ldots, K_2$, where
$K_1$ and $K_2$ are the number of bases in the first and second dimension, respectively. In this
notation the total number of basis functions is $K = K_1 K_2$ and
\[
f(x_{i1}, x_{i2}) = \sum_{k_1=1}^{K_1} \sum_{k_2=1}^{K_2} \beta_{k_1 k_2} B_{k_1,1}(x_{i1}) B_{k_2,2}(x_{i2}) \, .
\]
The corresponding design matrix is
\[
\mathbf{X} =
\begin{bmatrix}
B_{1,1}(x_{11})B_{1,2}(x_{12}) & B_{1,1}(x_{11})B_{2,2}(x_{12}) & \ldots & B_{K_1,1}(x_{11})B_{K_2,2}(x_{12}) \\
B_{1,1}(x_{21})B_{1,2}(x_{22}) & B_{1,1}(x_{21})B_{2,2}(x_{22}) & \ldots & B_{K_1,1}(x_{21})B_{K_2,2}(x_{22}) \\
\vdots & \vdots & \ddots & \vdots \\
B_{1,1}(x_{n1})B_{1,2}(x_{n2}) & B_{1,1}(x_{n1})B_{2,2}(x_{n2}) & \ldots & B_{K_1,1}(x_{n1})B_{K_2,2}(x_{n2})
\end{bmatrix} .
\]
Thin plate spline bases are constructed a bit differently and require a set of points (knots)
in space, say $\boldsymbol{\kappa}_1, \ldots, \boldsymbol{\kappa}_K \in \mathbb{R}^2$, and the function $\varphi : [0, \infty) \to \mathbb{R}$, $\varphi(r) = r^2\log(r)$. The thin
plate spline basis is $B_1(\mathbf{x}) = 1$, $B_2(\mathbf{x}) = x_1$, $B_3(\mathbf{x}) = x_2$, and $B_{k+3}(\mathbf{x}) = \varphi(||\mathbf{x} - \boldsymbol{\kappa}_k||)$, for
$k = 1, \ldots, K$. The corresponding design matrix is
\[
\mathbf{X} =
\begin{bmatrix}
B_1(\mathbf{x}_1) & B_2(\mathbf{x}_1) & \ldots & B_{K+3}(\mathbf{x}_1) \\
B_1(\mathbf{x}_2) & B_2(\mathbf{x}_2) & \ldots & B_{K+3}(\mathbf{x}_2) \\
\vdots & \vdots & \ddots & \vdots \\
B_1(\mathbf{x}_n) & B_2(\mathbf{x}_n) & \ldots & B_{K+3}(\mathbf{x}_n)
\end{bmatrix} ,
\]
and the model parameter is β = (β1 , . . . , βK+3 ). Thin plate splines are a type of spline
smoother with a radial basis. Here we acknowledge the more general class of radial
smoothers, but we focus on thin plate splines. For thin plate splines we used K + 3 in-
stead of K to emphasize the special nature of the intercept, and first and second coordinate
bases.
In summary, just as in the univariate case, the multivariate regression spline model
has the following matrix format $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$. The best predictor for $\mathbf{y}$ is $\widehat{\mathbf{f}} = \mathbf{X}\widehat{\boldsymbol{\beta}}$, where
$\widehat{\boldsymbol{\beta}}$ minimizes the residual sum of squares criterion (2.10). The difference between tensor
products of splines and thin plate splines is that the knots for tensor products are arranged
in a rectangular shape. This makes them better suited for fitting rectangular or close to
rectangular surfaces. Thin plate splines can better adapt to irregular surfaces, as knots can
be placed where observations are. For standard regression to work one needs the matrix
Xt X to be invertible. A minimum requirement for that is for the number of observations,
n, to exceed the number of columns in matrix X, as was discussed in Section 2.3.1.2. For
example, the number of columns in X using a tensor product of splines is K1 K2 , which is
much larger than K1 +K2 +2 in the case discussed in Section 2.3.1.2. In general, the number
of columns in the design matrix grows much faster when we consider smoothing in higher
dimensions. While imperfect, it is a good rule to have at least n > 5K1 K2 observations
when running a tensor product regression. When the number of parameters exceeds the
number of observations our strategy will be to use penalized approaches, which impose
smoothing on parameters. The next section provides the necessary details. But, for now,
remember that every regression spline model is a standard regression. It can be easily
extended to non-Gaussian outcomes by simply changing the distribution of the error, $\epsilon_i$.
From an implementation perspective, one simply replaces the lm function with the glm
function in R.
where $\mathbf{y} = (y_1, \ldots, y_n)^t$, the columns of the matrix $\mathbf{X}$ correspond to the spline basis, $\boldsymbol{\beta}$ are
the parameters of the spline, and $\boldsymbol{\epsilon} = (\epsilon_1, \ldots, \epsilon_n)^t$. As discussed in Section 2.3, estimating
the model is equivalent to minimizing the sum of squares of residuals criterion described in
equation (2.10).
The problem with this approach is that it is hard to know how many basis functions
are enough to capture the complexity of the underlying mean function. One option is to
estimate the number of knots, but we have found this idea quite impractical especially in
the context when we might have additional covariates, nonparametric components, and/or
non-Gaussian data. An idea that works much better is to consider a rich spline basis and
add a quadratic penalty on the spline coefficients in equation (2.10). This is equivalent to
minimizing the following penalized sum of squares criterion
\[
\min_{\boldsymbol{\beta}} \; ||\mathbf{y} - \mathbf{X}\boldsymbol{\beta}||^2 + \lambda\boldsymbol{\beta}^t\mathbf{D}\boldsymbol{\beta} \, , \quad (2.11)
\]
where $\lambda \geq 0$ is a scalar and $\mathbf{D}$ is a matrix that provides the penalty structure for a specific
choice of spline basis. For example, the penalized quadratic truncated polynomial splines
use the penalty matrix
\[
\mathbf{D} =
\begin{bmatrix}
\mathbf{0}_{3 \times 3} & \mathbf{0}_{3 \times K} \\
\mathbf{0}_{K \times 3} & \mathbf{I}_K
\end{bmatrix} ,
\]
where $\mathbf{0}_{a \times b}$ is a matrix of zero entries with $a$ rows and $b$ columns and $\mathbf{I}_K$ is the identity
matrix of dimension $K$. In this example, the penalty is equal to $\lambda\sum_{k=4}^{K+3}\beta_k^2$ and leaves the
parameters $\beta_1$, $\beta_2$, $\beta_3$ unpenalized.
For every fixed $\lambda$, by setting the derivative with respect to $\boldsymbol{\beta}$ equal to zero in expression (2.11),
it can be shown that the minimum is achieved at $\widehat{\boldsymbol{\beta}}_\lambda = (\mathbf{X}^t\mathbf{X} + \lambda\mathbf{D})^{-1}\mathbf{X}^t\mathbf{y}$.
Thus, the predictor of $\mathbf{y}$ is $\widehat{\mathbf{f}}_\lambda = \mathbf{S}_\lambda\mathbf{y}$, where
\[
\mathbf{S}_\lambda = \mathbf{X}(\mathbf{X}^t\mathbf{X} + \lambda\mathbf{D})^{-1}\mathbf{X}^t \, . \quad (2.12)
\]
The scalar parameter $\lambda$ controls the amount of smoothing, which varies from the saturated
parametric model when $\lambda = 0$ (no penalty) to a parsimonious parametric model when $\lambda = \infty$
(e.g., the quadratic regression model in the penalized truncated polynomial case). The trace
of the $\mathbf{S}_\lambda$ matrix is referred to as the number of degrees of freedom of the smoother.
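Continuing the illustrative truncated polynomial sketch above (which defined X, y, and K), the penalized estimator and smoother matrix for a fixed λ can be computed directly; the value of λ here is arbitrary.
#Sketch: penalized spline fit for a fixed smoothing parameter lambda
D <- diag(c(rep(0, 3), rep(1, K)))                          #penalize only the truncated terms
lambda <- 10
beta_lambda <- solve(t(X) %*% X + lambda * D, t(X) %*% y)   #(X^t X + lambda D)^{-1} X^t y
S_lambda <- X %*% solve(t(X) %*% X + lambda * D, t(X))      #smoother matrix S_lambda
f_lambda <- S_lambda %*% y                                  #penalized fit
sum(diag(S_lambda))                                         #degrees of freedom of the smoother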
With this approach the smoothing problem is reduced to estimating λ. Some of the most
popular approaches for estimating λ are cross-validation (CV) [160, 209], generalized cross
validation (GCV) [54], Akaike’s Information Criterion (AIC) [3] and restricted maximum
likelihood (REML) [121, 228]. We describe the first three approaches here and the REML
approach in Section 2.3.3.
Cross-validation estimates $\lambda$ as the value that minimizes $\mathrm{CV}(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\{y_i - \widehat{f}_{\lambda,-i}(x_i)\}^2$,
where $\widehat{f}_{\lambda,-i}(x_i)$ is the estimator of $y_i$ based on the entire data, except $(y_i, x_i)$. It can be
shown that this formula can be simplified to
\[
\mathrm{CV}(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\left\{\frac{y_i - \widehat{f}_\lambda(x_i)}{1 - S_{\lambda,ii}}\right\}^2 \, ,
\]
where $\widehat{f}_\lambda(x_i)$ is the estimator of $y_i$ based on the entire data and $S_{\lambda,ii}$ is the $i$th diagonal
entry of the smoother matrix $\mathbf{S}_\lambda$. The advantage of this formula is that it requires only
one regression for each $\lambda$ using the entire data set. The original formulation would require
$n$ regressions, each with a different data set. Generalized cross validation (GCV) further
simplifies the $\mathrm{CV}(\lambda)$ formula by replacing $S_{\lambda,ii}$ by $\mathrm{tr}(\mathbf{S}_\lambda)/n$. With this replacement,
\[
\mathrm{GCV}(\lambda) = \frac{1}{n\{1 - \mathrm{tr}(\mathbf{S}_\lambda)/n\}^2}\sum_{i=1}^{n}\{y_i - \widehat{f}_\lambda(x_i)\}^2 \, .
\]
Here, the last term, $\log(n)$, can be ignored, as it is a constant. For large $n$, $n\log(1 - x/n) \approx -x$,
indicating that the second term in the $\log\{\mathrm{GCV}(\lambda)\}$ formula can be approximated by
$2\,\mathrm{tr}(\mathbf{S}_\lambda)/n$.
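A simple sketch of smoothing parameter selection by GCV over a grid of candidate values, using the quantities defined in the sketches above, is shown below.
#Sketch: select lambda by minimizing GCV over a grid
GCV <- function(lambda) {
  S <- X %*% solve(t(X) %*% X + lambda * D, t(X))
  f <- S %*% y
  sum((y - f)^2) / (length(y) * (1 - sum(diag(S)) / length(y))^2)
}
lambda_grid <- 10^seq(-3, 4, length.out = 50)
gcv_values <- sapply(lambda_grid, GCV)
lambda_gcv <- lambda_grid[which.min(gcv_values)]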
All these methods apply without any change to multivariate thin plate splines. For tensor
products of splines, things are a bit different. A standard, though not unique, strategy is to
parameterize $\beta_{k_1,k_2} = b_{k_1}c_{k_2}$, which transforms the function to
\[
f(x_{i1}, x_{i2}) = \Big\{\sum_{k_1=1}^{K_1} b_{k_1} B_{k_1,1}(x_{i1})\Big\}\Big\{\sum_{k_2=1}^{K_2} c_{k_2} B_{k_2,2}(x_{i2})\Big\} \, .
\]
In this case, the parameter vector is the $(K_1 + K_2)$ dimensional $\boldsymbol{\beta} = (b_1, \ldots, b_{K_1}, c_1, \ldots, c_{K_2})^t$ and
one penalty could be $\boldsymbol{\beta}^t\mathbf{D}\boldsymbol{\beta}$, where
\[
\mathbf{D} =
\begin{bmatrix}
\lambda_1\mathbf{D}_1 & \mathbf{0}_{K_1 \times K_2} \\
\mathbf{0}_{K_2 \times K_1} & \lambda_2\mathbf{D}_2
\end{bmatrix} .
\]
The penalties λ1 D1 and λ2 D2 are the univariate “row” and “column” penalties, respectively.
The advantage of this approach is that it combines two univariate smoothers into a bivariate
smoother. The disadvantage is that it requires two smoothing parameters, which increases
computational complexity.
As we have seen in Section 2.3.1.2, we could have multiple covariates that we want to
model nonparametrically. The data structure is $(y_i, z_i, x_{i1}, \ldots, x_{iQ})$ and we would like to fit
a model of the type
\[
y_i = \gamma_0 + \gamma_1 z_i + \sum_{q=1}^{Q} f_q(x_{iq}) + \epsilon_i \, ,
\]
where the $f_q(\cdot)$ are functions that are modeled as splines. As we have discussed, this spline
model can be written in matrix format as
\[
\mathbf{y} = \mathbf{Z}\boldsymbol{\gamma} + \sum_{q=1}^{Q}\mathbf{X}_q\boldsymbol{\beta}_q + \boldsymbol{\epsilon} \, ,
\]
where the $i$th row of $\mathbf{Z}$ is $(1, z_i)$, $\boldsymbol{\gamma} = (\gamma_0, \gamma_1)^t$, and $\mathbf{X}_q$ and $\boldsymbol{\beta}_q$ are the spline design matrix
and coefficients for the function $f_q(\cdot)$, respectively. Just as in the case of univariate penalized
spline smoothing, one can control the roughness of the functions $f_q(\cdot)$ by minimizing the
penalized criterion
\[
\Big|\Big|\mathbf{y} - \mathbf{Z}\boldsymbol{\gamma} - \sum_{q=1}^{Q}\mathbf{X}_q\boldsymbol{\beta}_q\Big|\Big|^2 + \sum_{q=1}^{Q}\lambda_q\boldsymbol{\beta}_q^t\mathbf{D}_q\boldsymbol{\beta}_q \, ,
\]
where $\lambda_q \geq 0$ is a scalar and $\mathbf{D}_q$ is the matrix that provides the penalty structure for the function
$f_q(\cdot)$. Note that the parameters $\boldsymbol{\gamma}$ are not penalized, but for consistency, one could add the
penalty term $\lambda_0\boldsymbol{\gamma}^t\mathbf{D}_0\boldsymbol{\gamma}$, where $\mathbf{D}_0$ is a matrix of zeros.
Of course, the parameter $\lambda_0$ is not identifiable, but the expression is useful for understanding
the effects of penalization or lack thereof on the optimization criterion. The model
expression can be made even more compact if we denote by $\mathbf{X} = [\mathbf{Z}|\mathbf{X}_1|\ldots|\mathbf{X}_Q]$ the matrix
obtained by column binding these matrices, by $\boldsymbol{\beta} = (\boldsymbol{\gamma}^t, \boldsymbol{\beta}_1^t, \ldots, \boldsymbol{\beta}_Q^t)^t$, and by
\[
\mathbf{D}_\lambda =
\begin{bmatrix}
\lambda_0\mathbf{D}_0 & \mathbf{0} & \mathbf{0} & \ldots & \mathbf{0} \\
\mathbf{0} & \lambda_1\mathbf{D}_1 & \mathbf{0} & \ldots & \mathbf{0} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\mathbf{0} & \ldots & \ldots & \mathbf{0} & \lambda_Q\mathbf{D}_Q
\end{bmatrix} ,
\]
where $\mathbf{0}$ is a generic matrix of zeros with the dimensions conforming to each entry. With
this notation the penalized criterion can be rewritten as
\[
||\mathbf{y} - \mathbf{X}\boldsymbol{\beta}||^2 + \boldsymbol{\beta}^t\mathbf{D}_\lambda\boldsymbol{\beta} \, .
\]
The only difference is that the penalty now depends on more than one smoothing
parameter λ = (λ0 , λ1 , . . . , λQ ), but all criteria for estimating the smoothing parameters
in the case of one smoothing parameter can be used to estimate the vector of smoothing
parameters.
The most elegant part of this is that we can seamlessly combine penalized and non-
penalized parameters in the same notation. It also becomes clear that parametric and
nonparametric components can be integrated and considered together. Adding bivariate
or multivariate smoothing follows the exact same principles discussed here.
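In practice, this type of model does not need to be assembled by hand; the mgcv package fits additive models with multiple penalized smooths and estimates the corresponding smoothing parameters automatically. The sketch below uses simulated data with one linear covariate and two smooth terms; all names are illustrative.
#Sketch: additive model with one linear term and two penalized smooths via mgcv
library(mgcv)
set.seed(3)
n <- 500
df_sim <- data.frame(z = rbinom(n, 1, 0.5), x1 = runif(n), x2 = runif(n))
df_sim$y <- 0.5 * df_sim$z + sin(2 * pi * df_sim$x1) +
            (df_sim$x2 - 0.5)^2 + rnorm(n, sd = 0.3)
fit_add <- gam(y ~ z + s(x1) + s(x2), method = "REML", data = df_sim)
summary(fit_add)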
Thus, for fixed $\sigma^2_\epsilon$ and $\sigma^2_q$, $q = 1, \ldots, Q$, the solution that maximizes (2.15) is the best linear unbiased
predictor in the model
\[
\begin{aligned}
\,[\mathbf{y} | \boldsymbol{\beta}, \sigma^2_\epsilon] &= N\Big(\mathbf{Z}\boldsymbol{\gamma} + \sum_{q=1}^{Q}\mathbf{X}_q\boldsymbol{\beta}_q, \; \sigma^2_\epsilon\mathbf{I}_n\Big) \, ; \\
\,[\boldsymbol{\beta}_q | \sigma^2_q] &= \frac{\det(\mathbf{D}_q)^{1/2}}{(2\pi)^{K_q/2}\sigma_q^{K_q}}\exp\Big(-\frac{\boldsymbol{\beta}_q^t\mathbf{D}_q\boldsymbol{\beta}_q}{2\sigma_q^2}\Big) \, , \quad \text{for } q = 1, \ldots, Q \, ,
\end{aligned}
\quad (2.16)
\]
where the notation $[\mathbf{y}|\mathbf{x}]$ denotes the conditional probability density function (pdf) of $\mathbf{y}$
given $\mathbf{x}$ and $K_q$ is the dimension of the square penalty matrix $\mathbf{D}_q$. This indicates that
the $\boldsymbol{\beta}_q$ can be viewed as random effects in a specific mixed effects model, where $\sigma^2_\epsilon$, $\sigma^2_q$ and,
implicitly, $\lambda_q = \sigma^2_\epsilon/\sigma^2_q$, for $q = 1, \ldots, Q$ can be estimated using the usual mixed effects model
FIGURE 2.6: Raw averages (black dots) and corresponding smooth averages of physical
activity data at every minute of the day in the NHANES study. Data are separated into two
groups: deceased (blue line) and alive (red line) individuals as of December 31, 2019. The
smooth averages were obtained by smoothing the raw averages and not the original data.
The x-axis is expressed as time from midnight to midnight and the y-axis is expressed in
MIMS.
where fgroup (sj ) is modeled using a penalized cyclic cubic regression spline with 30 knots
placed at the quantiles of the minutes. Because minutes are evenly spaced, the knots are
evenly spaced. The function gam in the mgcv package can be used to provide both fits. As
an example, consider the case when the average activity data for individuals who are alive
is contained in the variable nhanes_mean_alive at the times contained in the variable minute.
Both vectors are of length 1,440, the number of minutes in a day. The code below shows
how to fit the model and extract the smooth estimators. Notice that here we use REML to
select the scalar parameter λ by specifying method = "REML".
#Fit penalized cyclic cubic splines with 30 knots using the REML criterion
library(mgcv)
MIMS_sm_alive <- gam(nhanes_mean_alive ~ s(minute, bs = "cc", k = 30),
                     method = "REML")
#Obtain the smooth estimator
pred_sm_alive <- MIMS_sm_alive$fitted.values
FIGURE 2.7: Upper panels: scatter plots of age and BMI versus average MIMS across days
and within days. Each dot corresponds to a study participant. Lower panels: estimated
smooth associations (black solid lines) between age (left) and BMI (right) and average
MIMS. The 95% pointwise confidence intervals are shown in dark gray, while the 95%
correlation and multiplicity adjusted confidence intervals are shown in light gray.
where $f(\cdot)$ and $g(\cdot)$ are modeled using penalized splines. If the NHANES data set is
stored in the data frame nhanes_df, the R code for fitting this model is provided below.
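#Sketch: the outcome and covariate column names in nhanes_df (MIMS_mean, age, BMI)
#are assumed here for illustration
library(mgcv)
fit_mims <- gam(MIMS_mean ~ s(age, k = 30) + s(BMI, k = 30),
                method = "REML", data = nhanes_df)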
The syntax is similar to the structure used in the standard glm function, though the
nonparametric components are now specified via s(age, k = 30) and s(BMI, k = 30).
These indicate that both functions are modeled using thin-plate regression splines with 30
knots placed at the quantiles of the age and BMI variables, respectively. There are many ways
to extract estimates from the fitted object fit_mims. One way is to leverage the plot.gam
function, which extracts the mean and standard error for each predictor. The code to obtain
estimates for the age effect is shown below.
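#Extract the estimated smooth of age from the fitted object
#(select = 0 returns the plot data without drawing the plot, as described below)
plot_fit <- plot(fit_mims, select = 0)
age_grid <- plot_fit[[1]]$x               #functional arguments for the age smooth
age_est <- plot_fit[[1]]$fit              #estimated s(age, k = 30) at age_grid
age_lower <- age_est - plot_fit[[1]]$se   #95% pointwise CI; se is already 2 x standard error
age_upper <- age_est + plot_fit[[1]]$se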
Here we specify select = 0 in the plot.gam function, so that it will not actually
return the plot. The estimated s(age, k = 30) function at the functional arguments
plot_fit[[1]]$x is stored as plot_fit[[1]]$fit. The 95% pointwise confidence interval
can be constructed by adding and subtracting plot_fit[[1]]$se, which contains the
standard error of the mean already multiplied by 2. For BMI, the same procedure is applied
but plot_fit[[1]] is replaced by plot_fit[[2]].
The estimated effects of age and BMI on MIMS are shown in the lower panels of Fig-
ure 2.7. The estimated smooth associations are shown using black solid lines, with dark
gray region representing 95% pointwise confidence intervals. In addition, the light gray re-
gion represents 95% correlation and multiplicity adjusted (CMA) confidence intervals; for
more details see Section 2.4. Results indicate that average physical activity increases slightly
between ages 20 and 40, with a strong decline after age 60. Physical activity is also higher
for BMI between 22 and 27 with lower levels both for lower and higher BMI levels. If the
outcome is binomial or Poisson, we can simply add family = binomial() or family =
poisson() to the gam function.
\[
y_i = f(x_i) + \epsilon_i = \mathbf{X}_i^t\boldsymbol{\beta} + \epsilon_i \, , \quad (2.17)
\]
for $i = 1, \ldots, n$, where $\mathbf{X}_i^t = \{B_1(x_i), \ldots, B_K(x_i)\}$ is the vector of basis functions evaluated
at $x_i$ and $\epsilon_i$ are independent random variables that are typically assumed to have a $N(0, \sigma^2_\epsilon)$
distribution. If the model is enriched with additional covariates, nonlinear functions of other
covariates, or multivariate functions, the row vector $\mathbf{X}_i^t$ becomes more complex, but the
structure of the problem remains unchanged.
Suppose that we are interested in constructing confidence intervals for the n × 1 di-
mensional vector $\mathbf{f} = \{f(x_1), \ldots, f(x_n)\}^t$, where, for notational simplicity, we considered the
same grid of points x1 , . . . , xn as that of the observed data. This need not be the case, and
the same exact procedure would apply to any grid of points.
As we have shown in Section 2.3, model (2.17) can be fit using either unpenalized or
penalized likelihoods, and both provide a point estimator $\widehat{\boldsymbol{\beta}}$ and an estimator $\widehat{\mathbf{V}}_{\widehat{\beta}} = \widehat{\mathrm{Var}}(\widehat{\boldsymbol{\beta}})$
of the $K \times K$ dimensional variance-covariance matrix. The point estimator of $f(x_i)$ is thus
$\mathbf{X}_i^t\widehat{\boldsymbol{\beta}}$ and the point estimator $\widehat{\mathbf{f}}$ of $\mathbf{f}$ can be written as
\[
\widehat{\mathbf{f}} = \mathbf{X}\widehat{\boldsymbol{\beta}} \, , \quad (2.18)
\]
where $\mathbf{X}$ is the $n \times K$ dimensional matrix with $\mathbf{X}_i^t$ as row $i$. For example, in the simplest,
unpenalized regression case, $\widehat{\mathbf{V}}_{\widehat{\beta}} = \widehat{\sigma}^2_\epsilon(\mathbf{X}^t\mathbf{X})^{-1}$ and $\widehat{\mathbf{V}}_{\widehat{f}} = \widehat{\mathrm{Var}}(\widehat{\mathbf{f}}) = \widehat{\sigma}^2_\epsilon\mathbf{X}(\mathbf{X}^t\mathbf{X})^{-1}\mathbf{X}^t$. For
the case of penalized regression, mixed effects models are used to produce similar variance
estimators for $\widehat{\boldsymbol{\beta}}$ and $\widehat{\mathbf{f}}$.
Under the assumption of joint normality of $\widehat{\boldsymbol{\beta}}$, $100(1 - \alpha)\%$ confidence intervals that are
not adjusted for correlation or multiplicity can be obtained for $\boldsymbol{\beta}$ as
\[
\widehat{\boldsymbol{\beta}} \pm z_{1-\alpha/2}\sqrt{\mathrm{diag}(\widehat{\mathbf{V}}_{\widehat{\beta}})} \, ,
\]
and for $\mathbf{f}$ as
\[
\widehat{\mathbf{f}} \pm z_{1-\alpha/2}\sqrt{\mathrm{diag}(\widehat{\mathbf{V}}_{\widehat{f}})} \, .
\]
Here $z_{1-\alpha/2}$ is the $1 - \alpha/2$ quantile of the $N(0, 1)$ distribution and $\mathrm{diag}(\mathbf{A})$ is the vector of
diagonal entries of the symmetric matrix $\mathbf{A}$.
These confidence intervals for $\mathbf{f}$ are not adjusted for correlation and multiplicity.
Many books and manuscripts refer to these intervals as “pointwise” confidence intervals to
differentiate them from the “joint” confidence intervals that take into account the correlation
and multiplicity of the tests. Here we prefer the terms “unadjusted” instead of “pointwise”
and “correlation and multiplicity adjusted (CMA)” instead of “joint” confidence intervals.
Note that the CMA confidence intervals are still calculated at every point, so referring to
them as “joint” while reporting them pointwise could be confusing.
Some drawbacks of the unadjusted confidence intervals are that they (1) do not account
for the correlation among tests, which can be quite large given the inherent correlation of
functional data; (2) do not address the problem of testing multiplicity (whether or not the
confidence interval crosses zero at every time point along a function); and (3) cannot be
used directly to conduct tests of significance. To address these problems, we describe three
complementary procedures that can be used to construct α-level correlation and multiplicity
adjusted (CMA) confidence intervals that account for correlation. We consider the case
of univariate smoothing, but the same ideas apply more generally to semiparametric and
multivariate smoothing.
\[
P\{-q(\mathbf{C}_{\widehat{f}}, 1 - \alpha) \times \mathbf{e} \leq \mathbf{X} \leq q(\mathbf{C}_{\widehat{f}}, 1 - \alpha) \times \mathbf{e}\} = 1 - \alpha \, ,
\]
where $\mathbf{e} = (1, \ldots, 1)^t$ is the $n \times 1$ dimensional vector of ones and $\mathbf{X} = (\widehat{\mathbf{f}} - \mathbf{f})/\widehat{\mathbf{D}}_{\widehat{f}} \approx N(\mathbf{0}_n, \mathbf{C}_{\widehat{f}})$,
we can obtain a CMA $(1 - \alpha)$ level confidence interval for $\mathbf{f}$ as
\[
\widehat{\mathbf{f}} \pm q(\mathbf{C}_{\widehat{f}}, 1 - \alpha) \times \widehat{\mathbf{D}}_{\widehat{f}} \, .
\]
Luckily, the function qmvnorm in the R package mvtnorm [96, 97] does exactly that. To see
that, assume that $\widehat{\mathbf{f}}$ is stored in fhat and $\widehat{\mathbf{V}}_{\widehat{f}}$ is stored in Vf. The code below describes how
to obtain the correlation matrix Cf, the $1 - \alpha$ quantile $q(\mathbf{C}_{\widehat{f}}, 1 - \alpha)$ of the joint distribution,
and the lower and upper bounds of the correlation and multiplicity adjusted (CMA)
confidence interval.
#Code for calculating the joint CI for the mean of a multivariate Gaussian
library(mvtnorm)
#Calculate the standard error along the diagonal
Df <- sqrt(diag(Vf))
#Calculate the correlation matrix
Cf <- cov2cor(Vf)
#Obtain the critical value for the joint confidence interval
qCf_alpha <- qmvnorm(1 - alpha, corr = Cf, tail = "both.tails")$quantile
#Obtain the upper and lower bounds of the joint CI
uCI_joint <- fhat + qCf_alpha * Df
lCI_joint <- fhat - qCf_alpha * Df
To build intuition about the differences between unadjusted and CMA confidence in-
tervals we will investigate the effect of number of tests and correlation between tests on
the critical values used for building confidence intervals. We consider the case when n tests
are conducted and the correlation matrix between tests has an exchangeable structure with
correlation ρ between any two tests. Thus, the correlation matrix has the following structure
\[
\mathbf{C}_{\widehat{f}} =
\begin{bmatrix}
1 & \rho & \rho & \ldots & \rho \\
\rho & 1 & \rho & \ldots & \rho \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho & \rho & \rho & \ldots & 1
\end{bmatrix} .
\]
Note that unadjusted two-sided confidence intervals would use a critical value of
$z_{1-0.05/2} = 1.96$, which is the 0.975 quantile of the $N(0, 1)$ distribution. Using a Bonferroni
correction for $n$ independent tests would replace this critical value with $z_{1-0.05/(2n)}$.
The first column in Table 2.2 (corresponding to correlation $\rho = 0$) provides the Bonferroni
correction for $n = 10$ (2.80), $n = 100$ (3.48), and $n = 200$ (3.66) tests. These critical
values provide an upper bound on the correlation and multiplicity adjusted (CMA) critical
values. The second, third, and fourth columns correspond to increasing correlations between
tests, $\rho = 0.25, 0.50, 0.75$, respectively. Note that for small correlations ($\rho = 0.25$)
there is little difference between the Bonferroni correction and the CMA critical values.
For example, when $n = 100$ and $\rho = 0.25$ the CMA critical value is 3.43 versus
the Bonferroni correction 3.48, a mere 1.5% difference. However, when the correlation is
moderate or large, the differences are more sizeable. For example, when $n = 100$ and
$\rho = 0.75$, the CMA critical value is 3.01 compared to 3.48, which corresponds to a 13.5%
difference.
TABLE 2.2
Critical values for correlation and multiplicity adjusted α = 0.05 level tests
with an exchangeable correlation structure.

                     Correlation (ρ)
# tests (n)    0.00    0.25    0.50    0.75
n = 10         2.80    2.78    2.72    2.57
n = 100        3.48    3.43    3.30    3.01
n = 200        3.66    3.61    3.44    3.12
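To make the adjustment concrete, the short sketch below recomputes critical values of the type reported in Table 2.2 using qmvnorm from the mvtnorm package with an exchangeable correlation matrix. The wrapper function cma_critical_value is our own naming, and the quantile is obtained by (quasi) Monte Carlo integration, so reproduced values may differ slightly from run to run.

#Sketch: recompute CMA critical values for an exchangeable correlation structure
library(mvtnorm)
cma_critical_value <- function(n, rho, alpha = 0.05) {
  #Exchangeable correlation matrix: 1 on the diagonal, rho off the diagonal
  Cf <- matrix(rho, nrow = n, ncol = n)
  diag(Cf) <- 1
  #Two-sided 1 - alpha quantile of the joint distribution of N(0, Cf)
  qmvnorm(1 - alpha, corr = Cf, tail = "both.tails")$quantile
}
#Example: n = 100 correlated tests with rho = 0.5 (compare with Table 2.2)
cma_critical_value(100, 0.5)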
In functional data, as observations are sampled closer together, the tests become more correlated,
which in turn leads to larger differences between the Bonferroni and CMA critical values.
Indeed, the Bonferroni correction uses z_{1−0.05/(2n)}, which slowly diverges to infinity as
n → ∞. That means that if we do not account for correlation, the joint confidence intervals
become arbitrarily wide and are quite useless. In contrast, as observations are
sampled more finely, the correlation between observations also increases, keeping the CMA
critical values meaningful.
There is some lack of clarity in the literature in terms of exactly what needs to be
correlated for the correlation to have an effect on the CMA critical values and confidence
intervals. A closer look at equation (2.20) shows that the correlation matrix C_f̂ is the
correlation of the residuals f̂ − f and not of the true underlying function f. Thus, correlation
affects the joint confidence intervals not when the observed data are correlated, but when
the residuals are correlated. This statement requires reflection as it is neither intuitive nor
universally understood.
Figure 2.7 displays the 95% CMA confidence intervals (shown in light gray) for the
smooth estimators of the associations between average PA and age and BMI. These confi-
dence intervals overlay the unadjusted 95% confidence intervals (shown in darker gray). In
this example, we used the procedure described in this section.
additional n operations for a total of 2Kn + K operations. This can be quite large when n is
large, but it is linear in n and works in many cases when the exact approach in Section 2.4.1
cannot be used.
To calculate this diagonal, we do not actually have to obtain the entire n × n dimensional
matrix V_f̂. Indeed, the n × 1 dimensional vector diag(V_f̂) has the ith entry equal to

(1/B) Σ_{b=1}^{B} {f̂^b(x_i) − f̄(x_i)}^2 ,

where f̂^b(x_i) is the estimator of f(x_i) obtained from the bth of B bootstrap (or simulated)
samples and f̄(x_i) is the average of these estimators.
This approach does not use the normality assumption of the vectors f b , but requires a
procedure that can produce these predictors. In this book, we focus primarily on nonpara-
metric bootstrap of units that are assumed to be independent to obtain these estimators.
Drawbacks of this approach include the need to fit the models multiple times, the reduced
precision of estimating low-probability quantiles, and unknown performance when confi-
dence intervals are asymmetric.
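A minimal sketch of this calculation is shown below, assuming that the bootstrap (or simulated) estimators have already been stored in a hypothetical B × n matrix fhat_boot whose bth row contains f̂^b(x_1), . . . , f̂^b(x_n); only the n × 1 vector of pointwise variances is formed, never the full n × n matrix.

#Sketch: pointwise variance of the estimator from B bootstrap samples
#(fhat_boot is a hypothetical B x n matrix of bootstrap estimators)
B <- nrow(fhat_boot)
#Average bootstrap estimator at each grid point
fbar <- colMeans(fhat_boot)
#ith entry: (1/B) * sum_b {fhat^b(x_i) - fbar(x_i)}^2
diag_Vf <- colSums(sweep(fhat_boot, 2, fbar)^2) / B
#Pointwise standard errors used in the confidence intervals
Df <- sqrt(diag_Vf)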
The unadjusted pointwise p-value for testing whether f(x_i) = 0 is

p(x_i) = 2{1 − Φ(|f̂(x_i)|/D_{f̂,i})} = 2P(X ≥ |f̂(x_i)|/D_{f̂,i}) ,

where X ∼ N(0, 1) and Φ(·) : R → [0, 1] is the cumulative distribution function of the N(0, 1) distribution. It
can be easily verified that this p-value is the minimum value of α such that the 1 − α level
confidence interval f̂(x_i) ± z_{1−α/2} D_{f̂,i} does not contain zero (the null hypothesis is rejected at
level α).
This is Statistics 101 and one could easily stop here. However, when we focus on a
function we conduct not one, but n tests, and these tests are correlated. In Sections 2.4.1,
2.4.2, and 2.4.3 we have shown that we can construct a correlation and multiplicity ad-
justed (CMA) confidence interval f̂(x_i) ± q(C_f̂, 1 − α) D_{f̂,i}. Just as in the case of the nor-
mal unadjusted pointwise confidence intervals, these confidence intervals can be calculated
for any value of α. For every x_i we can find the smallest value of α for which the confi-
dence interval does not include zero. We denote this probability by p_pCMA(x_i) and refer
to it as the pointwise correlation and multiplicity adjusted (pointwise CMA) p-value. As
q(C_f̂, 1 − α) > z_{1−α/2}, the pointwise CMA confidence intervals are wider than the point-
wise unadjusted confidence intervals. Therefore, the pointwise CMA p-values will be larger
than the pointwise unadjusted p-values and fewer observations will be deemed “statistically
significantly different from zero.” However, the tests will preserve the family-wise error rate
(FWER) while accounting for test correlations.
Similarly, we define the global correlation and multiplicity adjusted (global
CMA) p-value as the smallest α level at which at least one confidence interval f̂(x_i) ±
q(C_f̂, 1 − α) D_{f̂,i}, for i = 1, . . . , n, does not contain zero. If we denote this p-value by
p_gCMA(x_1, . . . , x_n), it can be shown that

p_gCMA(x_1, . . . , x_n) = min_{i=1,...,n} p_pCMA(x_i) ,

because the null hypothesis is rejected if it is rejected at any point in the domain of f(·).
The advantage of using these p-values over the unadjusted p-values is that the tests
preserve their nominal level. The pointwise tests are focused on testing whether a particular
value of the function is zero, whereas the global tests focus on testing whether the entire
function is zero simultaneously.
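A minimal sketch of how these p-values could be computed is given below; it reuses the objects fhat, Df, and Cf introduced earlier in this section, relies on pmvnorm from the mvtnorm package, and can be slow when the function is evaluated on a fine grid because each pointwise p-value requires an n-dimensional multivariate normal probability.

#Sketch: pointwise and global CMA p-values based on fhat, Df, and Cf
library(mvtnorm)
n <- length(fhat)
#Standardized absolute statistics at every grid point
u <- abs(fhat) / Df
#Pointwise CMA p-value: smallest alpha at which the CMA interval excludes zero,
#that is, 1 - P(-u_i <= X <= u_i componentwise) for X ~ N(0, Cf)
p_pCMA <- sapply(u, function(ui) {
  1 - pmvnorm(lower = rep(-ui, n), upper = rep(ui, n), corr = Cf)[1]
})
#Global CMA p-value: reject if the null is rejected at any point
p_gCMA <- min(p_pCMA)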
We will use the techniques described in this section throughout the book to obtain
CMA confidence intervals for high-dimensional functional parameters and conduct inference,
including tests of associations that account for the correlation of functional data and test
multiplicity.
Throughout this section we have used the nomenclature correlation and multiplicity
adjusted (CMA) to refer both to p-values and confidence intervals. This is the first time
this nomenclature is used and we hope that it will be adopted. The main reason for this
nomenclature is that it is precise and states explicitly that adjustments are for correlation
and multiplicity.
FIGURE 2.8: Simulated data (black curves) from the model (2.21), where the vectors v1
and v2 are orthonormal. Two individual trajectories are highlighted in blue and red.
for j = 1, . . . , p + 1, where b_1i ∼ N(0, σ_1^2), b_2i ∼ N(0, σ_2^2), and ε_ij ∼ N(0, σ_ε^2) are mutually
independent. Here v_1(j) and v_2(j) are the jth entries of the vectors v_1 = x_1/||x_1|| and
v_2 = x_2/||x_2||, respectively, where ||x|| denotes the L2 norm of the vector x, x_1(j) = 1, and
x_2(j) = (j − 1)/p − 1/2, for j = 1, . . . , p + 1. This ensures that the vectors v_1 and v_2 are
orthonormal. We use p = 100 and generate n = 150 samples with σ_1^2 = 4, σ_2^2 = 1, and
σ_ε^2 = 1. Figure 2.8 displays the 150 simulated functions (black curves) and highlights two
trajectories in blue and red, respectively.
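A minimal simulation sketch consistent with this description is shown below; the seed and object names are our own choices, not the book's code.

#Sketch: simulate n = 150 curves from the rank 2 SVD model with noise in (2.21)
set.seed(2021)
n <- 150; p <- 100
#Orthonormal vectors v1 and v2 obtained by normalizing x1 and x2
x1 <- rep(1, p + 1)
x2 <- (0:p) / p - 1 / 2
v1 <- x1 / sqrt(sum(x1^2))
v2 <- x2 / sqrt(sum(x2^2))
#Subject-specific scores and independent noise
b1 <- rnorm(n, sd = 2)                         #sigma_1^2 = 4
b2 <- rnorm(n, sd = 1)                         #sigma_2^2 = 1
eps <- matrix(rnorm(n * (p + 1)), n, p + 1)    #sigma_eps^2 = 1
#n x (p + 1) data matrix; each row is one simulated function
W <- b1 %o% v1 + b2 %o% v2 + eps
#Right singular vectors of the noisy data estimate v1 and v2
sv <- svd(W)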
Model (2.21) is a particular case of the SVD, but it contains the additional noise term, ε_ij ∼
N(0, σ_ε^2). We investigate what happens if we apply an SVD decomposition to the data
generated from a rank 2 SVD model with noise. Figure 2.9 displays the first two estimated
right singular vectors. The first right singular vector is shown as light coral, while the true
component (v_1) is shown in dark blue. The second right singular vector is shown as dark
coral, while the true component (v_2) is shown as light blue.

FIGURE 2.9: The first two estimated right singular vectors of the n = 150 functions sim-
ulated from model (2.21) using SVD. The first right singular vector is shown as light coral
while the true component is shown in dark blue. The second right singular vector is shown
as dark coral while the true component is shown as light blue.
vectors exhibit substantial variability around the true components. This increases their
complexity (roughness) and reduces their interpretability. In this context, applying a linear
smoother either to the data or the right singular vectors addresses most of the observed
problems.
Figure 2.10 displays the proportion of variance explained by the estimated right singular
vectors. The slow decrease in variance explained is a tell-tale sign of noise contamination of the
observations. Similar behavior can be observed even if the sample size increases. As useful as
SVD and PCA are, it is undeniable that they are difficult to explain to collaborators and to
translate into actionable information. When they are contaminated by noise, the problem
becomes much harder. Thus, the idea of smoothing the data, the covariance operator, and/or
the right singular vectors appears naturally in this context. In the next section we will discuss
exactly how to do this smoothing and explain the intrinsic connection between these options.
FIGURE 2.10: Percent variance explained by the right singular vectors when the true model
is a rank 2 SVD with noise. Results are for the simulated data (black curves) from the
model (2.21), where the vectors v1 and v2 are orthonormal.
WS = UΣ(V^t S) ,    (2.22)

where W = UΣV^t is the SVD of W. This indicates that each row of V^t (column of V) is
smoothed with the same smoother matrix as the rows of W. This shows that (1) smoothing
the data is equivalent to smoothing the right singular vectors, and (2) the smoothing matrix
is the same for the data and right singular vectors. Thus, smoothing one or another is a
stylistic rather than a consequential choice.
Another approach is to smooth the covariance operator W^t W. Replacing the data ma-
trix W with WS results in the following class of “sandwich covariance estimators” [328, 331]

K̂_S = S^t W^t W S .

Irrespective of the smoother matrix, S, these estimators are positive semi-definite and sym-
metric. That is, the estimators are guaranteed to be covariance matrices.
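The sketch below illustrates the sandwich form with a simple moving-average smoother matrix S; it is only meant to show that any such estimator is symmetric and positive semi-definite, not to reproduce the penalized smoothers used by fpca.face. It reuses the simulated matrix W from the sketch above (any n × p data matrix would do).

#Sketch: sandwich covariance smoothing K_S = S' W' W S with a moving-average S
Wc <- scale(W, center = TRUE, scale = FALSE)   #column-center the data
p <- ncol(Wc)
h <- 5                                         #half-width of the smoothing window
S <- matrix(0, p, p)
for (j in 1:p) {
  nb <- max(1, j - h):min(p, j + h)
  S[j, nb] <- 1 / length(nb)
}
#Sandwich estimator (up to a 1/n scaling); symmetric and PSD by construction
K_S <- t(S) %*% crossprod(Wc) %*% S / nrow(Wc)
max(abs(K_S - t(K_S)))                                          #numerically zero
min(eigen(K_S, symmetric = TRUE, only.values = TRUE)$values)    #>= 0 up to rounding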
This approach works with any type of smoother, including parametric or penalized
smoothers. Choosing the smoothing parameter is slightly different from the univariate con-
text described in Section 2.3. The residuals after smoothing can be written in matrix format
as W − WSλ . Therefore, the sum of squares for error is ||W − WSλ ||2F , where ||A||2F is the
sum of squares of the entries of the matrix A (square of the Frobenius norm). With this
change, the smoothing parameter λ can be estimated using the same smoothing criteria
introduced for univariate smoothing. For example, [331] uses the pooled GCV adapted for
matrix operators, an approach that scales up to high-dimensional functional data. This is
implemented in the fpca.face function in the refund package. The fpca.sc function in
refund uses the tensor-product bivariate penalized splines. This approach does not scale
up as well as the fpca.face approach, but provides an automatic smoothing alternative.
Some estimators replace r_i(s_ij1) r_i(s_ij2) with K̂_W(s_ij1, s_ij2), where K̂_W(·, ·) is obtained by
smoothing {(s_ij1, s_ij2), r_i(s_ij1) r_i(s_ij2)} for all i = 1, . . . , n and all j_1, j_2 = 1, . . . , p_i. In
the original paper [283] kernel bivariate smoothers were proposed, though local polyno-
mial regression and smoothing splines were also suggested as alternatives. Local polynomial
smoothing was proposed by [334], where smoothing parameters were estimated by leave-
one-subject-out cross-validation. To date, automatic implementation of this approach has
proven to be computationally expensive. Bivariate penalized splines regression was proposed
by [62], where the smoothing parameter is estimated using a number of popular criteria.
The default choice in the fpca.sc function in the R package refund is the tensor-product of
bivariate penalized splines, but other choices are available. Finally, a fast penalized splines
approach using leave-one-subject-out cross-validation was proposed by [328]. The function
face.sparse in the face package [329] in R provides this implementation. We will rely on
this function for sparse FPCA because it is fast, stable and uses automatic smoothing.
By default, the smoothing parameter is chosen using optim or a grid search. However,
the smoothing parameter can also be manually specified using the lambda argument. In
the code above, we use 0.01 as the smoothing parameter by specifying lambda = 0.01. The
eigenvalues and eigenfunctions are stored as evalues and efunctions elements in the fitted
object. Therefore, the smooth covariance can be calculated as follows.
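A plausible version of the calculation described above is sketched below, assuming the NHANES minute-level activity profiles are stored in an n × 1,440 matrix W; the reconstruction of the covariance from the eigenvalues and eigenfunctions is shown up to the grid-spacing normalization, which cancels when passing to correlations.

#Sketch: covariance smoothing via fpca.face with a fixed smoothing parameter
library(refund)
fit <- fpca.face(W, lambda = 0.01)
#Smooth covariance: K(s1, s2) = sum_k evalues_k * efunctions_k(s1) * efunctions_k(s2)
cov_smooth <- fit$efunctions %*% diag(fit$evalues, nrow = length(fit$evalues)) %*%
  t(fit$efunctions)
#Corresponding smooth correlation function
cor_smooth <- cov2cor(cov_smooth)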
Figure 2.11 displays the raw covariance function (upper-left panel) and the smooth
estimator of the covariance functions using fpca.face function with different smoothing
parameters, including when λ = 0.01 (upper-right), λ = 1 (bottom-left), and λ = 100
(bottom-right). Both the x-axis and y-axis represent the time of day from midnight to mid-
night. Higher covariance values are colored in red while lower covariance values are colored
Key Methodological Concepts 59
FIGURE 2.11: Estimated raw covariance (upper-left) and smooth covariance functions for
the NHANES accelerometry data using the fpca.face function with different smoothing
parameters shown in the title of each panel.
in blue. In the raw covariance, the highest values correspond to the morning period between
6 AM and 11 AM and the evening period between 5 PM and 9 PM. This result suggests that
the physical activity intensity tends to have higher variability in the morning and evening.
As the smoothing parameter increases, we observe a smoother covariance function across
the domain and the highest covariance value becomes smaller. The color coding was kept
identical in all panels to make the comparison among various covariance estimators possible.
To obtain the correlation function after covariance smoothing, one way is to use the
cov2cor function in the stats package. Figure 2.12 displays the correlation plots corre-
sponding to the covariance plots displayed in Figure 2.11. The top-left panel corresponds to
un-smoothed correlation, indicating mostly positive estimators (shades of red) with stronger
correlations around the main diagonal. This is due to the fact that physical activity tends
to remain high or low for short periods of time.

FIGURE 2.12: Estimated raw correlation (upper-left) and smooth correlation functions for
the NHANES accelerometry data using the fpca.face function with different smoothing
parameters shown in the title of each panel.

However, some light shades of blue can be observed farther away from the diagonal, which
could indicate that individuals who tend
to be more active during the day may be less active during the night. As the amount of
smoothing increases, correlations become stronger and extend to pairs of time points that are
farther apart; note the changes toward darker shades of red in wider bands around the main
diagonal as the smoothing parameter, λ, increases. The off-diagonal negative correlations
(shades of blue) become more localized and clearly defined.
The first line provides the first log CD4 count for study participant 1, which was recorded
on month 9 before sero-conversion. Recall that 0 indicates the time of sero-conversion, while
positive times are months from sero-conversion, and negative times are months before sero-
conversion. The first study participant has three observations at 9 and 3 months before sero-
conversion and at 3 months after. The second study participant has their first observation
3 months before sero-conversion. In this format, the data are roughly 4 times smaller, as they
are recorded in a 1,877 × 3 dimensional matrix compared to the original format, which is a
366 × 61 dimensional matrix.
Once data are in this format, the powerful face.sparse function in the R package face
[329] can be used. It uses automatic penalized splines smoothing and fits the data in less
than 10 seconds on a standard laptop.
The function produces estimators of the mean and covariance as well as predictors for
each study participant at each of the time points in the vector argvals.new. The estimated
smooth mean, covariance, and correlations can be extracted as described below.
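A minimal sketch of this step is shown below, assuming the long-format CD4 data are stored in a data frame cd4_long with the columns y, argvals, and subj expected by face.sparse; the grid of new time points is our own choice and the element names follow the face package documentation.

#Sketch: smooth mean, covariance, and correlation for the sparse CD4 data
library(face)
tnew <- seq(-18, 42, by = 1)                   #months relative to sero-conversion
fit_face <- face.sparse(data = cd4_long, argvals.new = tnew)
mu_hat  <- fit_face$mu.new                     #estimated smooth mean on tnew
cov_hat <- fit_face$Chat.new                   #estimated smooth covariance
cor_hat <- fit_face$Cor.new                    #estimated smooth correlation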
The next step is to obtain all the cross products of residuals within the same study
participant. Because there are 3 observations, there are 3^2 = 9 combinations of these
observations. The data containing the cross products of residuals has the structure shown
below. Consider the first row, which corresponds to the cross product r_1(−9) r_1(−9) =
(−0.546)^2 = 0.298. The corresponding value of the variable sj1 is −9 and of the vari-
able sj2 is −9. This indicates that the cross product corresponds to (s_1j1, s_1j2), where
j_1 = j_2 = 1 and s_1j1 = s_1j2 = −9. Similarly, the second row corresponds to the cross prod-
uct r_1(−9) r_1(−3) = (−0.546) × (0.016) = −0.009. The corresponding value of the variable
sj1 is −9 and of the variable sj2 is −3. This indicates that the cross product corresponds to
(s_1j1, s_1j2), where j_1 = 1, j_2 = 2, s_1j1 = −9, and s_1j2 = −3. If a vector ri contains the
residuals for study participant i, then the vector of cross products, Kwi, and of time points
can be obtained as follows.
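A minimal sketch for one study participant is shown below, where ri and si are the (hypothetical) vectors of residuals and observation times for participant i; the participant-level matrices can then be stacked with rbind to form Kw.

#Sketch: cross products of residuals and time points for one participant
m <- length(ri)
Kwi <- data.frame(
  Kw  = as.vector(ri %o% ri),      #r_i(s_j1) * r_i(s_j2) for all pairs (j1, j2)
  sj1 = rep(si, each  = m),        #first time point of the pair
  sj2 = rep(si, times = m)         #second time point of the pair
)
#Stacking across participants: Kw <- do.call(rbind, list_of_Kwi)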
Once these data are obtained for each study participant, the participant-level matrices
of cross products of residuals (Kwi) are bound together by rows. The final matrix of cross products
is 11,463 × 3 dimensional and is stored in Kw. It has many more rows than the 1,877 observed
data points because we consider all mutual products of residuals within study participants
(for a total of 11,463). Figure 2.13 displays the location of each pair of observations (s_ij1, s_ij2) (column
(1,877). Figure 2.13 displays the location of each pair of observations (sij1 , sij2 ) (column
sj1 versus column sj2). Points are jittered both on the x and y axis by up to three days.
The color of each point depends on the size of the absolute value of the product of residuals,
|ri (sij1 )ri (sij2 )|, (the absolute value of the Kw). Darker colors correspond to smaller cross
products. The most surprising part of the plot is that it is quite difficult to see a pattern.
One may identify higher residual products in the upper-right corner of the plot (higher
density of yellow dots), but the pattern is far from obvious. There are two main reasons
for that: (1) the products of residuals are noisy as they are not averaged over all pairs of
observations; and (2) there is substantial within-person variability. This is a good example
of the limit of what can be done via visual exploration and of the need for smoothing.
Figure 2.14 displays the smooth estimator of the covariance function using the
face.sparse function. Both the x- and y-axis represent time from sero-conversion and
the matrix is symmetric. The surface has larger values towards the upper-right corner in-
dicating increased variability with the time after sero-conversion. The increase is gradual
over time and achieves its maximum 40 months after sero-conversion. A small increase in
variability is also apparent in the lower-left corner of the plot, though the magnitude is
smaller than in the upper-right corner. Covariances are smaller for pairs of observations
taken before and after sero-conversion (note the blue strips in the left and lower parts of
the plot). These are effects of variability and not of correlation. Indeed, Figure 2.15 displays
the smooth correlation function for the CD4 counts data using the face.sparse function. The
yellow strip indicates that the estimated correlations are very high for a difference of about
10 months between observations in the first 30 months of the study (months −20 to 10).

FIGURE 2.13: Product of residuals, r_i(s_ij1) r_i(s_ij2), after subtracting a smooth mean esti-
mator for every pair of months (j_1, j_2) where data were collected for study participant i.
Coloring is done on the absolute value |r_i(s_ij1) r_i(s_ij2)|.

FIGURE 2.14: Estimated smooth covariance function for the CD4 counts data using the
face.sparse function on data shown in Figure 2.13.

FIGURE 2.15: Estimated smooth correlation function for the CD4 counts data using the
face.sparse function on data shown in Figure 2.13.
For the last 30 months of the study, the high correlations extend to 15 and even 20 months
(note the yellow band getting wider in the left-upper corner). So, as time passes, variabil-
ity of observations around the true mean increases, but the within-person correlations also
increase. This is likely due to the overall decline in CD4 counts.
3
Functional Principal Components Analysis
dimension reduction on the correlation operator. We will show that there is a better way of
accounting for these assumptions and, especially, for noise.
The conceptual framework for functional PCA starts by assuming that the observed
data is
W_i(s_j) = X_i(s_j) + ε_ij ,    (3.1)

where ε_ij ∼ N(0, σ_ε^2) are independent identically distributed random variables. Here W_i(s_j)
are the observed data, X_i(s_j) are the true, unobserved signals, and ε_ij are the independent
noise variables that corrupt the signal to give rise to the observed data. The fundamental
idea of functional PCA (FPCA) is to assume that X_i(·) can be decomposed using a low-
dimensional orthonormal basis. If such a basis exists and is denoted by φ_k(s), for k =
1, . . . , K, then model (3.1) can be rewritten as

W_i(s_j) = Σ_{k=1}^{K} ξ_ik φ_k(s_j) + ε_ij ,    (3.2)
which is a standard regression with orthonormal covariates. Of course, there is the problem
of estimating the functions φk (·) and dealing with the potential variability in these esti-
mators [104]. However, the assumptions of smoothness and continuity can be imposed by
making assumptions about the underlying signals, Xi (·), or equivalently, about the basis
functions φ_k(·), k = 1, . . . , K. This can be done even though the noise, ε_ij, may induce
substantial variability in the observed data, Wi (·). Ordering, self-consistency, and colocal-
ization are implied characteristics of functional data and will be used to analyze data that
can reasonably be fit using model (3.1). The model is especially useful in practice when the
number of basis functions in (3.2), K, is relatively small, which can substantially reduce the
problem complexity.
Our notation in (3.2) connects FPCA with the discussion of Gaussian Processes in
Section 2.2 and to covariance smoothing in Section 2.5. Specifically, the basis φk (s),
for k = 1, . . . , K is obtained through an eigendecomposition of the covariance surface
KW (s1 , s2 ) = Cov{W (s1 ), W (s2 )} after it has been smoothed to remove the effect of noise
and to encourage similarity across adjacent points on the functional domain. Throughout
this chapter we will focus on the expression in (3.2) to emphasize the process that gen-
erates observed data, but will refer regularly to the underlying covariance surface and its
decomposition.
where K = 4, ξ_ik ∼ N(0, λ_k), ε_is ∼ N(0, σ_ε^2), and ξ_ik and ε_is are mutually independent for all
i, k, and s. We set the number of study participants to n = 50, the number of grid points
to p = 3000, the variances of the scores to λ_k = 0.5^{k−1} for k = 1, . . . , 4, and the standard
deviation of the noise to σ_ε = 2. These parameters can be adjusted to provide different
signal to noise ratios. Here we considered a relatively high-dimensional case, p = 3000,
even though the fpca.face [329, 330] function can easily handle much higher dimensional
examples. For the orthonormal functions, φ_k(·), we used the first four Fourier basis functions

φ_1(s) = √2 sin(2πs), φ_2(s) = √2 cos(2πs), φ_3(s) = √2 sin(4πs), φ_4(s) = √2 cos(4πs).
FIGURE 3.1: True eigenfunctions φk (·) with k = 1, . . . , 4 used for simulating the data in
model (3.2).
Figure 3.1 displays these four eigenfunctions on [0, 1]. The first two eigenfunctions have
a period of 1 with slower fluctuations (changes in the first derivative) and the second
two eigenfunctions have a period of 1/2 with faster fluctuations. Note that by construction
∫_0^1 {φ_k(s)}^2 ds = 1, but it can be shown that ∫_0^1 {φ″_3(s)}^2 ds = 16 ∫_0^1 {φ″_1(s)}^2 ds, where f″(·)
denotes the second derivative of the function f (·). In addition, a straight line has an inte-
gral of the square of the second derivative equal to zero. The integral of the square of the
second derivative is used extensively in statistics as a measure of functional variability and
a penalty on this measure is often referred to as a smoothing penalty. This is different from
the mathematical analysis definition of the smoothness of a function which is defined by the
number of continuous derivatives. Note that both φ1 (·) and φ3 (·) are infinitely continuously
differentiable while the integral of the square of the second derivative of φ3 (·) is 16 times
larger compared to the integral of the square of the second derivative of φ1 (·). This is a
quantification of the observed behavior of the two functions, where φ3 (·) fluctuates more
than φ1 (·). There is substantial confusion between these two concepts and exactly how they
are useful in practice. In statistical analysis the number of continuous derivatives is con-
trolled by the choice of the basis function and the addition of a penalty on the square of the
second derivative is used to avoid over-fitting. It is this penalty that does the heavy lifting
in real-world applications.
The code here is adapted from the user manual of the fpca.face function, and all parameters
can be modified to produce a variety of data generating mechanisms.
#Number of subjects
n <- 50
#Dimension of the data
p <- 3000
#A regular grid on [0,1]
t <- (1:p) / p
#Number of eigenfunctions
K <- 4
#Standard deviation of the random noise
sigma_eps <- 2
#True eigenvalues
lambdaTrue <- c(1, 0.5, 0.25, 0.125)
#Construct eigenfunctions and store them in a p by K matrix
phi <- sqrt(2) * cbind(sin(2 * pi * t), cos(2 * pi * t), sin(4 * pi * t), cos(4 *
pi * t))
Functional data with noise are then built by first simulating the scores, ξ_ik ∼ N(0, λ_k).
The first line of code below simulates independent identically distributed N (0, 1) variables
and stores them in an n × K dimensional matrix. The second line of code re-scales this
matrix so that the entries in column k have the true variance equal to λk . This could have
been done in one line instead of two, but the code would be less pedagogical. The third
line of code constructs the signal part of the functional data by multiplying the n × K
dimensional matrix of scores xi with the K × p dimensional matrix t(phi) containing the
K eigenfunctions. The resulting n×p dimensional matrix X contains the individual functions
(without noise) stored by rows. The last step is to construct the observed data matrix W
by adding the functional signal X with an n × p dimensional matrix with entries simulated
independently from a N (0, σ2 ) distribution.
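A sketch consistent with this description is shown below; it reuses n, p, K, lambdaTrue, sigma_eps, and phi from the code above, while the seed is our own choice.

#Sketch: simulate scores, build the signal, and add noise
set.seed(1)
#Simulate iid N(0,1) scores in an n x K matrix
xi <- matrix(rnorm(n * K), n, K)
#Re-scale column k to have variance lambda_k
xi <- xi %*% diag(sqrt(lambdaTrue))
#Signal part of the functional data (n x p), one function per row
X <- xi %*% t(phi)
#Observed data: signal plus independent N(0, sigma_eps^2) noise
W <- X + sigma_eps * matrix(rnorm(n * p), n, p)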
The code shows that model (3.2) is a data generating model, which provides a platform
for creating synthetic data sets that could be used for model evaluation and refinement. A
first step in this direction is to visualize the type of functions simulated from this model.
Recall that functions are stored in the n × p dimensional matrix W.
It is impractical to plot more than two such functions in the same figure. To partially
address this problem, Figure 3.3 displays all 50 functions as a heat map, where each function
is displayed on one row and the values of the function are represented as colors (red for low
negative values, yellow for values closer to zero, and blue for high positive values). For use
of dynamic sorting and visualization of heat maps see [286]. Some structure may become
apparent from this heat map, though the noise makes visualization difficult. Indeed, to
achieve this level of contrast, colors had to be adjusted to enhance the extreme negative and
positive values. Along most rows one can see a wavy pattern with low frequency (functions
go from low to high or high to low and back). Some higher frequencies may be observed,
but are far less obvious as they are hidden both by the noise and by the higher signal in
the lower frequencies.
FIGURE 3.3: All 50 functions simulated from the model (3.2). Each function is shown in
one row where red corresponds to negative smaller values, yellow corresponds to values
closer to zero and blue corresponds to positive larger values. Color has been enhanced for
extreme values to improve visualization.
The calls to the two functions, prcomp and fpca.face, are similar, which should make
the use of fpca.face relatively familiar. The difference is mainly internal, where fpca.face
accounts for the potential noise around the functions and uses the functional structure of
the data to smooth the covariance function and its eigenfunctions.
Both the prcomp and the fpca.face functions produce eigenvectors that need to be
re-scaled by multiplication with √p to make them comparable with the true eigenfunc-
tions. Recall that the orthonormality in the functional and vector space are similar up to
a normalizing constant (due to the Riemann sum approximation to the integral). Indeed,
the functions φ_k(·) are orthonormal in L2, but data are actually observed on a grid s_j,
j = 1, . . . , p. Thus, we need to work with the vector φ_k = {φ_k(s_1), . . . , φ_k(s_p)} ∈ R^p and
not with the function φ_k : [0, 1] → R. Even though the functions φ_k(·) are orthonormal in
L2, the φ_k vectors are not orthonormal in R^p. Indeed, they need to be normalized by 1/√p
to have norm one. Moreover, the cross products are close to, but not exactly, zero because
of numerical approximations. Here L2 refers to the space of square integrable functions
with domain [0, 1]. It would be more precise to write L2([0, 1]), but we use L2 for notation
simplicity instead.
To better explain this, consider the L2 norm of any of the functions. It can be shown
that ∫_0^1 φ_k^2(t) dt = 1 for every k, which indicates that the function has norm 1 in L2. The
integral can be approximated by the Riemann sum

1 = ∫_0^1 φ_k^2(t) dt ≈ Σ_{j=1}^{p} (t_j − t_{j−1}) φ_k^2(t_j) = (1/p) Σ_{j=1}^{p} φ_k^2(t_j) = ||φ_k||^2/p = {φ_k/√p}^t {φ_k/√p} .

Here t_j = j/p for j = 0, . . . , p and ||φ_k||^2 = Σ_{j=1}^{p} φ_k^2(t_j) is the squared Euclidean norm of the
vector φ_k ∈ R^p. This is different from, though closely related to, the L2 norm in the functional
space on [0, 1]. The norm in the vector space is the Riemann sum approximation to the
integral of the square of the function. The two have the same interpretation when the
observed vectors are divided by √p (when data are equally spaced and the distance between
grid points is 1/p).
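The sketch below makes this comparison explicit for the simulated matrix W: both sets of estimated eigenvectors are re-scaled by √p so that they are comparable, up to sign flips, with the true eigenfunctions; the call arguments are illustrative rather than the book's exact choices.

#Sketch: raw PCA versus FPCA, with eigenvectors re-scaled by sqrt(p)
library(refund)
pca_raw  <- prcomp(W, center = TRUE)
fpca_fit <- fpca.face(W, npc = 4)
phi_raw    <- pca_raw$rotation[, 1:4] * sqrt(p)      #raw estimated eigenfunctions
phi_smooth <- fpca_fit$efunctions[, 1:4] * sqrt(p)   #smooth estimated eigenfunctions
#Approximate orthonormality on the grid: should be close to the identity matrix
round(crossprod(phi_smooth) / p, 2)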
Figure 3.4 displays the first four true eigenfunctions as solid blue lines. For one simulated
data set, it also displays the first four estimated eigenfunctions using raw estimators obtained
using the function prcomp (shown as gray lines) and functional estimators using the function
fpca.face (shown as dotted red lines). Note that the raw estimators are much noisier than
the true functions and would be much harder to interpret in the context of an application.

FIGURE 3.4: True (solid blue lines), raw estimated (gray solid lines), and smooth estimated
(red dotted lines) eigenfunctions. Raw estimators are obtained by conducting a PCA analysis
of the raw data using the prcomp function. The smooth estimators are obtained using the
fpca.face function.

FIGURE 3.5: True (solid blue line), raw estimated (gray solid line), and smooth estimated
(red dotted lines) eigenvalues. Raw estimators are obtained by conducting a PCA analysis
of the raw data using the prcomp function. The smooth estimators are obtained using the
fpca.face function.
Moreover, it is hard to believe that a true underlying process would change so much for
a minuscule change in time. In contrast, the FPCA produces much smoother and more
interpretable eigenfunctions that are closer in mean square error to the true eigenfunctions.
Neither the raw nor the smooth estimators of the eigenfunctions are centered on the
true eigenfunctions. However, the smooth estimators seem to track pretty closely to an
imaginary smooth average through the raw estimators. This happens because we only show
results from one simulation. Unbiased estimators would refer to the fact that the average of
the red (or gray) curves over multiple simulations is close to the true eigenfunction. This is
not required in one simulation, though it is reassuring that the general shape of the function
is preserved.
Figure 3.5 displays the true eigenvalues as a solid blue line (λ_k = 0.5^{k−1} for k = 1, . . . , 4
and λ_k = 0 for k ≥ 5). The dotted gray line displays the estimated eigenvalues using raw
PCA, which ignores the noise in the data. Note that all eigenvalues are slightly overesti-
mated, especially for k ≥ 5. Moreover, the decline in the raw estimated eigenvalues is very
slow, almost linear, for k ≥ 5. In applications, such a slow decrease in eigenvalues could be
due to the presence of white noise, though, unfortunately, structured noise (with substan-
tial time or space correlations) may also be present. In contrast, the estimated eigenvalues
using FPCA track the true eigenvalues much closer. They are not identical due to sampling
variability, but indicate much lower bias compared to the raw PCA approach.
Figure 3.6 displays the true (solid blue lines), and smooth estimated (red dotted lines)
eigenfunctions with 80% missing data. Results illustrate that the FPCA recovers the overall
structure of the true eigenfunctions even when 80% of the data are missing. Moreover, the
eigenfunction estimators are quite smooth and comparable to the ones obtained from the
complete data. Part of the reason is that every function is observed at p = 3000 data
points and even with 80% missing data there still remain ∼ 600 data points per function.
This many points are enough to recover the overall shape of each function in spite of
the low signal-to-noise ratio. The iteration between functional and covariance smoothing
is specific to functional data and relies strongly on the assumptions of smoothness of the
underlying signal. Such assumptions would be highly tenuous in a non-functional context,
where information is not smooth across covariates. Indeed, consider the case when the data
matrix contains age, body mass index (BMI), and systolic blood pressure (SBP) arranged
in this order. In this case, age is “not closer” to BMI than to SBP, as ordering of covariates
is, in fact, arbitrary.
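A short sketch of this experiment, with the missingness pattern generated completely at random, is shown below; fpca.face accepts NA entries in the data matrix, which is how the missing observations are encoded.

#Sketch: set 80% of the observations to missing and re-estimate the FPCA
set.seed(2)
W_missing <- W
miss_idx  <- sample(length(W), size = round(0.8 * length(W)))
W_missing[miss_idx] <- NA
fpca_miss <- fpca.face(W_missing, npc = 4)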
When missing data are pushed to extremes, say to 99.9% missing, this approach no
longer works appropriately and needs to be replaced with methods for sparse functional data.
Indeed, with only 2 to 4 observations per function there is not enough information to recover
the functional shape by smoothing the individual functional data first. Moreover, the FACE
approach to smoothing individual curves relies on the assumption that missing data are not
systematically missing, particularly on the boundaries. For example, for a curve that has
the first 500 observations missing, FACE uses linear extrapolation to estimate this portion
of the curve. If a large number of curves have this type of missingness, other approaches
may be more appropriate. We discuss FPCA for sparse and/or irregularly sampled data in
Section 3.3.
FIGURE 3.6: True (solid blue lines), and smooth estimated (red dotted lines) eigenfunctions
with 80% missing data. Smooth estimators are obtained using the fpca.face function, but
with 80% missing data.
including the NHANES 2011-2014 data used in this book, are available through the website
http://www.FunctionalDataAnalysis.org.
The NHANES data used in this application come from minute-level “activity profiles,”
representing the average MIMS across days of data with estimated wear time of at least
95% (1,368 minutes out of the 1,440 in a day). Specifically, if we let W_im(s) denote the
ith individual's recorded MIMS at minute s on day m, and T_im be their estimated wear
time in minutes for day m, then the average activity profile for participant i can be ex-
pressed as W_i(s) = |{m : T_im ≥ 1368}|^{-1} Σ_{m : T_im ≥ 1368} W_im(s). For this application we
use the NHANES 2011-2014 physical activity data and include all participants aged 5 to 80
with at least one day of accelerometry data with T_im ≥ 1368, resulting in a total of 11,820
participants.
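For one participant, the averaging step can be sketched as below, where mims_day is a hypothetical M_i × 1,440 matrix of minute-level MIMS values and wear_time is the corresponding vector of estimated daily wear times in minutes.

#Sketch: average activity profile over days with at least 95% estimated wear time
good_days <- which(wear_time >= 1368)
Wi <- colMeans(mims_day[good_days, , drop = FALSE])   #length-1440 profile W_i(s)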
Note that although NHANES is a nationally representative sample, obtaining repre-
sentative estimates for population quantities and/or model parameters requires the use of
survey weights and/or survey design [156, 185, 274]. Because the intersection of survey
statistics and functional data analysis is a relatively new area of research [32, 33, 34, 226]
with few software implementations available, we do not account for survey weights in the
data application here or in subsequent chapters.
3.1.2.2 Results
Figure 3.7 displays the first four eigenfunctions estimated using FPCA (left panel) and PCA
(right panel). The number of PCs is coded by color: PC1 red, PC2 green, PC3 blue, and
PC4 magenta; see the legend in the right-upper corner of the figure. A quick comparison
indicates that the two approaches provide almost identical patterns of variability. This is
reassuring as both approaches are estimating the same target. Results are also consistent
with our simulation results in Figure 3.4 where the FPCA and PCA results showed very
similar average patterns, but the PCA results were much noisier.
In NHANES, the PCA results are also noisier, but the noise is not as noticeable, at least
at first glance. This is likely due to the fact that the results in Figure 3.4 are based on 50
functions, whereas the results in Figure 3.7 are based on 11,820 functions.

FIGURE 3.7: First four eigenfunctions estimated using FPCA (left panel) versus PCA (right
panel) in NHANES. The number of the PC is coded by color: PC1 red, PC2 green, PC3
blue, and PC4 magenta.

However,
even in this case it would be difficult to use the PCA results because one cannot explain the
small, but persistent, variations around the overall trends. Indeed, during a quick inspection
of the right panel in Figure 3.7, it seems reasonable to discard the small variations around
the pretty clearly identified patterns.
We will have a closer look at the first two eigenfunctions. The first PC (red) is slightly
negative during the night (midnight to around 7 AM), strongly positive during the day
(from around 9 AM to 9 PM), and decreasing in the evening (10 PM to 12 AM). The
second PC (green) is strongly positive between 9 AM and 3 PM and strongly negative
during late afternoon and night.
In spite of the wide acceptance of PCA and SVD as methods for data reduction, their
widespread use in applications is limited by difficulties related to their interpretation. In-
deed, one often needs to conduct PCA and then see if the results suggest simpler, easier to
understand summaries based directly on the original data. Figure 3.8 is designed to further
explore the interpretation of PCs. The left top panel displays the physical activity trajec-
tories of 10 individuals in the 90th percentile of scores on PC1 (thin red lines). The thick
red line is the average trajectory over individuals whose scores are in the 90th percentile.
These trajectories are compared to physical activity trajectories of 10 individuals in the 10th
percentile of scores on PC1 (thin blue lines). The thick blue line is the average trajectory
over individuals whose scores are in the 10th percentile. In this example (PC1, shown in
left top panel) the difference between the trajectories shown in red and in blue is striking.
Indeed, the trajectories shown in red indicate much higher physical activity during the day
compared to those shown in blue. Moreover, some of the trajectories shown in blue corre-
spond to higher activity during the night. It may be interesting to follow up and investigate
whether the higher physical activity during the night is due to systematic differences or, as
suggested by the plot, to a few outlying trajectories.
The other three panels show similar plots for PC2, PC3, and PC4, respectively. Differ-
ences between groups continue to make sense, though they are less striking. This happens
because the higher numbered PCs explain less variability. However, results for PC2 are still
quite obvious: trajectories corresponding to the 90th percentile (red) correspond to much
higher physical activity during the night, much lower activity from early morning to about 2 PM,
and much higher activity in the evening compared to trajectories corresponding to the 10th
percentile (blue). Notice, again, that the difference between the thick red line (average of the
trajectories with scores in the upper decile of PC2) and thick blue line (average of the tra-
jectories with scores in the lower decile of PC2) is reflected in the shape of PC2 displayed in
Figure 3.7. The difference is that Figure 3.8 provides the contrast between observed groups
of trajectories of physical activity.
FIGURE 3.8: Average of MIMS profiles with scores in the highest 90th (thick red lines) and
lowest 10th (thick blue lines) percentiles for each of the first four principal components in
NHANES. For each group, a sample of 10 individual MIMS profiles for those individuals in
the 90th (thin red lines) and 10th (thin blue lines) percentiles of PC scores is also shown.
of steps or active seconds in every minute. In many other applications, data are collected
directly in discrete format (e.g., binary, multinomial, counts) [91, 145, 147, 267, 268, 288]
or have strong departures from normality [92, 280].
The question is whether the FPCA ideas can still be used in this context. This is not
straightforward, as multivariate PCA and FPCA are developed specifically for Gaussian
data and provide a decomposition of the observed variability along orthogonal directions of
variation. The corresponding separability of the sum of squares (variance) along directions
of variation (eigendirections) does not extend to non-Gaussian outcomes and a different
strategy is needed. Here we discuss three approaches for conducting such analyses, highlight
their practical implementations in R, and compare computational feasibility and scalability.
Specifically, the generalized FPCA (GFPCA) model assumes that the observed data W_i(s_j)
have a distribution with mean μ_i(s_j), where

h{μ_i(s_j)} = η_i(s_j) = β_0(s_j) + Σ_{k=1}^{K} ξ_ik φ_k(s_j) ,    (3.3)

η_i(·) is the linear predictor, h(·) is a link function, β_0(·) is a function that varies along
the domain S, φ_k(·) are assumed to be orthonormal functions in the L2 norm, and ξ_ik are
mutually uncorrelated random variables. The distribution of W_i(s_j) could belong to the
exponential family or any type of non-Gaussian distribution. The expectation is that the
number of orthonormal functions, K, is not too large and the functions φ_k(·), k = 1, . . . , K,
are unknown and will be estimated from the data. The advantage of this conceptual model is
that, once the functions φk (·) are estimated, the model becomes a generalized linear mixed
model (GLMM) with a small number of uncorrelated random effects. This transforms the
high-dimensional functional model with non-Gaussian measurements into a low-dimensional
GLMM that can be used for estimation and inference. The random effects, ξik , have a specific
structure, as they are assumed to be independent across i and k. These coefficients can be
viewed as independent random slopes on the orthonormal predictors φk (sj ). In practice,
often φ1 (·) ≈ 1, which is, strictly speaking, a random intercept. Thus, the model can be
viewed as a GLMM with one random intercept and K −1 random slopes, but this distinction
is not necessary in our context. The independence of random effects, orthonormality of φk (·),
and the small number of random effects, K, are important characteristics of the model that
will be used to ensure that methods are computationally feasible.
A general strategy is to estimate η̂_i(s_j) = h{μ̂_i(s_j)} and then conduct FPCA on
these estimators. We now describe three possible ways of achieving this and illustrate
them using the binary active/inactive profiles from the NHANES 2011-2014 data. To ob-
tain binary active/inactive profiles, we first threshold participants’ daily MIMS data as
W*_im(s) = 1{W_im(s) ≥ 10.558}, where W_im(s) corresponds to the ith individual’s MIMS
unit on day m = 1, . . . , M_i at minute s. We then define their active/inactive profile as
W*_i(s) = median{W*_im(s) : m = 1, . . . , M_i}. For example, if M_i = 7, W*_i(s) is 0 if study
participant i was inactive at time s for at least 4 days and 1 otherwise. When M_i is even,
say M_i = 2K_i, the median is defined as the (K_i + 1)th largest observation. The threshold
for active/inactive on the MIMS unit scale was 10.558, as suggested in [142]. In this sec-
tion, we use the notation W*_i(s) to represent the estimated binary active/inactive profiles
in NHANES at every s. In general, we have used W_i(s) to denote the continuous MIMS
observations at every point or to refer to a general functional measurement.
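For one participant, this construction can be sketched as follows, again using a hypothetical M_i × 1,440 matrix mims_day of minute-level MIMS values; the order statistic in the last line matches the convention described above for both odd and even numbers of days.

#Sketch: binary active/inactive profile for one participant
active_day <- 1 * (mims_day >= 10.558)   #threshold suggested in [142]
#Minute-by-minute median across days, taken as the (floor(M_i/2) + 1)-th largest value
Wi_star <- apply(active_day, 2,
                 function(x) sort(x, decreasing = TRUE)[floor(length(x) / 2) + 1])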
Figure 3.9 displays the binary active/inactive data for four study participants as a
function of time starting from midnight and ending at midnight. Each dot corresponds to
one minute of the day with zeros and ones indicating inactive and active minutes, respec-
tively. The red lines show the smoothed estimates of these binary data as a function of the
time of day, with smoothing done by binning the data into 60-minute windows and taking
binned averages.

FIGURE 3.9: A sample of 4 individual active/inactive profiles. The red lines represent
smoothed estimates of the profiles obtained by binning the data into 60-minute windows
and taking the binned mean, corresponding to the estimated probability of being active at
a given period of the day.

These estimates do not borrow strength across study participants, are not
specifically designed for binary outcome data, and are highly dependent on the smoothing
procedure (i.e., choice of bin width). Moreover, many study participants will have an es-
timated probability of being active of zero, which is unlikely and cannot be mapped back
to the linear scale of the predictor (logit of zero does not exist). For these reasons we are
interested in fitting models of the type (3.3) that can extract a small number of directions
of variation in the linear predictor space and borrow strength across study participants. In
the next sections we review three different approaches to achieve this: fast GFPCA using
local mixed effects, binary PCA using the Expectation Maximization (EM) algorithm, and
Functional Additive Mixed Models (FAMM).
The next step of fitting model (3.3) is to interpolate the functions φk (·) on the original
grid of the data. This is necessary when the previous steps are applied in non-overlapping
intervals because the function is then estimated at a subgrid of points. If overlapping inter-
vals are used, this step is not necessary. For simplicity of presentation, below we use linear
interpolation. In practice this can be made more precise using the same B-spline basis used
for estimating the covariance operator by refund::fpca.face. This is the approach imple-
mented in the fastGFPCA package. The estimated eigenfunctions are stored in the 1,440 × K
dimensional matrix phi_local. The final step is to fit model (3.3) conditional on the esti-
mated φ_k(s), whose random effects are assumed to be uncorrelated. This is implemented in
the bam function [180, 320, 321] in the mgcv package, which allows nonparametric modeling
of the fixed intercept β_0(s) while accounting for random effects. Note that the function
bam uses the expression s(SEQN, by=phi_1, bs="re") to specify the random effect for the
first eigenfunction φ_1(·). In this implementation the SEQN variable (the indicator of study
participants) needs to be a factor variable.
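A sketch of this final fit is shown below for K = 4 eigenfunctions, assuming a long-format data frame df_long with the binary outcome Y, the minute of the day sind, the participant factor SEQN, and the interpolated eigenfunctions stored in columns phi_1, . . . , phi_4; the basis choices and fitting options are illustrative rather than fixed requirements.

#Sketch: final GLMM fit conditional on the estimated eigenfunctions
library(mgcv)
fit_gfpca <- bam(
  Y ~ s(sind, bs = "cc", k = 10) +          #smooth population mean beta_0(s)
    s(SEQN, by = phi_1, bs = "re") +        #independent random slope on phi_1
    s(SEQN, by = phi_2, bs = "re") +
    s(SEQN, by = phi_3, bs = "re") +
    s(SEQN, by = phi_4, bs = "re"),
  family = binomial, data = df_long,
  method = "fREML", discrete = TRUE         #options that speed up estimation
)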
One could think of the last step as refitting model (3.3), as the model is initially fit
locally to obtain initial estimators of β0 (s) and Xi (s). However, this is different because
we condition on the eigenfunctions, φk (s), which are unavailable at the beginning of the
algorithm. One could stop after applying FPCA to the local estimates and interpolate
the predicted log-odds η̂_i(s_j) = h{μ̂_i(s_j)} obtained from FPCA to obtain subject-specific
predictions. While this approach provides good estimates of φk (·), the estimators of the
subject-specific coefficients, ξik , and of the corresponding log-odds tend to be shrunk too
much due to the binning of the data during estimation; see [167] for an in-depth description
of why this extra step is essential.
The last step of the fitting algorithm can be slower, especially for large data sets. One
possibility to speed up the algorithm is to fit the model (3.3) to sub-samples of the data,
especially when the number of study participants is very large. For reference we will call
the method applied to the entire data fastGFPCA and the method applied to four subsets
of the data as modified fastGFPCA. In Section 3.2.5 we will show that the two methods
provide nearly indistinguishable results when applied to the NHANES data. As we have
shown, the method can be implemented by any user familiar with mixed model software,
though the R package fastGFPCA [324] provides a convenient implementation. In a follow-up
paper [341] it was shown that using Bayesian approaches may substantially speed up the last
step of the algorithm. This is due to the smaller number of random effects, independence
of random effects, and orthonormality of the functions φk (s), k = 1, . . . , K.
Figure 3.10 displays the same data as Figure 3.9, where the dots still represent the data
and the red lines provide a loess smoothing of the binary data. The blue lines indicate the
estimates of the logit{Pr(Wi∗ (s) = 1)} based on local GLMMs, while the gold lines represent
the same estimators after joint modeling conditioning on the eigenfunction estimators. When
applied to these four study participants, the procedure seems to produce sensible results.
Figure 3.11 displays the distribution of estimated probabilities (top panels) and log-odds
(bottom panels) of being active as a function of time of day for all study participants in this
analysis. Results for the complete fast-GFPCA approach are shown in the left panels, while
the binned estimation procedure using local GLMMs are shown in the right panels. For
every 30-minute interval a boxplot is shown for the values of all study participants. Note
that during the night, when the probability of being active is much lower, the complete
approach (left panels) provides more reasonable results as there is better and continuous
separation of individuals by their probability of being active during the night. A few study
participants are more active during the night (note the dots above 0.5 probability of being
active during the night), though fewer individuals are identified following the smoothing
procedure. Increased activity probability during the night could correspond to night-shift
work, disrupted circadian rhythms, or highly disrupted sleep. During the day, irrespective of
method, the median probability of being active culminates around 0.5 somewhere between
11 AM and 12 PM, stays high during the day and decreases in early evening. This agrees
with what is known about physical activity, but it is useful to have this quantification and
confirmation.
The local estimation method for GFPCA discussed in this section is novel and methodological
work remains to be done. Nevertheless, the method is (1) grounded in the well-understood GLMM
inferential methodology; (2) straightforward to implement for researchers who are familiar
with standard mixed model software; and (3) computationally feasible for extremely large
datasets. As a result, we recommend this as a first-line approach and potential reference for
other methods.

FIGURE 3.10: Results from the local method of estimating GFPCA on the NHANES
data using 30-minute bins. Red lines correspond to 60-minute binned averages of the
original data, while blue and gold lines present the inverse-logit transformed estimates
of logit{Pr(W*_i(s) = 1)} of the point-wise and smoothed (final fast GFPCA) estimates,
respectively.
FIGURE 3.11: Population distribution of probability of being active (top panels) and log
odds (bottom panels) obtained from fast-GFPCA (left panels) and the binned estimation
procedure fit using local GLMMs (right panels).
mean function, β0 (s). The results can be sensitive to these choices and options should
be explored beyond the current default of one eigenfunction (npc=1) and eight B-spline
basis functions for estimating β0 (s) (Kt=8). For comparison purposes, in our NHANES
application with binary active/inactive profiles we use the registr::bfpca() function with
four eigenfunctions and eight basis functions for estimating β0 (s). We refer to this method
as the variational Bayes FPCA, or vbFPCA.
where Xi (·) ∼ GP(0, Σ) is a zero-mean Gaussian Process with covariance Σ. FAMM uses
penalized splines to model the study participant functional effects, Xi (s). This approach
works with a variety of response families, uses the entire data set to estimate Xi (s), and
conducts estimation, smoothing, and inference simultaneously. The approach can also in-
corporate additional covariates and random effects and was implemented in R using the
connection between nonparametric regression models and mixed effects models. An impor-
tant contribution of the FAMM methodology was to show how nonparametric functional
models can be transformed into nonparametric regression models that can then be fit using
functions from the mgcv package.
To illustrate the procedure, consider a subset of the NHANES data (FAMM can-
not be fit on the full data) and fit a Gamma regression for the functional outcomes
Wi (s) ∼ Gamma{1, ηi (s)} and E{Wi (s)|Xi (s)} = ηi (s) = β0 (s) + Xi (s). This is the default
R implementation of Gamma regression in mgcv, which models the shape parameter using
a fixed scale parameter of 1. Under the FAMM approach, this can be fit using the following
syntax based on the bam function in the mgcv package.
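A sketch of such a call is given below, assuming a long-format data frame nhanes_sub with columns W, index, and id_fac (a factor); the marginal bases and the log link are our own choices for illustration and should be adapted to the application.

#Sketch: FAMM-type fit with a subject-specific functional random effect
library(mgcv)
fit_famm <- bam(
  W ~ s(index, k = 10) +                                    #population mean beta_0(s)
    ti(index, id_fac, bs = c("cr", "re"), k = c(10, 5),     #subject deviations X_i(s)
       mc = c(FALSE, TRUE)),
  family = Gamma(link = "log"), data = nhanes_sub,
  method = "fREML"
)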
Here the functional data are stored in long format. Specifically, the data are contained
in the vectors W (functional data), index (functional domain), and id_fac (factor variable
corresponding to subject identifier). The ti term syntactically specifies a tensor product
smoother of the functional domain and subject identifier, which adds the following subject-
specific random effect to the linear predictor
X_i(s) = Σ_{k=1}^{K} ξ_ik B_k(s) .
Here {B_k(s) : k = 1, . . . , K} is a spline basis with K basis functions and the coefficients
are estimated as ξ̂_ik = E[ξ_ik | W], the conditional expectation of ξ_ik given the observed
data. The functional argument mc=c(F,T) in the ti function imposes the identifiability
constraint Σ_{i=1}^{n} X_i(s) = 0 for all s ∈ S instead of the less stringent default constraint
Σ_{i=1}^{n} Σ_{s} X_i(s) = 0. The mc argument is shorthand for “marginal constraint” and should
be used with caution, while the argument k=c(10,5) specifies that K = 10 marginal basis
functions should be used for estimating Xi (s). The memory requirements and computational
time increase substantially with even small increases in K, which could be problematic when
the subject-specific random effects have complex shapes.
Once Xi (s) are estimated, one can apply FPCA on these estimators and then proceed
as in Section 3.2.2. This was not the intended use of FAMM, but it provides an alterna-
tive to the local mixed effects models if we want to conduct dimension reduction on the
linear predictor scale for non-Gaussian data. Given the estimates X i (s), we may estimate
Cov{Xi (s), Xi (u)} using the moment estimator of Σ = Cov{Xi (s), X i (u)}. This covariance
matrix can be decomposed using either SVD or PCA to obtain the eigenvectors of Σ. The
model can be re-fit using the first K of these eigenvectors, assuming random effects are
independent. One potential limitation of FAMM is that its computational complexity is
cubic in the number of subjects. Thus, scaling up the approach could be difficult, though
fitting the models on subsets or subsamples of the data may alleviate this problem. This
approach can be implemented using standard software for estimating GLMMs as discussed
in Section 3.2.2. Here we use the gamm4 [276] function, which is based on the gamm function
from the mgcv [314, 319] package, but uses lme4 [10] rather than nlme [230] as the fitting
method.
FIGURE 3.12: Estimated population mean function (first row) and the first four estimated
eigenfunctions (rows 2–5) in NHANES. Model estimates are presented as red (vbFPCA)
or blue lines (fast-GFPCA). The two leftmost columns correspond to vbFPCA (Kt=8 in
column 1 and Kt=30 in column 2). The six rightmost columns correspond to fast-GFPCA
with different input parameters (overlapping versus non-overlapping windows) and window
sizes (w = 6, 10, 30). Estimates of the population mean function based on the modified
fast-GFPCA are displayed as dashed lines.
This suggests that vbFPCA provides biased estimators of the mean and that the bias may affect the latent random process variation. Simulation results in [167] further reinforce this hypothesis. Updates to the registr package [323] may fix this problem in the future.
3.2.6 Recommendations
We suggest using fast-GFPCA as the first-line approach to estimating GFPCA, exploring sensitivity to the choice of window size and to overlapping versus non-overlapping windows. The fastGFPCA function [324] is available, the step-by-step description in this section can be used as a guide, and one can also check the associated website http://www.FunctionalDataAnalysis.com for specific implementations. A modified version of the approach, where sub-samples of the data are used in the last step of the fitting algorithm, can substantially increase computational feasibility. Other approaches, including vbFPCA and FAMM, can also be used and compared to fast-GFPCA to see if they provide similar results. Subsampling of individuals and undersampling or smoothing of observations within individuals may help reduce the computational burden and produce results that can be compared with existing solutions.
where $\beta_0(\cdot)$ is a smooth population mean, $X_i(s)$ is a zero-mean Gaussian process with covariance $\Sigma$, and $\epsilon_{ij}$ are independent $N(0, \sigma_\epsilon^2)$ random variables. An estimator of the mean function, $\beta_0(s)$, can be obtained by smoothing the pairs $\{s_{ij}, W_i(s_{ij})\}$ under the independence assumption using any type of smoother. We prefer a penalized spline smoother, but most other smoothers would work as well. Residuals can then be calculated as $r_i(s_{ij}) = W_i(s_{ij}) - \hat{\beta}_0(s_{ij})$.
To estimate and diagonalize $\Sigma$, the idea is to consider every pair of observations $(s_{ij_1}, s_{ij_2})$, calculate the products $r_i(s_{ij_1}) r_i(s_{ij_2})$, and then smooth these products using a bivariate smoother. An important methodological contribution was provided by [328],
who developed a fast method for covariance estimation of sparse functional data and imple-
mented it in the face package via the face::face.sparse() function [329]. We introduce
the motivating dataset, discuss how to apply face.sparse() to such data sets, and provide
the context for the methods.
FIGURE 3.13: Distribution of sampling points for each study participant in the CONTENT
study. Each study participant is shown on one row and each red dot indicates a time from
birth when the child was measured.
One option is to bin the data along the functional domain and then use MoM estimators. However, results may depend on the bin size and binning strategy because data are sparse and/or unequally sampled in particular areas of the functional domain.
Figure 3.13 displays the days from birth when each study participant was measured.
Each red dot represents a time of a visit from birth and each study participant is shown on
a different row. The higher density of points in the first three months is due to the design of the study: the intention was to collect weekly measurements during the first three months. After the first three months, observations become sparser and only a few children have observations for more than one year. Observations for each child are quite regular, reflecting the design of the experiment, though there are missing visits and visits are not synchronized
across children.
Figure 3.14 displays the z-scores for length (zlen, shown in blue) and weight (zwei, shown
in red) for four study participants. The data for all study participants was displayed before
in Figure 1.9, though here we show a different perspective. Indeed, note that for Subject 49 the two time series decrease almost continuously from close to 0 at the first measurement to around −2 sometime after day 600. The decrease seems to be synchronized, indicating that the baby did not grow in length or weight as fast as other babies, at least relative to the
WHO standard. Subject 73 has increasing trajectories both for zlen and zwei, where the
baby started around −1.5 for both curves and ended around 0, or close to the WHO mean
around day 500 from birth. The increase in zwei (z-score for weight) was faster for the first
200 days, while the increase in zlen (z-score for length) was more gradual. The data for
Subject 100 was only collected up to day 250 from birth and indicates that the baby was
born longer than the WHO average, grew in length relative to the WHO average in the first
150 days, though the last couple of measurements indicate that the baby’s length went back
FIGURE 3.14: Example of trajectories of z-score for length (zlen, blue points) and weight
(zwei, red points) for four children in the CONTENT data.
to the WHO average. The z-score for weight seemed to have an overall increasing pattern,
with the exception of the last measurement. Finally, Subject 112 had data collected up to
around day 500, had a roughly constant z-score for length (baby was close to the WHO
average at every age), but a much lower than average z-score for weight (baby was much
below the WHO average for weight at every age). A couple of times, around day 200 and 330,
there were increases in z-score for weight, but they were followed by further decreases. While
far from complete, the data for these four babies provides an indication of the complexity of
the growth trajectories and a glimpse into the close and possibly heterogeneous interactions
between the length and weight processes.
The variable id is the identifier of individuals in the data and is repeated for every row
that corresponds to an individual. The column labeled ma1fe0 is the variable that indicates
the sex of the child with males being indicated by a 1 and females indicated by a 0. For
example, the first study participant id=1 is a female and the second study participant
id=2 is a male. The sex indicator is repeated for each study participant. The third column
weightkg is the weight in kilograms and the fourth column height is the height (or length) of the baby in centimeters. The fifth column agedays is the age in days of the baby. In our models this is $t_{ij}$ and it depends on the subject $i$ (id=i) and the observation number $j$ for that baby. For example, $t_{11} = 61$ days, indicating that the first measurement for the first baby was taken when she was 61 days old. Similarly, $t_{13} = 71$ days, indicating that the third measurement for the first baby was taken when she was 71 days old. The sixth column cbmi is the child's BMI. The seventh column zlen is the z-score for length of the child, while the
eighth column zwei is the z-score for weight. The data set contains two additional variables
that were omitted for presentation purposes.
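For reference, the columns described above can be inspected directly; this is a minimal sketch, assuming the data are stored in a data frame named content_df with the column names used in the text.

# quick look at the CONTENT variables described above
head(content_df[, c("id", "ma1fe0", "weightkg", "height", "agedays", "cbmi", "zlen", "zwei")])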
3.3.3 Implementation
There are several implementations of FPCA for sparse/irregular data in R. In particular, the
face package contains the face.sparse() function, which is easy to use, fast, and scalable.
It also produces prediction and confidence intervals. Assume that the CONTENT data is contained in the content_df variable. The code below shows how to create the data frame and fit face::face.sparse() for the zlen variable.
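A minimal sketch follows, using the column names of content_df described earlier; the fit object name fit_zlen and the prediction grid are illustrative, and face::face.sparse() expects a data frame with columns argvals, subj, and y.

library(face)
# build the long-format data frame expected by face.sparse()
data_zlen <- data.frame(
  argvals = content_df$agedays,   # time from birth, in days
  subj    = content_df$id,        # subject identifier
  y       = content_df$zlen)      # z-score for length
# fit sparse FPCA; argvals.new is an assumed prediction grid
fit_zlen <- face.sparse(data_zlen, argvals.new = seq(1, 701, by = 10))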
This fit takes only seconds on a standard laptop and provides automatic nonparametric
smoothing of the covariance operator, which includes optimization over the smoothing pa-
rameter. Below we show how to extract the estimated population mean, covariance, variance,
and correlation. We also show how to produce pointwise prediction intervals for observations.
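A hedged sketch of this extraction, using the fit_zlen object from the sketch above, follows; the component names mu.new, Chat.new, and Cor.new are taken from the face package documentation and may differ across package versions.

# extract estimates on the prediction grid (component names are assumptions)
m        <- fit_zlen$mu.new          # estimated population mean
Cov      <- fit_zlen$Chat.new        # smoothed covariance estimate
Cov_diag <- diag(Cov)                # pointwise variance
Cor      <- fit_zlen$Cor.new         # smoothed correlation estimate
# pointwise mean plus/minus two standard deviations, as shown in Figure 3.15
lower <- m - 2 * sqrt(Cov_diag)
upper <- m + 2 * sqrt(Cov_diag)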
Figure 3.15 displays these estimators. The panel in the first row and column displays
the estimated population mean function (m in R code and β0 (s) in the statistical model).
The trajectory of the mean function increases between day 0 and 200 (from around −0.6 to
around −0.3), decreases between day 200 and 450 (from −0.3 to −0.4), and increases again
from day 450 to day 700 (from around −0.4 to −0.25). The mean is negative everywhere
indicating that the average length of babies was lower than the WHO average. The increasing
trend indicates that babies in this study get closer in length to the WHO average as a
FIGURE 3.15: Some results obtained using the face::face.sparse function with the CONTENT data.
Panel (1,1): Smooth estimator of the population mean as a function of time from birth. Panel
(1,2): Smooth estimator of the population standard deviation as a function of time. Panel
(2,1): Complete CONTENT data (gray lines), mean and mean ± 2 standard deviations;
Panel (2,2): Smooth estimator of the correlation function.
function of time. The apparent decrease in the mean function between day 200 and 450
could be real, but may also be due to the increased sparsity of the data after day 200. Indeed,
it would be interesting to study whether babies who were lighter in the first 100 days were
more likely to stay in the study longer. The panel in the first row, second column displays the estimated pointwise standard deviation function (sqrt(Cov_diag)) as a function of time from birth. Note that the function is quite close to 1 across the functional domain, with a slight dip from around 0.94 in the first 200 days to 0.90 around day 500. This is remarkable as the data are normalized with respect to the WHO population, indicating that the within-population variability of the length of children is quite close to the WHO
standard. The panel in the second row, first column displays all the CONTENT data as gray lines, the mean function in blue, and the mean plus and minus two standard deviations in red. The prediction interval seems to capture roughly 95% of the data, which suggests that the pointwise marginal distributions are not too far from Gaussian. The panel in the second row, second column displays the smooth estimate of the correlation function (Cor). The plot indicates high positive correlations even 700 days apart, which is consistent with biological processes of child growth (babies who are born longer tend to stay longer). Correlations are in the range of 0.9 and above for time differences of 100 days or less, between 0.8 and 0.9 for time differences between 100 and 400 days, and below 0.8 for time differences longer than 400 days.
FIGURE 3.16: CONTENT data: first three principal components estimated using sparse
FPCA.
Figure 3.16 displays the first three principal components (PCs) estimated based on the
CONTENT data. These components explain more than 99% of the observed variability
after smoothing (90% for PC1, 8.6% for PC2, and 1.4% for PC3). PC4 has an eigenvalue
which is an order of magnitude smaller than the eigenvalue for PC3.
The first PC (shown in blue) is very close to a random intercept, with a curvature that
is barely visible. The second component (shown in green) is very close to a random slope,
though some curvature is still present in this component. It is remarkable that 98.6% of
the random effects variability (excluding the error of the random noise) can be captured
by a random intercept and a random slope. This is good news for someone who analyzes
these data using such a model. However, the sparse FPCA allowed us to quantify how
much residual variability is explained (or lost) by using such a model. Moreover, the third
component allows for trajectories to bend somewhere between day 200 and 400 from birth.
This is an important feature that allows the model to better adapt to the observed data.
Indeed, let us have a second look at the plot in Figure 3.14 and inspect the z-score for length (zlen, shown as blue points) for each of the four study participants shown. For Subjects 49,
73, and 112 it is quite apparent that allowing for a change of slope between day 200 and
400 would visually improve the fit. This is not a proof, as we are looking at only 4 study
participants, but it provides a practical look at the data guided by sparse FPCA.
Another way to think about the problem is that PC3 could easily be approximated by a
linear or quadratic spline with one knot around day 300 from birth. Thus, the sparse FPCA
model suggests a linear mixed effects model with a random intercept, one random slope
on time, and one random slope on a quadratic spline with one knot at 300. This becomes
a standard mixed effects model. Such a direction could be considered for modeling and it
is supported by the sparse FPCA analysis. While we do not pursue that analysis here, we note that first running sparse FPCA and then learning the structure of the random effects could be an effective strategy for fitting simpler, traditional models without worrying too much about what could be lost. A minimal sketch of such a model fit is given below.
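This sketch assumes content_df with columns zlen, agedays, and id, and uses a truncated quadratic spline term with a knot at day 300.

library(lme4)
# quadratic truncated-power spline term with a knot at day 300
content_df$spl300 <- pmax(content_df$agedays - 300, 0)^2
# random intercept, random slope on time, and random slope on the spline term
fit_lmm <- lmer(
  zlen ~ agedays + spl300 + (1 + agedays + spl300 | id),
  data = content_df)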
We now show how to produce predictions for one study participant at a given set of points using the face.sparse fit. The dati_pred variable is a data frame with the same column structure as the original data used for model fitting. The data frame has a number of rows equal to the number of observations for the subject being predicted, nrow(dati), plus the number of grid points used for prediction, k. The first nrow(dati) rows contain the observed data for the subject. The last k rows correspond to the data that will be predicted. For example, the last k rows of the argvals variable contain the grid of points for prediction, seq. The last k rows of the response variable (w) contain NA because these values are predicted. The last k rows of the variable subj are just repeats of the subject's ID. Once this structure is complete, predictions are obtained using the following code.
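A hedged sketch of these steps follows. The names dati, dati_pred, and k mirror the description above; the response column is named y (the column referred to as w in the text), and the components y.pred and se.pred of the predict() output are assumptions based on the face package documentation.

# observed data for the study participant to be predicted (e.g., id 100)
dati <- data_zlen[data_zlen$subj == 100, ]
k <- 100
seq_pred <- seq(1, 700, length.out = k)       # grid of prediction points
# stack observed rows and rows to be predicted (response set to NA)
dati_pred <- rbind(
  dati,
  data.frame(argvals = seq_pred, subj = 100, y = NA))
pred <- predict(fit_zlen, dati_pred)
mean_pred <- pred$y.pred[-(1:nrow(dati))]     # predicted subject-specific curve
se_pred   <- pred$se.pred[-(1:nrow(dati))]    # corresponding standard errors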
The predictions are stored in the variable mean_pred while the standard errors are stored in se_pred.
Figure 3.17 displays the z-score for length for the same study participants as in Fig-
ure 3.14 (data shown as blue dots). However, the plot also displays the prediction of the
individual curves (solid red lines) as well as the pointwise 95% confidence intervals for the
subject-specific mean function (dashed red lines). This is not the same as the 95% pointwise
FIGURE 3.17: Predictions of study participants z-score for length (zlen) level together with
pointwise 95% confidence intervals for four study participants.
prediction intervals, which would be larger. The population mean function is also displayed
as a solid black line. The fits to all four data sets appear reasonable and display some
amount of curvature that is quite different from what would be obtained from a random
intercept random slope model. Consider, for example, study participant 49, whose z-score for length decreases steadily (linearly) from birth to around day 400, reaching a very low value close to −3. After that, the z-scores plateau around −2.5; the model captures this change relatively well, most likely due to the third principal component.
Sparse FPCA produces predictions both outside and inside the range of the observed
data for a particular baby. Predictions outside the range are useful to provide future pre-
dictions based on the data available for the specific baby while borrowing strength from the
data available for the other babies. Moreover, the model can be used to estimate the data
at birth or during the period before the first measurement. Consider, for example, study
participant 100. The first observation was at 28 days and the last observation was at 196
days from birth. Our model predicts that the z-score for length for this baby was 0.85 on
the first day and 0.49 on day 700 from birth.
Prediction within the range of the observations is also very useful. For example, study
participant 100 had observations 1.27, 0.99, and 1.25 on days 99, 112, and 126, respectively.
The model predicts that the z-scores increased very slowly from day 100 to day 130 from
0.96 to 1.0. In contrast, study participant 112 had observations −1.61, −1.61, −1.74 on
days 98, 113, 125, respectively. The model predicts that during this period the scores for
this baby decreased slowly from −1.52 to −1.56. Note that data for study participants 100
and 112 are not collected on the same days from birth. However, the model allows us to
compare the z-scores on every day from birth. This is also helpful if we are interested in
prediction. Suppose that we would like to predict a specific outcome based on the data up
to day 150 for all babies. It makes sense to use sparse FPCA to produce predictions of the
z-scores for every day between birth and day 150 and then use these inferred trajectories
for prediction of outcomes. This becomes a standard scalar on function regression (SoFR)
or function-on-function regression (FoFR) depending on whether we predict a scalar or
functional outcome. We will discuss these implications in more detail in Chapters 4 and 6,
respectively.
FIGURE 3.18: The first four estimated eigenfunctions when data are simulated as independent normal random variables with mean zero, $W_i(s) \sim N(0, \sigma_0^2)$.
FIGURE 3.19: The first fifty estimated eigenvalues when data are simulated as independent normal random variables with mean zero, $W_i(s) \sim N(0, \sigma_0^2)$.
FIGURE 3.20: First ten $N(\mu_i, \sigma^2)$ pdfs, where $\mu_i \sim U[0, 1]$ and $\sigma = 0.01$. Each color corresponds to a different function.
The functions Wi (s) are stored in an n × p dimensional matrix W, where each row cor-
responds to one individual and each column corresponds to an observation at that location.
PCA was applied to this matrix and the eigenfunctions and eigenvalues were obtained. Fig-
ure 3.21 displays the first four eigenfunctions corresponding to the highest four eigenvalues.
Given the known structure of the simulated data, the principal components do not seem to capture interesting or important patterns. This is supported by the fact that the first eigenfunction explains only 5.7% and the second eigenfunction only 4.9% of the observed variability. The first eigenfunction indicates a peak around 0.4, which is likely due to chance in this particular sample, probably because some bumps happen to be clustered around this value.
Figure 3.22 displays the 50 largest eigenvalues of the matrix W in decreasing order, as a function of the eigenvalue index. Just as in the case of random noise data, the decrease in eigenvalues is slow and close to linear as a function of the eigenvalue index. Note that this problem cannot be solved by smoothing the data (FPCA) because the data are already perfectly smooth. These examples provide potential explanations for the observed behavior of eigenvalues in various practical applications. In general, real data will contain a real signal that can be captured by PCA (typically reflected in fast-decreasing eigenvalues), as well as noise and de-synchronization of the signal, both of which slow down the decrease in eigenvalues. A minimal simulation sketch is provided below.
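The sketch matches the setup in the figure captions: n = 250 curves, each a $N(\mu_i, \sigma^2)$ pdf with $\mu_i \sim U[0, 1]$ and $\sigma = 0.01$, evaluated at 3000 points; the seed is arbitrary.

set.seed(2024)                        # arbitrary seed for reproducibility
n <- 250
p <- 3000
s <- seq(0, 1, length.out = p)
mu <- runif(n)
# n x p matrix: each row is a N(mu_i, 0.01^2) density evaluated on the grid
W <- t(sapply(mu, function(m) dnorm(s, mean = m, sd = 0.01)))
pca_fit <- prcomp(W, center = TRUE)
# proportion of variability explained by the leading components
round(head(pca_fit$sdev^2 / sum(pca_fit$sdev^2), 4), 3)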
FIGURE 3.21: First four eigenfunctions (corresponding to the four largest eigenvalues) estimated from the data simulated as $N(\mu_i, \sigma^2)$ pdfs, where $\mu_i \sim U[0, 1]$ and $\sigma = 0.01$.
FIGURE 3.22: First 50 eigenvalues in decreasing order (from the largest to the smallest) estimated from the data simulated as $N(\mu_i, \sigma^2)$ pdfs, where $\mu_i \sim U[0, 1]$ and $\sigma = 0.01$.
Based on 250 simulated functions, each evaluated at 3000 points.
4
Scalar-on-Function Regression
The basic assumption of FDA is that each observed function is an individual, self-contained
and complete “data point.” This suggests the need to extend standard statistical methods
to include such data; we now consider regression analyses when covariates are functions.
Regression models with functional predictors avoid reducing those predictors to single-
number summaries or treating observed data within each individual function as a separate,
unrelated observation. Instead, this class of regression approaches generally assumes that
the association between the predictor and outcome is smooth across the domain of the
function to (1) account for the correlation structure within the function; (2) control the
change of the association effect for a small change in the functional argument; and (3) allow
for enough flexibility to capture the complex association between the scalar outcome and
functional predictor. Data analysis where outcomes are scalars and some of the predictors
are functions is referred to as scalar-on-function regression (SoFR). This type of model was introduced by [245, 294, 304], while the first use of the SoFR nomenclature can be traced to [251].
Scalar-on-function regression has been under intense methodological development, which
generated many different approaches and publications. Here we will not be able to refer to
all these approaches, but we will point out the many types of applications of SoFR: physical
activity [56, 75], chemometrics [82, 99, 108, 192, 212, 252, 298], crop yield prediction [224],
cardiology [248], intensive care unit (ICU) outcome analysis [93], brain science [102, 124, 182,
188, 253, 339], methylation analysis [174], climate science [12, 83], electroencephalography
[24, 186, 222], simulated earthquake data [11], and continuous glucose monitoring (CGM)
[194], just to name a few. While these papers are referenced here for their specific area of
application, they contain substantial methodological developments that could be explored
in detail. Also, this list is neither comprehensive nor does it relate to all aspects of SoFR
methodology and applications. The overall goal of this chapter is not to explore the vast
array of published methodological tools for SoFR. Instead, we will focus on linear models
where the coefficient function is estimated using a basis expansion. We will emphasize the use of penalized splines, the equivalence of these models with linear mixed effects models for inference, and the flexibility of well-developed software such as refund and mgcv. The objective is to ensure that readers can get started with fitting SoFR models using stable and reproducible software.
In this chapter we describe the general motivation for SoFR, and build intuition using
exploratory analyses and careful interpretation of coefficients. Methods using unpenalized
basis expansions are implemented in traditional software, while estimation and inference for
SoFR using penalized splines is conducted using the refund and mgcv packages.
nhanes_df = 
  readRDS(
    here::here("data", "nhanes_fda_with_r.rds")) %>% 
  mutate(
    death_2yr = ifelse(event == 1 & time <= 24, 1, 0)) %>% 
  select(
    SEQN, BMI, age, gender, death_2yr, 
    MIMS_mat = MIMS, MIMS_sd_mat = MIMS_sd) %>% 
  filter(age >= 25) %>% 
  drop_na(BMI) %>% 
  tibble()
In the next code chunk, we convert MIMS_mat and MIMS_sd_mat to tidyfun objects using tfd(). Note that we use the arg argument in tfd() to be explicit about the grid over which functions are observed: 1/60, 2/60, ..., 1440/60, so that minutes are in 1/60 increments and hours of the day fall on integers from 1 to 24. Here and elsewhere, the use of the tidyfun package is not required, but takes advantage of a collection of tools designed to facilitate data manipulation and exploratory analysis when one or more variables is functional in nature. Indeed, although we will make extensive use of this framework in this chapter, the use of matrices or other data formats is possible.
nhanes_df = 
  nhanes_df %>% 
  mutate(
    MIMS_tf = matrix(MIMS_mat, ncol = 1440),
    MIMS_tf = tfd(MIMS_tf, arg = seq(1/60, 24, length = 1440)),
    MIMS_sd_tf = matrix(MIMS_sd_mat, ncol = 1440),
    MIMS_sd_tf = tfd(MIMS_sd_tf, arg = seq(1/60, 24, length = 1440)))
The next code chunk contains two components. The first component creates a new data frame, nhanes_bin_df, containing average MIMS values in two-hour bins by computing the rolling mean of each MIMS_tf observation with a bin width of 120 minutes, and then evaluating that rolling mean at hours 1, 3, ..., 23. The result is saved as MIMS_binned, and for the next step only BMI and MIMS_binned are retained using select(). The second component of this code chunk fits the regression of BMI on these bin averages. The tf_spread() function produces a wide-format dataframe with columns corresponding to each bin average in the MIMS_binned variable, and the call to lm() regresses BMI on all of these averages. We save the result of the regression in an object called fit_binned.
nhanes_bin_df = 
  nhanes_df %>% 
  mutate(
    MIMS_binned = 
      tf_smooth(MIMS_tf, method = "rollmean", k = 120, align = "center"),
    MIMS_binned = tfd(MIMS_binned, arg = seq(1, 23, by = 2))) %>% 
  select(BMI, MIMS_binned)

fit_binned = 
  lm(BMI ~ ., 
     data = nhanes_bin_df %>% tf_spread(MIMS_binned))
We now show the binned predictors and the resulting coefficients. The first plot generated in the code chunk below shows the MIMS_binned variable for the first 500 rows (other data points are omitted to prevent overplotting). Because MIMS_binned is a tidyfun object, we use related tools for plotting with ggplot. Specifically, we set the aesthetic mapping y = MIMS_binned and use the geometries geom_spaghetti() and geom_meatballs() to show lines and points, respectively. We also set the aesthetic color = BMI to color the resulting plot by the observed model outcome. The second plot shows the coefficients for each bin-averaged MIMS value. We create this plot by tidying the model fit stored in fit_binned and omitting the intercept term. An hour variable is then created by manipulating the coefficient names, and upper and lower 95% confidence bounds for each hour are obtained by adding and subtracting 1.96 times the standard error from the estimates. We plot the estimates as lines and points, and add error bars for the confidence intervals.
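A hedged sketch of these two plots follows; it assumes the tidyfun, dplyr, broom, and ggplot2 packages are loaded, and constructs the hour variable directly rather than from the coefficient names.

# binned MIMS profiles for the first 500 participants, colored by BMI
nhanes_bin_df %>% 
  slice(1:500) %>% 
  ggplot(aes(y = MIMS_binned, color = BMI)) + 
  geom_spaghetti() + 
  geom_meatballs()

# estimated coefficients for each two-hour bin with 95% confidence intervals
fit_binned %>% 
  broom::tidy() %>% 
  filter(term != "(Intercept)") %>% 
  mutate(
    hour  = seq(1, 23, by = 2), 
    lower = estimate - 1.96 * std.error, 
    upper = estimate + 1.96 * std.error) %>% 
  ggplot(aes(x = hour, y = estimate)) + 
  geom_point() + 
  geom_line() + 
  geom_errorbar(aes(ymin = lower, ymax = upper))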
The resulting panels are shown in Figure 4.1. The binned MIMS profiles (left panel) show
expected diurnal patterns of activity, where there is generally lower activity at night and
higher activity during the day. Binning has collapsed some moment-to-moment variability
observed in minute-level MIMS profiles, making some patterns easier to observe; here, we
see a general trend that participants with higher BMI values had lower levels of observed
physical activity during the daytime hours; trends in the morning, evening, and nighttime
are less obvious based on this plot. The results of the regression using bin-averaged MIMS
values as predictors for BMI (right panel) are consistent with these observed trends. Coef-
ficients for bin averages during the day are generally below zero and some (2-hour intervals
between 8-10 AM and 6-8 PM) are statistically significant. Interestingly, coefficients in the
morning and parts of the night are positive, suggesting that higher activity in these times
is associated with higher BMI values. All bin averages are used jointly in this model, so co-
efficients can be interpreted as adjusting for the activity at other times of the day; standard
errors and confidence intervals also reflect the correlation in activity over the course of the
day.
Our choice of 2-hour bins was arbitrary; this may create enough bins to capture changes
in the association between MIMS and BMI over the course of the day without becoming
overwhelming, but other choices are obviously possible. Indeed, a common choice (in physical
activity and other settings) is to average over the complete functional domain, effectively
using one bin. Doing so in this setting produces a single measure of total activity which,
arguably, could suffice to understand associations between activity and BMI. However,
we found that a single average over the day performed notably worse: twelve 2-hour bins produced a fit with an adjusted $R^2$ of 0.0267, while the single average yielded an adjusted $R^2$ of 0.0167. Alternatively, one could use more bins rather than fewer. These models are
relatively easy to implement through adjustments to the preceding code, and the results are
not surprising: as bins become smaller, the coefficients become less similar over time and the
trends are harder to interpret. Even in Figure 4.1, it is not clear if the changes in coefficient
values in the first three bins reflect a true signal or noise. Conceptually, scalar-on-function
regression models are intended to avoid issues like bin size by considering predictors and
coefficients as functions and ensuring smoothness.
We will motivate a shift to scalar-on-function regression by recasting the preceding model based on bin averages. Notationally, let $s_j = j/60$ for $1 \leq j \leq 1440$ be the grid over
FIGURE 4.1: Left panel: NHANES physical activity profiles averaged in 2-hour intervals as
a function of time from midnight (labeled 0) to midnight (labeled 24). Individual trajectories
are colored from low BMI (dark blue) to high BMI (yellow). Right panel: Pointwise and 95%
confidence intervals of the association between average physical activity measured in every
2-hour interval and BMI. All 2-hour intervals are used as regressors in the same model.
which functions are observed and $X_i(s_j)$ be the observed MIMS value for subject $i$ at time $s_j$. Additionally, let $\bar{X}_{ib}$ be the average MIMS value for subject $i$ in bin $1 \leq b \leq 12$. For example,
$$\bar{X}_{i1} = \frac{1}{120}\sum_{j=1}^{120} X_i(s_j)$$
is the average MIMS in the first two hours of the day. Additionally, let $\beta_b$ be the regression coefficient corresponding to bin average $b$ in the model regressing BMI on bin-average MIMS values. A key insight is that
$$\beta_1 \bar{X}_{i1} = \beta_1 \frac{1}{120}\sum_{j=1}^{120} X_i(s_j) = \frac{\beta_1}{2}\cdot\frac{1}{60}\sum_{j=1}^{120} X_i(s_j) \approx \int_0^2 \frac{\beta_1}{2}\, X_i(s)\, ds\,, \qquad (4.1)$$
where the approximation in the last line results from the numeric approximation to the
true integral between hours 0 and 2 of the day. This notation also emphasizes that Xi (s) is
conceptually a continuous function, although it is observed over a discrete grid.
Taking this a step further, define a coefficient function $\beta^{\text{step}}(s)$ over the same domain as the MIMS values through
$$\beta^{\text{step}}(s) = \begin{cases} \beta_1/2\,, & 0 < s \leq 2\,,\\ \beta_2/2\,, & 2 < s \leq 4\,,\\ \;\;\vdots & \\ \beta_{12}/2\,, & 22 < s \leq 24\,. \end{cases} \qquad (4.2)$$
That is, the model using bin averages can be expressed in terms of a specific functional
coefficient and functional predictors by integrating over their product. In this case, the spe-
cific functional coefficient is a step function with step heights equal to half of the coefficient
in the bin average model. The rescaling that converts bin coefficients to the step function
depends on the bin width and on the domain for s (e.g., a two-hour bin on [0, 24] requires
halving the coefficient). Assuming $s \in [0, 24]$ is observed on an equally spaced grid of length 1,440 yields time increments of $1/60$, which are used in the numeric approximation to the integral term. Alternatively, assuming $s \in [0, 1]$ or $s \in [0, 1440]$, both of which could be reasonable in this case, would require some slight modifications to (4.1) and (4.2) that would affect the scale of $\beta(s)$, but not the value of the approximate integrals.
The code chunk below creates a dataframe that contains the coefficient function defined in (4.2). We use the tidied output of fit_binned, again omitting the intercept term and now focusing only on the estimated coefficients in the estimate variable. Using slice(), we repeat each of these values 120 times. The next steps use mutate() to define the method name (a variable that will be used in later sections of this chapter), divide the estimate by 2, and define the arg variable in a way that is consistent with the specification of MIMS_tf in nhanes_df. Finally, we use tf_nest() to collapse the long-form data frame into a dataset containing the coefficient function as a tidyfun object.
stepfun_coef_df = 
  fit_binned %>% 
  broom::tidy() %>% 
  filter(term != "(Intercept)") %>% 
  select(estimate, std.error) %>% 
  slice(rep(1:12, each = 120)) %>% 
  mutate(
    method = "Step",
    estimate = .5 * estimate,
    arg = seq(1/60, 24, length = 1440)) %>% 
  tf_nest(.id = method)
A plot showing the complete (not binned) MIMS_tf trajectories alongside the step coefficient function is shown in Figure 4.2. The difference between this plot and the one shown
in Figure 4.1 is that in Figure 4.1 we took the average of the functions in a bin and then
regressed the outcome on the collection of bin averages. In Figure 4.2 we obtained the fitted
values by integrating the product of predictor functions (left) and the coefficient function
(right); these two approaches are identical up to a re-scaling of the regression parameters.
Connecting the bin average approach to a truly functional coefficient is an intuitive starting
point for the more flexible linear SoFR models considered in the next section.
FIGURE 4.2: Left panel: NHANES physical activity profiles averaged in two-hour intervals
as a function of time from midnight (labeled 0) to midnight (labeled 24). Individual trajec-
tories are colored from low BMI (dark blue) to high BMI (yellow). Right panel: pointwise
estimators of the step-wise association between physical activity and BMI.
regression, and will be useful for introducing key concepts in interpretation, estimation,
and inference for models with functional predictors. Later sections will consider extensions
of this approach.
FIGURE 4.3: Interpretation of functional predictors in SoFR. The first column of panels indicates observed profiles for three individuals, $X_i(s)$. The middle panel displays $\beta_1(s)$ along its domain. The final column of panels indicates the pointwise product $\beta_1(s)X_i(s)$. The shaded area corresponds to $\int_{\mathcal{S}} \beta_1(s)X_i(s)\,ds$. For each shaded area, the number to the right indicates the value of the corresponding integral.
very low levels of activity over the 24 hours of observation. These participants were selected because they have different activity trajectories. The middle panel in Figure 4.3 displays the coefficient $\beta_1(\cdot)$ over its domain; how this estimate was obtained will be discussed in later sections of this chapter. The next column of panels contains the pointwise products $\beta_1(s)X_i(s)$. The shaded areas highlight the integrals $\int_{\mathcal{S}} \beta_1(s)X_i(s)\,ds$, and the numbers to the right are the values of this integral.
The innovation in scalar-on-function regression, compared to nonfunctional models, is
a coefficient function that integrates with covariate functions to produce scalar terms in
the linear predictor. The corresponding challenge is developing an estimation strategy that
minimizes
$$\min_{\beta_0,\, \beta_1(s)}\ \sum_{i=1}^{n}\left\{ Y_i - \beta_0 - \int_{\mathcal{S}} \beta_1(s)X_i(s)\,ds \right\}^2 . \qquad (4.5)$$
Expanding the coefficient function in a spline basis, $\beta_1(s) = \sum_{k=1}^{K} B_k(s)\beta_{1k}$, the expected value of the outcome becomes
$$E[Y_i] = \beta_0 + \int_{\mathcal{S}} \beta_1(s)X_i(s)\,ds = \beta_0 + \sum_{k=1}^{K}\left\{\int_{\mathcal{S}} B_k(s)X_i(s)\,ds\right\}\beta_{1k} = \beta_0 + C_i^t\beta_1\,, \qquad (4.6)$$
where $C_{ik} = \int_{\mathcal{S}} B_k(s)X_i(s)\,ds$, $C_i = [C_{i1}, \ldots, C_{iK}]^t$, and $\beta_1 = (\beta_{11}, \ldots, \beta_{1K})^t$ is the vector of basis coefficients. The result of the basis expansion for the coefficient function, therefore, is a recognizable multiple linear regression with carefully defined scalar covariates and corresponding coefficients. Specifically, let $y = (y_1, \ldots, y_n)^t$, the matrix $X$ be constructed by row-stacking the vectors $[1, C_{i1}, \ldots, C_{iK}]$, and $\beta = [\beta_0, \beta_{11}, \ldots, \beta_{1K}]^t$ be the vector of regression coefficients including the population intercept and spline coefficients. The ordinary least squares solution for $\beta$ is given by minimizing the sum of squares criterion
$$(y - X\beta)^t (y - X\beta)\,. \qquad (4.7)$$
This very familiar expression, which is a practical reframing of (4.5), is possible due to the construction of a design matrix $X$ that is suitable for scalar-on-function regression.
Functional predictors are actually observed over a finite grid, and the definite integrals that define the $C_{ik}$ are in practice estimated using numeric quadrature. In the illustration constraining $\beta_1(s)$ to be a step function, we approximated this integral using a Riemann sum with bin widths (or quadrature weights) equal to $1/60$; in general, we will use
$$C_{ik} \approx \frac{1}{60}\sum_{j=1}^{1440} X_i(s_j)B_k(s_j) \qquad (4.8)$$
throughout this chapter to approximate the integrals needed by our spline expansion. We
reiterate here that the choice of S and the implied quadrature weighting can affect the scale
of the resulting basis coefficient estimates; being consistent in the approach to numeric
integration in model fitting and in constructing subsequent fitted values or predictions for
new observations is critical, and failing to do so is often a source of confusion. We also note that, except in specific cases like the step function approach, the basis coefficients $\beta_{1k}$, $k = 1, \ldots, K$, are not individually interpretable; examining the coefficient function $\beta_1(s)$ provides more insight.
Many options for the basis have been considered in the expansive literature for SoFR.
To illustrate the ideas in this section, we start with a quadratic basis and obtain the corre-
sponding estimate of the coefficient function β1 (s). We define the basis
$$B_1(s) = 1\,, \quad B_2(s) = s\,, \quad B_3(s) = s^2\,, \qquad (4.9)$$
and, given this, obtain scalar predictors Cik that can be used in standard linear model
software. The basis expansion includes an intercept term, which should not be confused
with the model’s intercept, β0 . The intercept in the basis expansion allows the coefficient
function β1 (s) to shift as needed, while the population intercept is the expected value of the
response when the predictor function is zero over the entire domain. Because the intercept
110 Functional Data Analysis with R
in the basis expansion is integrated with the predictor functions to produce Ci1 , the basis
expansion intercept B1 (s) does not introduce identifiability concerns with respect to the
model’s intercept β0 .
Continuing to focus on BMI as an outcome and MIMS as a functional predictor, the code chunk below defines the quadratic basis and obtains the numeric integrals in the $C_{ik}$. For consistency with other code in this chapter, the grid over which functions are observed is called arg and is set to 1/60, 2/60, ..., 1440/60. The basis matrix B is defined in terms of arg and given column names int, lin, quad for the intercept, linear, and quadratic terms. Next, we construct the data frame num_int_df, which contains the numeric integrals. The first step is to multiply the matrix of functional predictors, stored as MIMS_mat in the data frame nhanes_df, by the basis B. Doing so gives $\sum_{j=1}^{1440} X_i(s_j)B_k(s_j)$ for each $i$ and $k$, and multiplying by the quadrature weight $1/60$ produces the numeric integrals defining the $C_{ik}$ using (4.8). We retain the row names of the matrix product (inherited from the MIMS_mat matrix) in the resulting dataframe, and convert this to a numeric variable for consistency with nhanes_df.
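A minimal sketch of the basis definition, consistent with the description above (the objects arg and B and the column names int, lin, quad), is:

# grid of observation points (minutes expressed in hours) and quadratic basis
arg <- seq(1/60, 24, length = 1440)
B <- cbind(int = 1, lin = arg, quad = arg^2)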
num_int_df = 
  as_tibble(
    (nhanes_df$MIMS_mat %*% B) * (1/60), 
    rownames = "SEQN") %>% 
  mutate(SEQN = as.numeric(SEQN))
The next code chunk implements the regression and processes the results. We first define a new data frame, nhanes_quad_df, that contains variables relevant for the scalar-on-function regression of BMI on MIMS trajectories using (4.4) and expand the coefficient function $\beta_1(s)$ in terms of the quadratic basis defined in the previous code chunk. This is created by joining nhanes_df and num_int_df using the variable SEQN as the key to define matching rows. We keep only BMI and the columns corresponding to the numeric integrals $C_{ik}$. Using nhanes_quad_df, we fit a linear regression of BMI on the $C_{ik}$; the formula specification includes a population intercept 1 to reiterate that the model's intercept $\beta_0$ is distinct from the basis expansion's intercept, which appears in $C_{i1}$. Finally, we combine the coefficient estimates in fit_quad with the basis matrix B to obtain the estimate of the coefficient function. For any $s_j \in \mathcal{S}$, $\hat{\beta}_1(s_j) = \sum_{k=1}^{3}\hat{\beta}_{1k}B_k(s_j) = B(s_j)\hat{\beta}_1$, where $B(s_j) = [B_1(s_j), B_2(s_j), B_3(s_j)]$ is the row vector of basis functions evaluated at $s_j$. Let $s = \{s_j\}_{j=1}^{1440}$ be the grid over which functions are observed and $B(s)$ be the $1440 \times 3$ matrix of basis functions evaluated over all entries $s_j \in s$. This is obtained by stacking the $1 \times 3$ dimensional row vectors $B(s_j)$ over $s_j$, $j = 1, \ldots, p = 1440$. If $\hat{\beta}_1(s) = \{\hat{\beta}_1(s_1), \ldots, \hat{\beta}_1(s_p)\}^t$ is the $p \times 1$ dimensional vector of the function evaluated over this grid, then $\hat{\beta}_1(s) = B(s)\hat{\beta}_1$. In the code below, we therefore compute the matrix product of B and the coefficients of fit_quad (omitting the population intercept), and convert the result to a tidyfun object using the tfd() function with the arg parameter defined consistently with other tidyfun objects in this chapter. The coefficient function is stored in a data frame called quad_coef_df, along with a variable method with the value Quadratic.
nhanes_quad_df = 
  left_join(nhanes_df, num_int_df, by = "SEQN") %>% 
  select(BMI, int, lin, quad)

fit_quad = 
  nhanes_quad_df %>% 
  lm(BMI ~ 1 + int + lin + quad, data = .)

quad_coef_df = 
  tibble(
    method = "Quadratic",
    estimate = tfd(t(B %*% coef(fit_quad)[-1]), arg = epoch_arg))
This general strategy for estimating coefficient functions can be readily adapted to other
basis choices. The next code defines a cubic B-spline basis with eight degrees of freedom;
this is more flexible than the quadratic basis, while also ensuring a degree of smoothness
that is absent from the stepwise estimate of the coefficient function. Cubic B-splines are
an appealing general-purpose basis expansion and we use them throughout the book, but
other bases can be useful, depending on the context. For instance, in this application a
periodic basis (e.g., a Fourier or periodic B-spline basis) could be a good choice, since it
would ensure that the coefficient function began and ended at the same value.
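A hedged sketch of the B-spline basis definition used in the next chunk is shown here; the use of intercept = TRUE and the column names BS_1 through BS_8 are assumptions consistent with the text and with the select() call that follows.

library(splines)
# cubic B-spline basis with eight basis functions evaluated on arg
B_bspline <- bs(arg, df = 8, intercept = TRUE)
colnames(B_bspline) <- paste0("BS_", 1:8)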
num_int_df = 
  as_tibble(
    (nhanes_df$MIMS_mat %*% B_bspline) * (1/60), 
    rownames = "SEQN") %>% 
  mutate(SEQN = as.numeric(SEQN))

nhanes_bspline_df = 
  left_join(nhanes_df, num_int_df, by = "SEQN") %>% 
  select(BMI, BS_1:BS_8)

fit_bspline = 
  lm(BMI ~ 1 + ., data = nhanes_bspline_df)

bspline_coef_df = 
  tibble(
    method = "B-Spline",
    estimate = 
      tfd(t(B_bspline %*% coef(fit_bspline)[-1]), arg = epoch_arg))
Once the basis is defined, the remaining steps in the code chunk mirror those used to estimate the coefficient function using a quadratic basis, with a small number of minor changes. The basis is generated using the bs() function in the splines package, and there are now eight basis functions instead of three. There is a corresponding increase in the number of columns in num_int_df, and for convenience we write the formula in the lm() call as BMI ~ 1 + . instead of listing columns individually. The final step in this code chunk constructs the estimated coefficient function by multiplying the matrix of basis functions evaluated over $s$ by the vector of B-spline coefficients; the result is stored in a data frame called bspline_coef_df, now with a variable method taking the value B-Spline. The similarity between this model fitting and the one using a quadratic basis is intended to emphasize that the basis expansion approach to fitting (4.4) can be easy to implement for a broad range of basis choices.
We show how to display all coefficient function estimates in the next code chunk. The first step uses bind_rows() to combine the data frames containing the stepwise, quadratic, and B-spline estimated coefficient functions. The result is a dataset with three rows, one for each estimate, and two columns containing the method and estimate variables. Because estimate is a tidyfun object, we plot the estimates using ggplot() and geom_spaghetti() by setting the aesthetics for y and color to estimate and method, respectively.

bind_rows(stepfun_coef_df, quad_coef_df, bspline_coef_df) %>% 
  ggplot(aes(y = estimate, color = method)) + 
  geom_spaghetti(alpha = 1, linewidth = 1.2)
In the resulting plot, shown in Figure 4.4, the coefficient functions have some broad
similarities across basis specifications. That said, the quadratic basis has a much higher
estimate in the nighttime than other methods because of the constraints on the shape of
the coefficient function. The stepwise coefficient has the bin average interpretation developed
in Section 4.1, but the lack of smoothness across bins is scientifically implausible. Of the
coefficients presented so far, then, the B-spline basis with eight degrees of freedom is our
preference as a way to include both flexibility and smoothness in the estimate of β1 (·). In
this case, diagnostic metrics also slightly favor the B-spline approach: the adjusted $R^2$ for the B-spline model is 0.0273, compared to 0.0267 for the stepwise coefficient and 0.0252 for the quadratic basis fit.
The degree of smoothness is closely connected to the choice of degrees of freedom in
this approach. While it is possible (and in some cases useful) to explore this choice using
FIGURE 4.4: Estimates of the coefficient function $\beta_1(s)$ in equation (4.4) obtained using a B-spline basis with eight degrees of freedom (purple), a quadratic basis (green), and a step function with two-hour bins (yellow). The outcome is BMI and the predictors are the MIMS profiles during the day.
traditional techniques for model selection in linear models, we will next incorporate explicit
smoothness constraints in a penalized likelihood framework for estimation.
Let $X$ be an $n \times (K + 1)$ matrix in which the $i$th row is $[1, C_{i1}, \ldots, C_{iK}]$, and let $\beta$ be the $(K + 1)$ dimensional column vector that concatenates the population intercept $\beta_0$ and the spline coefficients $\beta_1$. Adding the second derivative penalty to the minimization criterion in (4.7) yields a new penalized sum of squares
$$(y - X\beta)^t(y - X\beta) + \lambda\, \beta^t D\, \beta\,, \qquad D = \begin{bmatrix} 0_{1\times 1} & 0_{1\times K} \\ 0_{K\times 1} & P \end{bmatrix}, \qquad (4.11)$$
where $0_{a\times b}$ is a matrix of zero entries with $a$ rows and $b$ columns and $P$ is the matrix of integrals of products of second derivatives of the basis functions. This notation and specification intentionally mimics what was used for penalized scatterplot smoothing in Section 2.3.2, and many of the same insights can be drawn. For fixed values of $\lambda$, a closed-form solution exists for $\hat{\beta}$. Varying $\lambda$ from 0 to $\infty$ will induce no penalization and full penalization,
respectively, and choosing an appropriate tuning parameter is an important practical chal-
lenge. As elsewhere in this chapter, though, we emphasize that the familiar form in (4.11)
should not mask the novelty and innovation of this model, which implements penalized
spline smoothing to estimate the coefficient function in a scalar-on-function regression.
We illustrate these ideas in the next code chunk, which continues to use BMI as an outcome and MIMS as a functional predictor. The code draws on elements that have been seen previously. We first define a B-spline basis with 30 degrees of freedom evaluated over the finite grid arg, previously defined to take values 1/60, 2/60, ..., 1440/60. Using functionality in the splines2 package, we obtain the second derivative of each spline basis function evaluated over the same finite grid. The elements of the penalty matrix $P$ are obtained through numeric approximations to the integrals, $\frac{1}{60}\sum_{j=1}^{1440} B_k''(s_j)B_l''(s_j)$. The design matrix $X$ is obtained by adding a column taking the value 1 everywhere to the terms $C_{ik}$ given by the numeric integration of the predictor functions and the spline basis using (4.8). Next, we construct the matrix $D$ by adding a row and column taking the value 0 everywhere to the penalty matrix $P$. The response vector $y$ is extracted from nhanes_df and we choose high and low values for the tuning parameter $\lambda$. Given all of these elements, we estimate the coefficient vector $\beta$ through $\hat{\beta} = (X^t X + \lambda D)^{-1}X^t y$ using the pre-selected values of $\lambda$ to obtain coef_high and coef_low. We note that these include estimates of the population intercept as well as the spline coefficients.
y = nhanes_df$BMI
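A hedged sketch of the remaining steps described above follows; the object names B_pen and B_deriv and the two values of lambda are illustrative assumptions.

library(splines2)
# 30-basis-function cubic B-spline and its second derivatives on the grid arg
B_pen   <- bSpline(arg, df = 30, intercept = TRUE)
B_deriv <- dbs(arg, derivs = 2, df = 30, intercept = TRUE)
# penalty matrix: numeric approximation to integrals of products of second derivatives
P <- (1 / 60) * t(B_deriv) %*% B_deriv
# design matrix: intercept column plus numeric integrals C_ik from (4.8)
C <- (nhanes_df$MIMS_mat %*% B_pen) * (1 / 60)
X <- cbind(1, C)
# D adds a zero row and column to P so that the intercept is not penalized
D <- rbind(0, cbind(0, P))
lambda_low  <- 1e-2   # illustrative low penalization
lambda_high <- 1e4    # illustrative high penalization
coef_low  <- solve(t(X) %*% X + lambda_low  * D, t(X) %*% y)
coef_high <- solve(t(X) %*% X + lambda_high * D, t(X) %*% y)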
FIGURE 4.5: Estimates of the coefficient function $\beta_1(s)$ in equation (4.4) using a penalized B-spline basis with 30 basis functions, under low penalization (green) and high penalization (yellow). The outcome is BMI and the predictors are the MIMS profiles during the day.
The estimated coefficient functions that correspond to the estimates in coef high and
coef low can be produced through simple modifications to the previous code, so we omit
this step. Figure 4.5 shows the resulting coefficient functions. Most strikingly, the choice of
λ has a substantial impact on the estimated coefficient function. With 30 basis functions
and low penalization, the estimated coefficient function (shown in green) is indeed very
wiggly – there is a high spike at the beginning of the domain and rapid oscillations over
the day. The more highly penalized estimate (shown in yellow), meanwhile, varies smoothly
from values above zero in the evening hours and below zero during the day. These can be
compared to the coefficient functions seen in Figure 4.4; in general, all coefficient functions
suggest temporal variation in the association between BMI and MIMS values, but the model
specification and choice of tuning parameter significantly impacts the resulting estimates.
Recasting the penalized sum of squares in (4.11) as a mixed model allows the data-
driven estimation of tuning parameters; more broadly, this opens the door to using a wide
range of methods for mixed model estimation and inference in functional regression settings.
Using an approach similar to that in Section 2.3.3, we note that (4.11) can be re-written
as a maximization problem. First, we will separate the population intercept from spline
coefficients; let 1n be a column of length n containing the value 1 everywhere and C be the
$n \times K$ dimensional matrix constructed by row-stacking the vectors $[C_{i1}, \ldots, C_{iK}]$. Letting $\lambda = \sigma_\epsilon^2/\sigma_{\beta_1}^2$ and multiplying by $\frac{-1}{2\sigma_\epsilon^2}$, we now have the objective
$$-\frac{1}{2\sigma_\epsilon^2}\,(y - 1_n\beta_0 - C\beta_1)^t(y - 1_n\beta_0 - C\beta_1) - \frac{1}{2\sigma_{\beta_1}^2}\,\beta_1^t P\,\beta_1\,.$$
Although we initially developed our objective function as a penalized sum of squares, the same objective arises through the use of maximum likelihood estimation for the model
$$[y \mid \beta_0, \beta_1, \sigma_\epsilon^2] = N\!\left(1_n\beta_0 + C\beta_1,\ \sigma_\epsilon^2 I_n\right); \qquad
[\beta_1 \mid \sigma_{\beta_1}^2] = \frac{\det(P)^{1/2}}{(2\pi)^{K/2}\,\sigma_{\beta_1}^{K}}\,\exp\!\left(-\frac{\beta_1^t P\,\beta_1}{2\sigma_{\beta_1}^2}\right). \qquad (4.13)$$
As elsewhere, we are using the notation [y|x] to indicate the conditional probability density
function of y given x. Using restricted maximum likelihood estimation (REML) for (4.13), we estimate the spline coefficients as best linear unbiased predictors; estimate the tuning parameter $\lambda$ by estimating the variance components $\sigma_\epsilon^2$ and $\sigma_{\beta_1}^2$; and conduct inference using the mixed effects model framework.
We make a few brief comments on the degeneracy of the conditional distribution $[\beta_1 \mid \sigma_{\beta_1}^2]$.
In the model construction we have described, the penalty matrix P may not be full rank.
Intuitively, the issue is that we only penalize departures from a straight line (because all
straight lines have a second derivative that is zero everywhere), but our basis spans a space
that includes straight lines. There are some solutions to this problem. First, one can mumble
vaguely about “uninformative priors” and otherwise ignore the issue; this works quite well
in practice but can raise eyebrows among more detail-oriented peers and book coauthors.
A second ad hoc solution is to construct the penalty matrix (αIK×K ) + P, where α is a
small value. The resulting penalty is no longer exactly a second derivative penalty because
it includes some overall shrinkage of spline coefficients, but it is full rank. The third solution
is to extract the intercept and linear terms from the basis and the penalty matrix to create
unpenalized and penalized terms that are equivalent to the original basis and penalty. The
technical details for this solution are beyond the scope of this book, but can be found in
[258, 319]. We first address this issue in Section 2.3.2, and refer to it in multiple chapters
throughout the book.
Recognizing that scalar-on-function regression with a penalty on the second derivative of
the coefficient function can be expressed as a standard mixed model allows generic tools for
estimation and inference to be applied in this setting. Below, we construct the design matrix
C using the numeric integration in (4.8) as elsewhere in this chapter. The penalty matrix
P, which contains the numeric integral of the squared second derivative of the basis func-
tions, is reused from a prior code chunk. We pass these as arguments into mgcv::gam() by
specifying C in the formula that defines the regression structure, and then use the paraPen
argument to supply our penalty matrix P for the design matrix C. Lastly, we specify the
estimation method to be REML. We omit code that multiplies the basis functions by the
resulting spline coefficients to obtain the estimated coefficient function, but plot the result
below. Previous penalized estimates, based on high and low values of the tuning parameter
λ, over- and under-smoothed the coefficient function; the data-driven approach to tuning
parameter selection yields an estimate that is smooth but time-varying.
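A hedged sketch of this fit, using the objects B_pen, P, and y from the previous sketch, follows; the paraPen argument supplies the penalty matrix for the parametric term C.

library(mgcv)
# numeric integrals C_ik, as in (4.8), using the 30-basis-function B-spline
C <- (nhanes_df$MIMS_mat %*% B_pen) * (1 / 60)
# penalized fit with REML selection of the tuning parameter
gam_fit <- gam(y ~ C, paraPen = list(C = list(P)), method = "REML")
# estimated coefficient function on the observation grid
beta1_hat <- B_pen %*% coef(gam_fit)[-1]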
FIGURE 4.6: Estimates of the coefficient function $\beta_1(s)$ in equation (4.4) using a penalized B-spline basis with 30 basis functions, under low penalization (green), high penalization (yellow), and REML-based penalization (purple). The outcome is BMI and the predictors are the MIMS profiles during the day.
package, which adds functionality, quality-of-life features, and user interfaces relevant to
FDA. For SoFR, this means that instead of building models using knowledge of the linear
algebra underlying penalized spline estimation, we instead only require correctly specified
data structures and syntax for refund::pfr(). In the code chunk below, we regress BMI on MIMS, supplied as the matrix variable MIMS_mat, using the linear functional term lf(). We additionally indicate the grid over which predictors are observed, and specify the use of
REML to choose the tuning parameter. Among other things, the pfr() function organizes
data so that model fitting can be performed by calls to gam(); more details are provided in
Section 4.5. The next component of this code chunk extracts the resulting coefficient and
structures it for plotting.
pfr_fit = 
  pfr(
    BMI ~ lf(MIMS_mat, argvals = seq(1/60, 24, length = 1440)),
    method = "REML", data = nhanes_df)

pfr_coef_df = 
  coef(pfr_fit) %>% 
  mutate(method = "refund::pfr()") %>% 
  tf_nest(.id = method, .arg = MIMS_mat.argvals) %>% 
  rename(estimate = value)
Figure 4.7 displays results obtained using mgcv::gam() and refund::pfr(); there are
some minor differences in the default model implementations and these results do not align
perfectly, although they are very similar and can be made exactly the same. For reference,
we also show the coefficient function based on an unpenalized B-spline basis with eight
degrees of freedom.
However, it is worth pausing and reflecting on the simplicity of the scalar-on-function implementation in refund::pfr(). Indeed, all one needs to do is to
FIGURE 4.7: Estimates of the coefficient function β1 (s) in equation (4.4) using penalized
B-splines implemented in mgcv::gam() and refund::pfr(), as well as an unpenalized B-
spline with eight degrees of freedom. The outcome is BMI and the predictors are the MIMS
profiles during the day.
input the outcome vector, BMI in our case, and the predictor matrix, MIMS_mat in our case, which contains the observed functions with one row per study participant. As we will show,
this implementation easily expands to other types of outcomes, additional scalar covariates,
and additional functional predictors. Moreover, the outcome of this fit, pfr fit, contains
all the ingredients to extract model estimates and conduct inference, as we will show in
this chapter. In some sense, this is a culmination of research conducted by thousands of researchers. Our major accomplishment here was to transform SoFR into a penalized spline regression, make the connection between SoFR and semiparametric regression, embed this into a mixed effects model framework, and identify the software elements that work well together and correspond to the correct inferential framework. This is not the only way to conduct SoFR, not by a long shot. But it is one of the best ways we know to do it. We
hope that this will be helpful to many others.
where β0 (s) is apopulation mean function, the φk (s) are orthonormal eigenfunctions, and
the scores ξik = S {Xi (s) − β0 (s)} φk (s)ds are mean-zero random variables. We will use the
Scalar-on-Function Regression 119
K
same basis to express the coefficient function, so that β1 (s) = k=1 β1k φk (s). One could,
at this point, use numeric approximations to the second derivatives of the φk (s) to pursue
the penalized estimation techniques developed in Section 4.2.3. Instead, it is common to
assume that the largest directions of variation in the Xi (s) are also the most relevant for the
outcome Yi . This suggests selecting a truncation level K and proceed as in Section 4.2.2.
More specifically, we will use existing approaches for FPCA to define the basis
φ1(s), . . . , φK(s). Next, we will let Cik = ∫S Xi(s)φk(s) ds and Ci = [Ci1, . . . , CiK]t, so
that estimating basis coefficients β1 = [β11 , . . . , β1K ]t relies on OLS estimation using the
Ci as covariates. We again use numeric integration to obtain Cik , but first recall an is-
sue raised in Chapter 3: many implementations of FPCA implicitly use a quadrature
weight of 1 and return a basis that needs to be rescaled to have the correct properties
under numeric integration. In our case, let s = {sj}_{j=1}^{1440} = {1/60, 2/60, . . . , 1440/60} be the finite
observation grid. For an FPCA implementation that returns the basis φ∗1(s), . . . , φ∗K(s)
such that Σ_{j=1}^{1440} φ∗k(sj)φ∗k(sj) = 1 for all k, we define φk(s) = φ∗k(s)√60 so that
(1/60) Σ_{j=1}^{1440} φk(sj)φk(sj) = 1. This basis can be used analogously to the quadratic and B-
spline bases seen in Section 4.2.2.
The code chunk below implements the scalar-on-function regression of BMI on MIMS us-
ing a data-driven basis. In the first lines of code, we use the function refunder::rfr fpca()
to conduct FPCA. This function is under active development, but takes a tf vector as an
input; for data observed over a regular grid, this serves as a wrapper for fpca.face(). We
specify npc = 4 to return K = 4 principal components. The remainder of this code chunk
is essentially copied from Section 4.2.2. Using naming conventions similar to previous code,
we define B fpca by extracting efunctions from nhanes fpca, and rescaling them to have
the desired numerical properties. Given this basis, we compute numerical integrals to de-
fine the Cik ; merge the resulting dataframe with nhanes df; retain only the outcome BMI
and predictors efunc 1:efunc 4; and fit the regression. We save this as fit fpcr int to
indicate that this conducts FPCR using numeric integration to obtain the covariates Cik .
We omit code that multiplies the basis coefficients and the basis expansion to produce the
estimated coefficient function, since this is identical to code seen elsewhere.
nhanes_fpca =
  rfr_fpca("MIMS_tf", data = nhanes_df, npc = 4)

num_int_df =
  as_tibble(
    (nhanes_df$MIMS_mat %*% B_fpca) * (1/60),
    rownames = "SEQN") %>%
  mutate(SEQN = as.numeric(SEQN))

nhanes_fpcr_df =
  left_join(nhanes_df, num_int_df, by = "SEQN") %>%
  select(BMI, efunc_1:efunc_4)
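The chunk above relies on the rescaled basis matrix B_fpca and stops short of the regression fit described in the text. A minimal sketch of those two steps is given below; the rescaling would precede the chunk, and the column naming and the lm() call are assumptions.

# Rescale the eigenfunctions so that numeric integration with quadrature
# weight 1/60 yields an orthonormal basis (assumed construction)
B_fpca = nhanes_fpca$efunctions * sqrt(60)
colnames(B_fpca) = paste0("efunc_", 1:4)

# Fit the FPCR model on the numeric-integration covariates (assumed call)
fit_fpcr_int = lm(BMI ~ ., data = nhanes_fpcr_df)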
We used numerical integration to obtain the Cik in the model exposition and example
code, but an important advantage of FPCR is that the principal component scores themselves
can serve as covariates. Indeed,

E[Yi] = β0 + ∫S β1(s)Xi(s) ds
      = β0 + ∫S β1(s)β0(s) ds + ∫S β1(s){Xi(s) − β0(s)} ds          (4.15)
      = β0∗ + Σ_{k=1}^{K} [∫S φk(s){Xi(s) − β0(s)} ds] β1k
      = β0∗ + ξit β1,

where ξi = (ξi1, . . . , ξiK)t, β1 = (β11, . . . , β1K)t, and β0∗ = β0 + ∫S β1(s)β0(s) ds is the
population intercept when the covariate functions Xi(s) are centered.
The fact that FPCR can be carried out using a regression on FPC scores directly is
a key strength. There are many practical settings where the numeric integration used to
construct the design matrices throughout this chapter – for pre-specified and data-driven
basis expansions – is not possible. For functional data that are sparsely observed or that
are measured with substantial noise, numeric integration can be difficult or impossible.
In both those settings, FPCA methods can produce estimates of eigenfunctions and the
associated scores and thereby enable scalar-on-function regression in a wide range of real-
world settings. At the other extreme, for very high-dimensional functional observations,
it may be necessary to conduct dimension reduction as a pre-processing step to reduce
memory and computational burdens. The FPCR gives an interpretable scalar-on-function
regression in this setting as well. That said, because FPCR is a regression on FPC scores,
only the effects of Xi (s) on Yi that are captured by the directions of variation contained
in the φ1 (s), . . . , φK (s) functions can be accounted for using this approach. Moreover, the
smoothing of the estimated coefficient function depends on the intrinsic choice of the number
of eigenfunctions, K. This tends to be less problematic when one is interested in prediction
performance, but may have large effects on the estimation of the β1 (s) coefficient.
The code chunk below re-implements the previous FPCR specification. Because the underlying
FPCA implementation scaled the eigenfunctions φ∗k(s) using quadrature weights of 1, we first
need to appropriately rescale the principal component scores. Let ξ∗ik = ∫S {Xi(s) − β0(s)} φ∗k(s) ds
be the scores based on the incorrectly scaled eigenfunctions. Multiplying both ξ∗ik and φ∗k(s) by
√60 addresses the scaling of the eigenfunctions; additionally, multiplying by the correct quadrature
weight 1/60 produces scores ξik on the right scale. This scaling process is necessary whether the
FPCA method uses numeric integration for the ξ∗ik or estimates them using BLUPs in a mixed
model. Indeed, both approaches are built around incorrectly scaled eigenfunctions, φ∗k(s), and
need to account for the quadrature weight used for numeric integration to obtain fitted values
from predictor functions.
As noted above, being careful about weights for numeric integration can be a point of con-
fusion in SoFR; small inconsistencies can produce coefficient functions with similar shapes
but very different scales, with corresponding differences in the fitted values. We have made
these mistakes too many times, and we hope that others will benefit from our experience
and avoid them.
In the code below, we extract scores from the FPCA object nhanes fpca obtained
in a previous code chunk. Mirroring code elsewhere, we create a dataframe containing the
scores; merge this with nhanes df and retain BMI and the predictors of interest; and fit
a linear model, storing the results as fit fpcr score to reflect that we have performed
FPCR using score estimates. A table showing coefficient estimates from fit fpcr int and
fit fpcr score is shown after this code chunk. As expected, the intercepts from the two
models differ – one is based on centered Xi (s) covariate functions and the other is not –
but basis coefficients are nearly identical.
nhanes_score_df =
  as_tibble(
    C, rownames = "SEQN") %>%
  mutate(SEQN = as.numeric(SEQN))

nhanes_fpcr_df =
  left_join(nhanes_df, nhanes_score_df, by = "SEQN") %>%
  select(BMI, score_1:score_4)
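The matrix C of rescaled scores used above and the model fit itself are not displayed. A minimal sketch, assuming the scores are extracted from nhanes_fpca via $scores and rescaled as described earlier, is:

# Rescale the FPC scores: multiply by sqrt(60) and by the quadrature weight 1/60
C = nhanes_fpca$scores[, 1:4] * sqrt(60) / 60
rownames(C) = nhanes_df$SEQN
colnames(C) = paste0("score_", 1:4)

# Fit the FPCR model using the score covariates (assumed call)
fit_fpcr_score = lm(BMI ~ ., data = nhanes_fpcr_df)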
FIGURE 4.8: Estimates of the coefficient function β1 (s) in equation (4.4) using FPCR with
the first four (green) and twelve (purple) smooth eigenfunctions of the predictor space.
Also shown is the estimated coefficient using a piece-wise constant spline every two hours
(yellow). The outcome is BMI and the predictors are the MIMS profiles during the day.
estimate can be very large, as increasing or decreasing the number of components by one
can significantly impact the shape of the estimated functional parameter.
Yet another approach could be to start with a large value for K and use a variable
selection criterion that retains only the PCs that are predictive of the response. Such criteria
could be p-values, cross-validation, or adjusted R-squared. An important drawback of these
approaches is that they become much more complicated when one incorporates additional
functional or scalar covariates, random coefficients, non-parametric components, and non-
Gaussian outcomes. It is not the complexity of an individual component, but the joint
complexity of all components of the method that reduces the overall appeal of these methods.
The expression in (4.14) requires some additional discussion. Throughout this chapter we
have worked with the notation Xi (s), which typically refers to the underlying true functional
observation. Functional data are often measured with noise and the observed functional
process is denoted by Wi (s). The standard functional model connecting the observed and
true underlying processes is Wi(s) = Xi(s) + ϵi(s), where ϵi(s) is mean-zero noise,
which raises methodological questions about how and whether to account for measurement
error in functional predictors. This issue is not limited to data-driven basis expansions,
although it arises naturally in this setting. There can also be debate about whether a smooth,
unobserved Xi (s) should be considered the “true” predictor instead of the observed data
in Wi (s). In practice, the most popular strategies have been to (1) ignore the measurement
error and induce smoothness in the functional coefficient; or (2) to pre-smooth the observed
functions Wi(s) using FPCA or another smoothing method applied to each predictor. Throughout
much of this chapter, we have taken the first approach by using observed MIMS trajectories
to construct necessary model terms. Effectively, this strategy assumes that the measurement
error does not substantially impact the numeric integration and trusts that smoothness in
the basis expansion for the coefficient function is sufficient. The use of a data-driven basis
arguably pre-smooths functional predictors: although observed data are used to estimate
FPC scores, only those scores are included as predictors and any variation in the functional
predictor not accounted for in the first K FPCs is omitted. For more formal treatments of
addressing measurement error in scalar-on-function regression see, for example, [290, 291,
338].
which casts inference for β1 (s) in terms of inference for the estimated basis coefficients, β1 .
This is very useful because β1(s) is infinite dimensional, whereas the basis coefficient
vector β1 is finite dimensional and typically of low dimension.
When using a fixed basis and no penalization, the resulting inference is familiar from
usual linear models. After constructing the design matrix X and estimating all model coeffi-
cients, β1, using ordinary least squares and the error variance, σ2, based on model residuals,
Var(β1) can be extracted as Var(β1) = σ2(Xt X)−1. Indeed, this can be quickly illustrated
using previously fit models; in the code chunk below, we use the vcov() function to obtain
Var(β1) from fit fpcr int, the linear model object for FPCR with K = 4. We remove
the row and column corresponding to the population intercept, and then pre- and post-
multiply by the FPCA basis matrix B(s) stored in B fpca. The resulting covariance matrix
is 1440 × 1440, and has the variances Var{β1 (s)} for each value in the observation grid
s ∈ s on the main diagonal. The final part of the code chunk creates the estimate β1 (s) as
a tf object, uses similar code to obtain the pointwise standard error (as the square root of
entries on the diagonal of the covariance matrix), and constructs upper and lower bounds
for the 95% confidence interval.
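A minimal sketch of the covariance construction just described, which produces the var_coef_func object used in the next chunk (the intermediate object name is an assumption):

# Covariance of the basis coefficients, dropping the intercept row and column
var_coef = vcov(fit_fpcr_int)[-1, -1]
# 1440 x 1440 pointwise covariance of the estimated coefficient function
var_coef_func = B_fpca %*% var_coef %*% t(B_fpca)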
fpcr_inf_df =
  tibble(
    method = c("FPCR: 4"),
    estimate = tfd(t(B_fpca %*% coef(fit_fpcr_int)[-1]), arg = epoch_arg),
    se = tfd(sqrt(diag(var_coef_func)), arg = epoch_arg)
  ) %>%
  mutate(
    ub = estimate + 1.96 * se,
    lb = estimate - 1.96 * se)
The process for obtaining confidence intervals for penalized spline estimation is con-
ceptually similar. Again, inference is built on (4.17), and the primary difference is in the
calculation of Var(β1 ). The necessary step is to perform inference on fixed and random
effects in a mixed model. While the details are somewhat beyond the scope of this text, for
Gaussian outcomes and fixed variance components (or, equivalently, fixed tuning parame-
ters), a closed form expression for the covariance matrix is available. Earlier in this chapter,
we made use of the connection between penalized spline estimation for scalar-on-function
regression and mixed models. Inference can similarly be conducted directly via the connec-
tion to mixed effects models using appropriate software, and we pause to appreciate the
impact of coupling approaches to estimation and inference in SoFR with high-performance
implementations for mixed effects models. In later sections and chapters, we will explore
this connection in detail; for now, we will use the helpful wrapper pfr() for inference.
Indeed, the object pfr coef df, obtained in a previous code chunk using coef(pfr fit),
already includes a column se containing the pointwise standard error. In the code chunk
below, we use this column to construct upper and lower bounds of a 95% confidence interval.
pfr_inf_df =
  pfr_coef_df %>%
  mutate(
    ub = estimate + 1.96 * se,
    lb = estimate - 1.96 * se)
Our next code chunk will plot the estimates and confidence intervals created in the
previous code. We combine the dataframes containing estimates and confidence bounds
for the penalized spline and FPCR methods contained in pfr inf df and fpcr inf df,
respectively, which corresponds to using K = 4 PCs. In code not shown, the dataframe
fpcr inf df used to construct this plot was updated to also include estimates and confi-
dence bounds for FPCR using K = 8 and K = 12, respectively. The result is passed into
ggplot(), where we set the aesthetic y = estimate to plot the estimated coefficient func-
tions using geom spaghetti(). The next line uses geom errorband() to plot the confidence
band; this requires the aesthetics ymax and ymin, which map to columns ub and lb in our
dataframe. Finally, we use facet grid() to create separate panels based on the estimation
approach.
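A minimal sketch of the plotting code described in this paragraph (the use of bind_rows() and the exact aesthetics are assumptions):

bind_rows(fpcr_inf_df, pfr_inf_df) %>%
  ggplot(aes(y = estimate)) +
  geom_spaghetti() +
  geom_errorband(aes(ymax = ub, ymin = lb), alpha = 0.2) +
  facet_grid(. ~ method) +
  labs(x = "Time of day (hours)", y = "Coefficient")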
Figure 4.9 shows the estimates and confidence intervals for the penalized spline and
FPCR methods using K = 4, 8, and 12 FPCs in separate panels. We saw in Figure 4.8
that the shape of coefficient functions estimated using FPCR depends on the choice of K,
which serves as a tuning parameter; we now see that the corresponding confidence intervals
are sensitive to this choice as well. When K = 4, the 95% confidence bands are narrow and
exclude zero almost everywhere, suggesting strong associations between the predictor and
response over the functional domain. Meanwhile, when K = 12 the confidence bands are
wide and include zero except in a few regions. Penalized splines with data-driven tuning
parameter selection avoid this sensitivity and lead to stable, reproducible results. The
estimate and confidence band from refund::pfr()
indicate significant negative associations between MIMS and BMI in the mid-morning
and evening, as well as significant positive associations in early morning. These findings
are qualitatively consistent with results obtained using other models, including the binned
regression approach and several choices of K in FPCR implementations. Our experience is
that the penalized spline model has good inferential properties without the need to select
important tuning parameters by hand, and we generally recommend this approach.
FIGURE 4.9: Estimated coefficient function β1 (s) in equation (4.4) and confidence intervals
obtained using refund::pfr() (top right panel) and FPCR based on the first four, eight,
and twelve eigenfunctions (remaining panels). The outcome is BMI and the predictors are
the MIMS profiles during the day.
where Zi is a Q × 1 dimensional vector of scalar covariates for subject i and γ is the vector
of associated coefficients. For non-penalized approaches, the design matrix X appearing
in (4.7) is augmented so that it contains an intercept, columns for the scalar predictors,
and numeric integrals stemming from the basis expansion. This can be fit using ordinary
least squares, and is easy to implement using lm(). For penalized approaches, we similarly
add columns containing scalar predictors to the design matrix and avoid penalization by
expanding the 0 entries of the matrix D in (4.11). Inference for scalar covariates is analogous
to that in non-functional settings, and relies on entries on the diagonal of the covariance
matrix of all coefficients.
The code chunk below regresses BMI on MIMS mat as a functional predictor and age and
gender as scalar covariates using pfr(); recall that pfr() expects functional predictors to
be structured as matrices. Except for the addition of age and gender in the formula, all
other elements of the model fitting code have been seen before. Moreover, all subsequent
steps build directly on previous code chunks. Functional coefficients and standard errors can
be extracted using coef and combined to obtain confidence intervals, and then plotted using
tools from tidyfun or other graphics packages. Estimates and inference for non-functional
coefficients can be obtained using summary() on the fitted model object. Each of these steps
is straightforward enough that they are omitted here.
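A minimal sketch of the fit described above (the object name pfr_adj_fit is hypothetical):

pfr_adj_fit =
  pfr(
    BMI ~ age + gender + lf(MIMS_mat, argvals = seq(1/60, 24, length = 1440)),
    method = "REML", data = nhanes_df)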
where Xir (s), 1 ≤ r ≤ R are functional predictors observed over domains Sr with associ-
ated coefficient functions βr (s). The interpretation of the coefficients here is similar to that
elsewhere; it is possible to interpret the effect of Xir (s) on the expected value of Yi through
the coefficient βr (s), keeping all other functional and non-functional predictors fixed. As
when adding scalar predictors to the “simple” scalar-on-function regression model in Sec-
tion 4.4.1, the techniques described previously can be readily adapted to this setting. Each
coefficient function βr (s) can be estimated using any of the above approaches by creating an
appropriate design matrix Xr . If more than one coefficient function is penalized, it is neces-
sary to construct a block diagonal penalty matrix D with diagonal elements that implement
that penalty and estimate the tuning parameter for each coefficient function. This structure
is analogous to the penalized approach to multivariate regression splines that appears in
equation (2.14) described in Section 2.3.1.3.
We will again use pfr() to implement an example of model (4.19). The code chunk be-
low regresses BMI on scalar covariates age and gender. The functional predictor MIMS mat
is familiar from many analyses in this section and is used here. We add the functional pre-
dictor MIMS sd mat, which is the standard deviation of MIMS values taken across several
days of observation for each participant, also stored as a matrix. MIMS sd mat is included
in the formula specification exactly as MIMS mat has been in previous code chunks, and
the subsequent extraction of estimates and inference is almost identical – with the key
FIGURE 4.10: The left and right panels show coefficient functions for MIMS mat and
MIMS sd mat, respectively, in a model with both functional predictors, BMI as an outcome,
and adjusting for age and gender. Coefficients are estimated using penalized splines through
pfr().
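A minimal sketch of the two-functional-predictor fit shown in Figure 4.10 (the object name is hypothetical):

pfr_two_fp_fit =
  pfr(
    BMI ~ age + gender +
      lf(MIMS_mat, argvals = seq(1/60, 24, length = 1440)) +
      lf(MIMS_sd_mat, argvals = seq(1/60, 24, length = 1440)),
    method = "REML", data = nhanes_df)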
where g(·) is the logit link function. Model parameters can be interpreted as log odds ratios or
exponentiated to obtain odds ratios. Expanding the coefficient function in terms of a basis
expansion again provides a mechanism for estimation and inference. Rather than minimizing
a sum of squares as in (4.7) or the penalized equivalent in (4.11), one now minimizes a
(penalized) log likelihood. Tuning parameters for penalized approaches can be selected in
a variety of ways, but we continue to take advantage of the connection between roughness
penalties and mixed model representations. That is, by viewing spline coefficients for β1 (s)
as random effects, we can estimate the degree of smoothness as a variance component and
estimate it using the associated mixed effects framework. We therefore can take advantage of
standard mixed model implementations to fit a broad range of scalar-on-function regression
models. Non-penalized approaches can be fit directly using glm(), while penalized models
are available through mgcv::gam() or pfr().
In the code chunk below, we fit a logistic scalar-on-function regression in the NHANES
dataset. Our binary outcome is two-year mortality, with the value 0 indicating that the
participant survived two years after enrollment. We adjust for age, gender, and BMI, and
focus on MIMS as a functional predictor. The model specification using pfr() sets the
argument family to binomial(), but other aspects of this code are drawn directly from
prior examples. Extracting estimated coefficient functions and conducting inference can be
accomplished using the coef() function to obtain estimates and pointwise standard errors.
Post-processing to construct confidence intervals is direct, and these can be inverse-logit
transformed to obtain estimates and inference as odds ratios. Because this code is un-
changed from earlier examples, it is omitted here.
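A minimal sketch of the logistic fit described above; the name and coding of the two-year mortality outcome (death_2yr) are assumptions:

pfr_logistic_fit =
  pfr(
    death_2yr ~ age + gender + BMI +
      lf(MIMS_mat, argvals = seq(1/60, 24, length = 1440)),
    family = binomial(), method = "REML", data = nhanes_df)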
The results of this analysis suggest that higher MIMS values in the daytime hours are
statistically significantly associated with a reduction in the risk of death within two years,
keeping other variables fixed. A more detailed analysis of this dataset using survival analysis
is presented in Chapter 7.
An important class of models allows for non-linearity. Single index models extend the
model (4.4) by including a smooth function around the integral term [2, 70, 85, 179], while
additive models estimate a bivariate coefficient surface in place of the univariate coefficient
function [198, 210]. Many other basis expansions are possible; using a wavelet basis (often
with variable selection methods for estimation) can be suitable when the coefficient function
is not expected to be smooth [191, 309, 340]. Variable selection methods have also been
used with spline bases to encourage sparsity and build interpretability in the estimated
coefficient [135]. Scalar-on-function regression has been extended to quantile regression,
which is necessary to study elements of the response distribution other than the expected
value [31, 40]. Quantile regression can be helpful when data are skewed, a setting also
considered in [176]. The association between scalar outcomes and functional predictors can
be assessed through non-parametric estimation techniques, for example through a functional
Nadaraya-Watson estimator or a reproducing kernel Hilbert space [81, 84, 233].
where qj are the quadrature weights associated with a particular numeric approximation of
the integral. For example, if sj, j = 1, . . . , p are equally spaced between 0 and 1, qj = 1/p.
Note that the sum Σ_{j=1}^{p} {qj Xi(sj)}β1(sj) can be viewed as a sum over sj of β1(sj) weighted
by qj Xi (sj ). It turns out that mgcv can fit such structures using the “linear functional terms”
(see ?linear.functional.terms). To estimate this model, a user must specify that the
outcome, Yi , is a smooth function of the functional domain, sj , multiplied by the product of
the quadrature weight used for numeric integration, qj , and the functional predictor at the
corresponding point on the domain, Xi (sj ). This quantity is summed up over the observed
domain and then added to the linear predictor. A smoothness penalty is then automatically
applied, much in the same way as was shown in Section 4.2.3 using the paraPen argument,
but without the need for the user to manually derive a penalty matrix and basis expansion
for β1 (s).
Estimating this model then boils down to the construction of the appropriate data in-
puts to supply to mgcv::gam() and identifying the syntax associated with the model we
would like to fit or penalized log likelihood we would like to optimize over. First consider
construction of the data inputs to mgcv::gam(). We require: (1) the vector of responses,
y = [y1 , . . . , yn ]t ; (2) the matrix associated with the functional domain s = 1n ⊗ st where
st = [s1 , . . . , sp ] is the row vector containing the domain of the observed functions (in the
case of the minute-level NHANES data, p = 1440 and st = [1, . . . , 1440]/60) and ⊗ denotes the
Kronecker product; and (3) a matrix containing the quadrature weights associated with
each functional predictor P = 1n ⊗ Qt , where Qt = [q1 , . . . , qp ] is the row vector contain-
ing the quadrature weights; and lastly (4) a matrix containing the element-wise product
of the functional predictor and the quadrature weights XL = X ⊙ P, where ⊙ denotes
the element-wise product (Hadamard product) of two matrices. The code below constructs
these matrices and puts them in a data frame that we will pass to mgcv.
#Functional predictor
X <- nhanes_df$MIMS_mat
#Number of participants
N <- nrow(nhanes_df)
#Vector containing functional domain of observed data
s_vec <- seq(1/60, 24, length = 1440)
#Matrix containing domain for each person (row)
S <- kronecker(matrix(1, N, 1), t(s_vec))
#Vector quadrature weights (Simpson's rule)
q <- matrix((s_vec[length(s_vec)] - s_vec[1]) / length(s_vec) / 3 *
              c(1, rep(c(4, 2), len = 1440 - 2), 1),
            1440, 1)
#Matrix containing quadrature weights for each person (row)
L <- kronecker(matrix(1, N, 1), t(q))
#Functional predictor multiplied by quadrature weights, elementwise
X_L <- X * L

df_mgcv <-
  data.frame(
    X_L = I(X_L),
    S = I(S),
    y = nhanes_df$BMI
  )
The data frame df_mgcv for fitting the model directly using mgcv then contains the outcome vector y and the matrices S and X_L.
The SoFR model can then be fit via the function call below.
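Based on the syntax quoted later in this section, the call has the following form (the use of method = "REML" is an assumption consistent with earlier fits):

gam_fit <- gam(y ~ s(S, by = X_L, bs = "tp", k = 10), method = "REML", data = df_mgcv)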
We now connect the syntax in the one line of code above to the SoFR model formulation.
Again, this connection is the key insight which allowed for the development of functional
regression methodology based on the highly flexible penalized spline framework implemented
in the mgcv package. As with the refund::pfr() function, the first quantity specified is
the response variable in vector format, y, followed by the syntax which specifies the linear
predictor, separated by a tilde (∼). The s() function specifies that the response is a smooth
function of the variable(s) supplied as unnamed arguments (in the SoFR model, this is s).
When the variable supplied is a matrix, mgcv adds to the linear predictor the sum Σ_{j=1}^{p} β1(sij) for
each row (response unit) i = 1, . . . , N (in the SoFR model with regular data, sij = si′j = sj
for all i, i′ ∈ 1, . . . , N). The variable supplied to the by argument, XL, which must have
the same dimension as the matrix of smooth arguments, indicates that the smooth terms should be multiplied
elementwise by the entries in this matrix, qj Xi(sj). Combining these two facts leads to a
linear predictor of the form shown in equation (4.22). The remaining arguments relate to
specification of the basis expansion for β1 (·). The argument bs="tp" specifies the use of
thin plate regression splines [313], while k=10 sets the dimension of the basis to be 10.
We emphasize that the default behavior of the refund::lf() function used by
refund::pfr() to fit SoFR in Section 4.2.3 is to use the default arguments of mgcv::s(),
which is to use a thin plate regression spline basis (bs="tp") of dimension k=10. In practice,
users should explore the sensitivity of results to increasing k, until the estimated function is well
approximated and remains stable when further increasing k. For most applications a value
of k of 10-30 is enough, with some applications requiring larger values. This can be readily
checked by examining the edf column associated with β(s) in the summary.gam output. In
the case of the NHANES data, the estimated function has approximately 6.9 effective degrees of freedom,
indicating that a basis dimension of k = 10 is likely sufficient. Indeed, increasing the number
of basis functions in this example does not appreciably change the shape of the estimated
coefficient.
Quantities of interest (e.g., point estimates and standard errors) can be obtained from
this fitted object. In addition, calling the summary.gam() method on the fitted object pro-
vides a host of useful information related to model fit. R users will find the structure of the
summary output very similar to what is obtained from (generalized) linear regression fits
using the R functions lm() and glm(). Moreover, this software structure should immediately
indicate the extraordinary flexibility and wider implications of this approach. Specifying the
type of spline, number of knots, estimation method for smoothing parameters, adding scalar
and functional covariates, and changing the outcome distribution, follow immediately from
this mgcv implementation.
Suppose that we wish to construct pointwise 95% confidence intervals on a set of points
over the functional domain, concatenated in a column vector denoted as spred . The number
of points we want to obtain predictions on is then |spred |, the length of the vector spred .
Continuing the notation for basis matrices used previously in this chapter, let B(spred ) be
the |spred |×K matrix containing the basis used for estimating β1 (s) evaluated at spred . Then
β1(spred) = B(spred)β1 is the estimated coefficient function evaluated at spred. It follows
that SE{β1(spred)} is the square root of diag{B(spred)Var(β1)Bt(spred)}. This quantity can be obtained di-
rectly from mgcv::predict.gam using the appropriate data inputs. There are multiple ways
to do this, but the most straightforward one is to supply predict.gam() with a data frame
that contains qj Xi (sj ) = 1 and s = spred . The code below shows how to do this for the case
where spred is a regular grid of length 100 on [0, 1].
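A minimal sketch of this prediction step; the names df_pred, beta_hat, and se_beta_hat follow the surrounding text, while the construction of the prediction grid (here taken over the observed domain) and the use of one-column matrices are assumptions:

#Prediction grid of length 100 over the observed domain
s_pred <- seq(min(s_vec), max(s_vec), length = 100)
#Data frame with X_L = 1 so that the term evaluates to beta_1(s_pred)
df_pred <- data.frame(
  S = I(matrix(s_pred, ncol = 1)),
  X_L = I(matrix(1, nrow = 100, ncol = 1)))
pred_beta <- predict(gam_fit, newdata = df_pred, type = "terms", se.fit = TRUE)
beta_hat <- as.vector(pred_beta$fit[, 1])
se_beta_hat <- as.vector(pred_beta$se.fit[, 1])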
The mgcv package contains functionality which allows users to obtain point estimates
and standard errors for linear functions of each term included in the model, separately.
This applies to non-linear functions estimated using penalized splines, as is the case with
β1 (s). To obtain the point estimates β1 (s) and their corresponding standard errors using
mgcv::gam(), the correct data inputs must be supplied to the predict.gam() function.
Specifically, levels for all predictors used in model fitting must be supplied.
In the SoFR model we fit, the only predictor in the model was the linear functional term ∫S β1(s)Xi(s) ds,
so only predictors associated with this term need be included. Recall that the syntax used
to specify the linear function term was s(S, by = X L, bs = "tp", k = 10). The data
inputs for this term are the objects S and X L, the matrices containing the functional domain
(s) and elementwise product of the quadrature weights and the functional predictor (XL ),
respectively. The predict.gam function will evaluate each term of the model at the values
supplied to the newdata argument. In the example above, we specify that we wish to evaluate
β1 (s) at spred with XL = 1. Returning to the numeric approximation approach to estimating
the linear functional term, we have
E[Yi | {Xi(s) : s ∈ S}] ≈ β0 + Σ_{j=1}^{p} qj Xi(sj)β1(sj).
So if, for a fixed sj, we obtain predictions with qj Xi(sj) = 1, we obtain β1(sj). By setting
XL = 1 in the data frame df pred, we do exactly this. Thus, mgcv will provide point esti-
mators and their standard errors for β1 (s) for fixed points on the domain (hence pointwise
confidence intervals). The lower and upper bounds for the 95% pointwise confidence interval
constructed using the code above can be used to exactly re-create the upper left panel of
Figure 4.9, the point estimate and 95% confidence intervals obtained from refund::pfr().
Having supplied the correct data inputs to predict.gam() with the argument
se.fit=TRUE, the object returned is a list with two elements. The first element contains a
matrix with the point estimates for each term in the model, the second contains a matrix
of the same dimension with the corresponding standard errors for the point estimates con-
tained in the first returned element. In our example we only had one term in the model,
so we extracted the first column. In a model with more terms, one would need to either
manually extract the correct column based on the order in which the linear functional term
was specified, or use regular expressions.
confidence intervals based on parameter simulations introduced in Section 2.4.2 and joint
confidence intervals based on the max absolute statistics introduced in Section 2.4.3. We
also discuss some potential pitfalls associated with using the PCA or SVD decomposition
of the covariance estimator of β1 (spred ).
Algorithm 1 Algorithm for Simulations from the Spline Parameter Distribution: SoFR
Input: B, B(spred), β1, Var(β1), Z = max{|β1(s)|/SE{β1(s)} : s ∈ spred}
Output: pgCMA = P(max_{s∈spred}{|β1(s)|/SE{β1(s)}} ≥ |Z| | H0 : β1(s) = 0, ∀s ∈ spred);
  db, b = 1, . . . , B, simulations from the distribution of max_{s∈spred}{|β1(s)|/SE{β1(s)}} under H0
for b = 1, . . . , B do
  β1b ∼ N{β1, Var(β1)}
  β1b(spred) = B(spred)β1b
  db = max_{s∈spred} |β1b(s) − β1(s)|/SE{β1(s)}
end for
pgCMA = max{B−1, B−1 Σ_{b=1}^{B} 1(db > max{|β1(s)|/SE{β1(s)} : s ∈ spred})}
with the approximation due to the variability associated with the simulation procedure
and the normal approximation of the distribution of β. Here we have used the notation
q(Cβ , 1 − α) to indicate the dependence of the quantile on the correlation matrix Cβ of
β1 (spred ) and to keep the notation consistent with the one introduced in Section 2.4.
It follows that the 1 − α level correlation and multiplicity adjusted (CMA) confidence
interval is
β1 (s) ± q(Cβ , 1 − α)SE{β1 (s)}
for all values s ∈ spred . Conveniently, CMA confidence intervals can be inverted to form both
pointwise and global CMA p-values, allowing for straightforward evaluation of inference
accounting for the correlated nature of tests along the domain. First, consider the procedure
for constructing pointwise CMA p-values. For a fixed s, we can simply find the smallest value
of α for which the above interval does not contain zero (the null hypothesis is rejected).
We denote this probability by ppCMA (s) and refer to it as the pointwise correlation and
multiplicity adjusted (pointwise CMA) p-value. To calculate the global pointwise correlation
and multiplicity adjusted (global CMA) p-value we define the minimum α level at which at
least one confidence interval β1 (s) ± q(Cβ , 1 − α)SE{β1 (s)} for s ∈ spred does not contain
zero. We denote this p-value by pgCMA (spred ). As discussed in Section 2.4.4, it can be shown
that
pgCMA (spred ) = min{ppCMA (s) : s ∈ spred } .
Calculating both the pointwise and global CMA adjusted p-values requires only one set of
simulations to obtain the distribution of db, though a large number of simulated samples,
B, may be required to estimate extreme p-values. Here we set B = 10⁷ to illustrate a
point about comparing p-values obtained using different methods when p-values are very small. In
general, if we are only interested in cases when the p-value is < 0.001, B = 10⁴ simulations
should be enough.
Even with the relatively large number of simulations, the entire procedure is fairly fast as
it does not involve model refitting, simply simulating from a multivariate normal of reason-
able dimension, in our case K dimensional. Below we show how to conduct these simulations
and calculate the CMA adjusted confidence intervals and global p-values. An essential step
is to extract the covariance matrix of β1 from the mgcv fit. This is accomplished in the ex-
pression Vbeta <- vcov(gam_fit)[inx_beta, inx_beta], where inx_beta are the indices
associated with the spline coefficients used to estimate β1(·). The code chunk below focuses
on obtaining the CMA confidence intervals at a particular confidence level, in this case
α = 0.05.
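A minimal sketch of these simulations; the objects gam_fit, df_pred, beta_hat, and se_beta_hat are as above, while the coefficient-index pattern, the number of simulations, and the use of MASS::mvrnorm() are assumptions:

#Indices of the spline coefficients for the functional term (assumed pattern)
inx_beta <- grep("s\\(S\\)", names(coef(gam_fit)))
Vbeta <- vcov(gam_fit)[inx_beta, inx_beta]
#Basis evaluated on the prediction grid
Bmat <- predict(gam_fit, newdata = df_pred, type = "lpmatrix")[, inx_beta]
#Simulate spline coefficients and compute the max statistics d_b
n_sim <- 1e4
coef_sim <- MASS::mvrnorm(n_sim, mu = coef(gam_fit)[inx_beta], Sigma = Vbeta)
beta_sim <- coef_sim %*% t(Bmat)
d_b <- apply(abs(sweep(beta_sim, 2, beta_hat, "-")) /
               matrix(se_beta_hat, n_sim, length(se_beta_hat), byrow = TRUE),
             1, max)
#CMA multiplier and 95% CMA confidence band
q_cma <- quantile(d_b, 0.95)
cma_lb <- beta_hat - q_cma * se_beta_hat
cma_ub <- beta_hat + q_cma * se_beta_hat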
We now show how to invert these confidence intervals and obtain the pointwise ad-
justed CMA p-values, {ppCMA (s) : s ∈ spred }, as well as the global CMA adjusted p-value
pgCMA (spred ). In practice all p-values we estimate are limited by the number of simulations
we use.
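A sketch of the inversion, reusing d_b and n_sim from the simulation above; the names p_val_lg and p_val_g follow the text:

#Pointwise CMA p-values: proportion of simulated max statistics exceeding the
#observed standardized statistic at each s, bounded below by 1/n_sim
z_obs <- abs(beta_hat) / se_beta_hat
p_val_lg <- sapply(z_obs, function(z) max(1 / n_sim, mean(d_b > z)))
#Global CMA p-value
p_val_g <- min(p_val_lg)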
The pointwise adjusted p-values are stored in the vector p_val_lg while pgCMA(spred) is
stored in the scalar p_val_g. Based on B = 10⁷ simulations we obtain a global CMA p-
value of 1 × 10⁻⁵ for testing the null hypothesis of zero effect. Figure 4.11 displays the results of
the CMA inference. The left panel of Figure 4.11 presents both the unadjusted and pointwise
CMA adjusted p-values {ppCMA (s) : s ∈ spred } in dark and light gray, respectively. The
dashed gray line corresponds to probability 0.05. The right panel of Figure 4.11 presents
the unadjusted and pointwise CMA adjusted 95% confidence intervals, again in dark and
light gray, respectively. The dashed gray line in this panel corresponds to zero effect. We find
that after adjusting for correlations and multiple comparisons, the three broad periods of
time identified as significant by the unadjusted pointwise inference persist (early morning,
late morning, and early evening); however, the exact periods of time in which activity is
significantly associated with BMI are reduced due to the widening of confidence intervals.
FIGURE 4.11: Pointwise CMA inference for SoFR based on simulations from the distribu-
tion of spline coefficients. BMI is the outcome and PA functions are the predictors. Left
panel: estimated pointwise unadjusted (dark gray) and CMA (light gray) p-values denoted
by ppCMA (s). Right panel: the 95% pointwise unadjusted (dark gray) and CMA confidence
intervals for β(s).
{β1b (s) : 1 ≤ b ≤ B, s ∈ spred } is the collection of bootstrap estimates over the grid spred .
As with the simulations from the spline parameter distribution, we then obtain
This procedure requires extracting the variances of β1 (spred ), but not the entire covariance
matrix. However, it requires refitting the model B times, which increases the computational
complexity and the time needed to estimate extreme p-values.
The code below illustrates how to perform the non-parametric bootstrap with repeated
model estimation done using mgcv::gam() instead of refund::pfr(). For SoFR, either func-
tion is equally easy to use for estimation given the wide format storage of the data used as
input for both functions (as compared to function-on-function regression; see Chapter 6).
However, as of this writing, pfr is less straightforward to use, as it requires a function call
to predict.gam. This is not an inherent limitation of pfr, but rather a consequence of how
the predict.pfr method was designed.
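A minimal sketch of this bootstrap; the resampling of participants with replacement and the number of bootstrap samples are assumptions, while beta_mat_boot_np follows the text:

n_boot <- 1000
beta_mat_boot_np <- matrix(NA, n_boot, nrow(df_pred))
for (b in seq_len(n_boot)) {
  #Resample participants with replacement and refit the SoFR model
  inx_b <- sample(seq_len(nrow(df_mgcv)), replace = TRUE)
  fit_b <- gam(y ~ s(S, by = X_L, bs = "tp", k = 10),
               method = "REML", data = df_mgcv[inx_b, ])
  #Store the estimated coefficient function on the prediction grid
  beta_mat_boot_np[b, ] <-
    predict(fit_b, newdata = df_pred, type = "terms")[, 1]
}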
Given the matrix of non-parametric resampled estimated β1b (spred ) contained in the
matrix beta mat boot np above, the calculation of pointwise CMA adjusted and global
p-values (along with their corresponding confidence intervals) proceeds exactly the same
as with the simulations from the spline parameter distribution, and thus the code to do so is omitted here. In this
particular example, the non-parametric bootstrap yields qualitatively similar results when
compared to the simulations from the spline parameter distribution. Figure 4.12 presents
the results in the same format as Figure 4.11. We note that the confidence intervals are
slightly wider, resulting in the early morning associations not being statistically significant
at the 0.05 level when conducting CMA inference. In addition, the late morning period loses
much of its statistical significance.
FIGURE 4.12: Pointwise CMA inference for SoFR based on the nonparametric bootstrap
of the max statistic. BMI is the outcome and PA functions are the predictors. Left panel:
estimated pointwise unadjusted (dark gray) and CMA (light gray) p-values denoted by
ppCMA (s). Right panel: the 95% pointwise unadjusted (dark gray) and CMA confidence
intervals for β(s).
Consider the case when the distribution of β1 is well approximated by a multivari-
ate normal distribution; this can be justified either based on the distribution of the data
or by asymptotic considerations. We have already seen that in this application, this as-
sumption is probably incorrect. Regardless, we would like to find an analytic solution
to this problem and compare the results with those obtained via simulations from the
spline coefficients distribution. Under the assumption of normality of β1 it follows that
β1 (spred ) = B(spred )β1 has a multivariate normal distribution with covariance matrix
Var{β1 (spred )} = B(spred )Var(β1 )Bt (spred ). Using this fact, one could proceed with a de-
composition approach to obtain CMA confidence intervals as well as pointwise and global
p-values for testing the null hypothesis: H0 : β1 (s) = 0, ∀s ∈ spred .
Using the same argument discussed in Section 2.4.1
where Cβ is the correlation matrix corresponding to the covariance matrix Var{β1 (spred )} =
B(spred )Var(β1 )Bt (spred ). As we discussed in Section 2.4.1, we need to find a value q(Cβ , 1−
α) such that
P {−q(Cβ, 1 − α) × e ≤ X ≤ q(Cβ, 1 − α) × e} = 1 − α ,
where X ∼ N (0, Cβ ) and e = (1, . . . , 1)t is the |spred | × 1 dimensional vector of ones. Once
q(Cβ , 1 − α) is available, we can obtain a CMA 1 − α level confidence interval for β1 (s) as
Luckily, the function qmvnorm in the R package mvtnorm [96, 97] is designed to extract such
quantiles. Unluckily, the function does not work for matrices Cβ that are singular and very
high dimensional. Therefore we need to find a theoretical way around the problem.
Indeed, β1 (spred ) = B(spred )β1 has a degenerate normal distribution, because its rank
is at most K, the number of basis functions used to estimate β1 (s). Since we have evaluated
β1 (s) on a grid of |spred | = 100, the covariance and correlation matrices of β1 (spred ) are 100
dimensional. To better understand this, consider the case where β1 (s) = c + s for c ∈ R.
Then rank(Cβ ) ≤ 2 for any choice of spred .
We will use a statistical trick based on the eigendecomposition of the covariance matrix.
Recall that if β1 (spred ) has a degenerate multivariate normal distribution of rank m ≤ K,
then there exists some random vector of independent standard normal random variables
Q ∈ Rm such that
β1t (spred ) = Qt D ,
with Dt D = B(spred )Var(β1 )Bt (spred ). If we find m and a matrix D with these prop-
erties, the problem is solved, at least theoretically. Consider the eigendecomposition
B(spred )Var(β1 )Bt (spred ) = UΛUt , where Λ is a diagonal matrix of eigenvalues and the
matrix U, with UUt = I|spred|×|spred|, is an orthonormal matrix with the kth column being the
eigenvector corresponding to the kth eigenvalue. Note that all eigenvalues λk = 0 for k > m
and
B(spred )Var(β1 )Bt (spred ) = Um Λm Utm ,
where Um is the |spred | × m dimensional matrix obtained by taking the first m columns of
U and Λm is the m × m dimensional diagonal matrix with the first m eigenvalues on the
main diagonal. If we define Dt = Um Λm^{1/2} and
Unfortunately, this p-value is not equal to the p-value obtained from simulating from the
spline parameters normal distribution. In fact, the p-value obtained using this method is
approximately 0, effectively the same p-value reported by summary.gam in the mgcv package
(p < 2 × 10⁻¹⁶). The p-value reported by mgcv is based on a very similar approach which
modifies slightly the matrix of eigenvalues of the covariance operator [318] and derives a
Wald test statistic which very closely matches the sum of the squared entries of Q, a χ² random
variable under the null hypothesis. In this example, all three approaches would yield the
same inference in practice (a statistically significant association), but the discrepancy, which
spans orders of magnitude, is cause for concern. Indeed, the problem becomes more concerning
when considering the results of the multivariate normal approximation, presented below,
which provides a p-value that agrees quite closely with the simulations from the spline
parameter distribution.
obtain inference using software for calculating quantiles of a multivariate normal random
vector (e.g., mvtnorm::pmvnorm). The code to do so is provided below.
#Get hat(beta)/SE(hat(beta))
beta_hat_std <- beta_hat / se_beta_hat
#Get Var(hat(beta)/SE(hat(beta)))
Vbeta_hat <- Bmat %*% Vbeta %*% t(Bmat)
#Jitter to make positive definite
Vbeta_hat_PD <- Matrix::nearPD(Vbeta_hat)$mat
#Get correlation function
Cbeta_hat_PD <- cov2cor(matrix(Vbeta_hat_PD, 100, 100, byrow = FALSE))
#Get max statistic
Zmax <- max(abs(beta_hat_std))
#p-value
p_val <- 1 - pmvnorm(lower = rep(-Zmax, 100), upper = rep(Zmax, 100),
                     mean = rep(0, 100), corr = Cbeta_hat_PD)
The resulting p-value is 2.9 × 10⁻⁶, which is very close to that obtained from the sim-
ulations from the spline parameter distribution. In principle, this method should yield the
same result as the exact solution, though we find with some consistency that it does not.
In our experience, this approach provides results which align very well with the simulations
from the spline parameter distribution, though the jittering of the covariance function is,
as of this writing, not theoretically justified. To construct a 95% CMA global
confidence interval one may use the mvtnorm::qmvnorm function, illustrated below. Un-
surprisingly, the resulting global multiplier is almost identical to that obtained from the
parametric bootstrap. For that reason we do not plot the results or interpret further.
Z_global <- qmvnorm(0.95, mean = rep(0, 100), corr = Cbeta_hat_PD,
                    tail = "both.tails")$quantile
5
Function-on-Scalar Regression
We now consider the use of functions as responses in models with scalar predictors. This
setting is widespread in applications of functional data analysis, and builds on specific tools
and broader intuition developed in previous chapters. We will focus on the linear Function-
on-Scalar Regression (FoSR) model, with a brief overview of alternative approaches in later
sections.
FoSR is known under different names and was first popularized by [242, 245] who in-
troduced it as a functional linear model with a functional response and scalar covariates;
see Chapter 13 in [245]. Here we prefer the more precise FoSR nomenclature introduced
by [251, 253], which refers directly to the type of outcome and predictor. It is difficult
to pinpoint where these models originated, but it is likely that the life of FoSR models
has multiple origins, probably driven by applications. One of the origins is intertwined
with the introduction of linear mixed effects (LME) models for longitudinal data [161].
While mixed effects models have a much longer history, the Laird and Ware paper [161]
summarized and introduced the modern formalism of mixed effects models for longitudi-
nal data. Linear mixed models have traditionally focused on a sparsely observed functional
outcome and scalar predictors and use specific known structures of random effects (e.g.,
random slope/random intercept models); for more details see [66, 87, 231, 301]. Another
point of origin was the inference for differences between the means of groups of functions
[27, 79, 243, 244], as described in Section 13.6 of [245]. These approaches tended to be more
focused on the functional aspects of the problem and allowed more flexibility in modeling
the random effects as smooth functions. In this book we will unify the data generating
mechanisms and methods under the mixed effects models umbrella, though some random
effects will be used for traditional modeling of trajectories, while others will be used for
nonparametric smoothing of the functional data. We will also emphasize that unifying soft-
ware can be used for fitting such models and point out the various areas that are still open
for research.
FoSR has been applied extensively to areas of scientific research including brain connec-
tivity [251], diffusion brain imaging [103, 109, 280], seismic ground motion [11], CD4 counts
in studies of HIV infection [77], reproductive behavior of large cohorts of medflies
[44, 45], carcinogenesis experiments [207], knee kinematics [6], human vision [218], circadian
analysis of cortisol levels [114], mass spectrometry proteomic data [206, 342, 343, 204], eye
scleral displacement induced by intraocular pressure [162], electroencephalography during
sleep [53], feeding behavior of pigs [100], objective physical activity measured using ac-
celerometers [57, 265, 273, 327], phonetic analysis [8], and continuous glucose monitoring
[92, 270], to name a few. Just as discussed in Chapter 4, these papers are referenced here
for their specific area of application, but they contain substantial methodological develop-
ments that could be explored in detail. Throughout this section, we will focus on methods
for the linear FoSR model. Many additional methods exist, including the Functional Linear
Array Model (FLAM) [23, 25, 24], Wavelet-based Functional Mixed Models (WFMM) [207],
Functional Mixed Effects Modeling (FMEM) [337, 344], longitudinal FoSR using structured
penalties [159], and others noted in a recent review of functional regression methods [205].
We will not discuss these approaches here, but they provide alternative model structures
and estimation methods that should be considered.
The overall goal of this chapter is not to explore the vast array of published methodolog-
ical tools for FoSR. Instead, we will focus on a specific group of methods that use penalized
splines to model functional effects, connect these models with linear mixed effects models,
and show how to implement these methods in software such as refund and mgcv.
nhanes_df =
  readRDS(
    here::here("data", "nhanes_fda_with_r.rds")) %>%
  select(SEQN, gender, age, MIMS_mat = MIMS) %>%
  mutate(
    age_cat =
      cut(age, breaks = c(18, 35, 50, 65, 80),
          include.lowest = TRUE),
    SEQN = as.factor(SEQN)) %>%
  drop_na(age, age_cat) %>%
  filter(age >= 25) %>%
  tibble() %>%
  slice(1:250)
In the next code chunk, the MIMS mat variable is converted to a tidyfun [261] object
via the tfd() function. As elsewhere, we use the arg argument in tfd() to define the grid
over which functions are observed: 1/60, 2/60, . . . , 1440/60, so that minutes are in 1/60 increments
and hours of the day fall on integers from 1 to 24. After transformation the MIMS tf vari-
able is a vector that contains all functional observations and makes many operations easier
in the tidyverse [311]. This transformation is not strictly necessary, and throughout the
book we also show how to work directly with the matrix format. However, the tidyverse
has become increasingly popular in data science and we illustrate here that functional data
analysis can easily interface with it.
nhanes_df =
  nhanes_df %>%
  mutate(
    MIMS_tf = matrix(MIMS_mat, ncol = 1440),
    MIMS_tf = tfd(MIMS_tf, arg = seq(1/60, 24, length = 1440)))
Once data are available in a data frame, one can start to visualize some of their prop-
erties. Suppose that one is interested in how the average objectively measured physical
activity over a 24-hour interval varies with age and gender. Using the tidyfun package, the
following code shows how to do that using a tidy syntax. Beginning with the NHANES data
stored in nhanes df, we group variables by age category and gender using the group by
function and obtain the 24-hour means for each subgroup using the summarize function.
The next component of the code displays the individual mean functions using the ggplot()
function and the geom spaghetti() function from tidyfun.
nhanes_df %>%
  group_by(age_cat, gender) %>%
  summarize(mean_mims = mean(MIMS)) %>%
  ggplot(aes(y = mean_mims, color = age_cat)) +
  geom_spaghetti() +
  facet_grid(. ~ gender) +
  scale_x_continuous(breaks = seq(0, 24, length = 5)) +
  labs(x = "Time of day (hours)", y = "Average MIMS")
Figure 5.1 displays the results of the preceding code chunk. Individual means are color
coded by age category, as indicated by the color = age cat aesthetic mapping and sepa-
rated into two panels corresponding to males and females, as indicated by the facet grid()
function. The means of the objectively measured physical activity exhibit clear circadian
rhythms with more physical activity during the day and less during the evening and night.
One can also notice a strong effect of age indicating that older individuals tend to be less
active than younger individuals on average. This trend is clearer among men, though women
in the (65, 80] age range (shown in the right panel in yellow) exhibit, on average, much lower
physical activity than younger women. When comparing men to women in the same age
category (same color lines in the left and right panels, respectively), results indicate that
women are more active. This is consistent with the findings reported in [327] in the Balti-
more Longitudinal Study of Aging (BLSA) [265, 284, 327], though in direct contradiction
with one of the main findings reported in the extensively cited paper [296] based on the
2003-2005 NHANES accelerometry study.
Exploratory plots like this one are useful for beginning to understand the effects of
covariates on physical activity, but are limited in their scope. For example, they cannot
be used to adjust for covariates or assess statistical significance, especially in cases when
the effects of covariates are not as obvious as those of age and gender. Also, the group-
specific mean functions are quite noisy because these are the raw means in each group
without accounting for smoothness across time. These are some of the issues that more
formal approaches to function-on-scalar regression are intended to address.
FIGURE 5.1: Pointwise means of MIMS in the NHANES data for four age categories (in-
dicated by different color) and gender (male/female) shown in the left and right panels,
respectively.
trajectories, we smooth minute-level data using a moving average approach with a 60-minute
bandwidth, and evaluate the resulting functions at the midpoint of each hour.
nhanes_df =
  nhanes_df %>%
  mutate(
    MIMS_hour =
      tf_smooth(MIMS, method = "rollmean", k = 60, align = "center"),
    MIMS_hour = tfd(MIMS_hour, arg = seq(.5, 23.5, by = 1)))
The resulting MIMS hour data are shown in Figure 5.2, while the code for obtaining the
figure is shown in the code chunk below. We pipe the dataset into ggplot(), and plotting
continues using geom spaghetti() to show functions and adding geom meatballs() to
place a dot at every point. This is done both for aesthetic purposes as well as to emphasize
the discrete nature of the data.
nhanes_df %>%
  ggplot(aes(y = MIMS_hour, color = age)) +
  geom_spaghetti(alpha = .2) +
  geom_meatballs(alpha = .2) +
  facet_grid(. ~ gender)
As before, we are interested in the effects of age, now treated as a continuous variable,
and gender. The binning process retains a level of granularity that is informative regarding
the diurnal patterns of activity but reflects a substantial reduction in the dimension and
detail in the data.
These data can be analyzed using hour-specific linear models that regress bin-average
MIMS values on age and gender. This collection of linear models does not account for the
temporal structure of the diurnal profiles except through the binning that aggregates data
to an hour level, but taken together will illustrate the association between the outcome and
predictors over the course of the day. As a first example, we will fit a standard linear model
with the average MIMS value between 1:00 and 2:00 PM as a response. The model is
MIMSi = β0 + β1 agei + β2 genderi + ϵi ,
where MIMSi represents the average MIMS value between 1:00 and 2:00 PM. For presen-
tation simplicity we avoided adding an index for the time period for the outcome, model
parameters and error process.
In the next code chunk, which fits this bin-specific linear model, the first two lines are self-
explanatory. The tf unnest() function in the tidyfun package transforms the nhanes df
dataframe, which includes MIMS hour as a tidyfun vector, into a long-format dataframe
with columns for the functional argument (i.e., hour 0.5, 1.5, . . . , 23.5) and the correspond-
ing functional value (i.e., the subject bin-specific average). The filter() function subsets
the data to include only the mean MIMS between 1:00 and 2:00 PM (the 13.5th hour is the
middle of this interval when time is indexed starting at midnight). The resulting dataset
is passed as the data argument into lm() using the “.” placeholder, with the appropriate
formula specification for the desired model.
linear_fit =
  nhanes_df %>%
  select(SEQN, age, gender, MIMS_hour) %>%
  tf_unnest(MIMS_hour) %>%
  filter(MIMS_hour_arg == 13.5) %>%
  lm(MIMS_hour_value ~ age + gender, data = .)
TABLE 5.1
Regression results for regressing the mean MIMS between
1:00 and 2:00 PM on age and gender.
term estimate std.error statistic p.value
(Intercept) 16.442 1.027 16.003 0.000
age −0.069 0.017 −3.953 0.000
genderFemale 0.963 0.592 1.626 0.105
The results of fitting the linear model are shown in Table 5.1. Because the age variable
is not centered, the intercept is the expected average MIMS between 1:00 and 2:00 PM
among males at age 0. The estimated age coefficient implies that the expected average
MIMS in this hour decreases by 0.07 for each one-year increase in age among men, and is
strongly statistically significant. Women have, on average, higher MIMS values than men in
this time window, although the difference is suggestive rather than statistically significant.
We illustrate this analysis graphically in Figure 5.3, using code we briefly describe but omit. In short, fitted values resulting from the regression were added to the nhanes_df data frame. The plot was constructed using ggplot(), with data points in the scatterplot shown using geom_point() and the fitted values illustrated using geom_line().
The scatterplot and regression results are consistent with our previous observations: the
binned-average physical activity decreases with age and is higher for female compared to
male participants across the age groups. The benefit of this analysis over visualization-based
exploratory techniques is that it provides a formal statistical assessment of these effects and
their significance.
So far we have shown regression results for the 1:00 to 2:00 PM interval, but these asso-
ciations may vary by the hour of the day. The next step in our exploratory analysis is to fit a separate regression at each hour.

FIGURE 5.3: Age (x-axis) versus average MIMS (y-axis) between 1:00 and 2:00 PM for males (dark purple) and females (yellow). Each dot corresponds to a study participant and regression lines are added for males and females. The analysis is conducted for 250 individuals for didactic purposes.

We accomplish this using data nested within each hour. First, we unnest the subject-specific functional observations and then re-nest within each hour; the result is a data frame containing 24 rows, one for each hour, with a column that contains a list of hour-specific data frames containing the hourly MIMS value, age, and gender. By mapping over the entries in this list, we can fit hour-specific linear models and extract tidied results. The code chunk below implements this analysis.
hourly_regressions =
  nhanes_df %>%
  select(SEQN, age, gender, MIMS_hour) %>%
  tf_unnest(MIMS_hour) %>%
  rename(hour = MIMS_hour_arg, MIMS = MIMS_hour_value) %>%
  nest(data = -hour) %>%
  mutate(
    model = map(.x = data, ~ lm(MIMS ~ age + gender, data = .x)),
    result = map(model, broom::tidy)
  ) %>%
  select(hour, result)
Before visualizing the results, we do some data processing to obtain hourly confidence intervals and then structure coefficient estimates and confidence bands as tf objects. The result is stored in the variable hour_bin_coefs, which is a three-row data frame, where each row corresponds to the intercept, age, and gender effects, respectively. Columns include the term name as well as the coefficient estimates and the upper and lower limits of the confidence bands.
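The processing code itself is not shown in the text; a minimal sketch of how hour_bin_coefs might be constructed from hourly_regressions is given below, where the use of normal-approximation limits and the tf_nest() call are assumptions rather than the authors' exact code.

# Sketch: build hourly estimates and 95% confidence limits as tf objects;
# the construction below is an assumption, not the authors' implementation.
hour_bin_coefs =
  hourly_regressions %>%
  unnest(result) %>%
  mutate(
    lb = estimate - 1.96 * std.error,
    ub = estimate + 1.96 * std.error) %>%
  select(hour, term, estimate, lb, ub) %>%
  tf_nest(estimate, lb, ub, .id = term, .arg = hour)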
We show the analysis results using ggplot() and other tidyfun functions. Because the coefficient estimates stored in coef are tf objects, they can be plotted using geom_spaghetti(). To emphasize the estimates at each hour, we add points using the geom_meatballs() function. Confidence bands are shown as shaded regions by specifying the upper and lower limits in geom_errorband(), and we facet by term to plot each coefficient separately.
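A minimal sketch of the plotting code just described, assuming the hour_bin_coefs construction sketched above, is shown below; aesthetic settings are illustrative.

# Sketch of the coefficient plot; column names follow the sketch above.
hour_bin_coefs %>%
  ggplot(aes(y = estimate)) +
  geom_spaghetti() +
  geom_meatballs() +
  geom_errorband(aes(ymax = ub, ymin = lb), alpha = .2) +
  facet_wrap(~ term, scales = "free")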
Figure 5.4 displays the point estimators and 95% confidence intervals for the coefficients
for regressions of hourly mean MIMS on age and gender. Compared to the regression at
a single time point, these results provide detailed temporal information about covariate
effects.

FIGURE 5.4: Point estimators and 95% confidence intervals for the regression coefficients for hourly regressions of mean MIMS on age and gender. Each regression is conducted independently of the other regressions, and results are shown for the intercept, age, and gender parameters, respectively.

The left panel corresponds to the intercept and is consistent with a fairly typical circadian pattern, with low activity during the night and higher activity during the day.
The middle panel displays the age effect adjusted for gender and indicates that older indi-
viduals are generally less active at all times (note that point estimators are all negative).
Moreover, age has a stronger effect on physical activity in mid-afternoon and evening (note
the decreasing pattern of estimated effects as a function of time of the day). This result
based on the NHANES data confirms similar results reported by [265, 331] in the Baltimore
Longitudinal Study of Aging [284]. The right panel displays the association between gender
and objectively measured physical activity after adjusting for age. For this data set results
are consistent with less activity for women during the night (potentially associated with less
disturbed sleep) and more activity during the day. The associated confidence bands suggest
that not all of these effects may be statistically significant – there may not be a significant
effect of age in the early morning, for example – but at many times of the day there do
seem to be significant effects of age and gender on the binned-average MIMS values. Fi-
nally, we note that the confidence bands are narrowest in the nighttime hours and widest
during the day, which is consistent with the much higher variability of the physical activity
measurements during the day illustrated in Figure 5.2.
The hour-level analysis is an informative exploratory approach, but has several limita-
tions. Most obviously, it aggregates data within prespecified bins, and in doing so loses some
of the richness of the underlying data. That aggregation induces some smoothness by relying
on the underlying temporal structure, but this smoothness is implicit and dependent on the
bins that are chosen – adjacent coefficient estimates are similar only because the underlying
data are similar, and not because of any specific model element. To emphasize these points,
we can repeat the bin-level analysis using ten-minute and one-minute epochs. These can be
implemented using only slight modifications to the previous code, in particular by changing
the bin width of the rolling average and the grid of argument values over which functions are
observed. We therefore omit this code and focus on the results produced in these settings.
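For reference, a minimal sketch of the ten-minute version is shown below; the new column name and the argument grid (ten-minute midpoints on the minute scale) are assumptions, and the one-minute analysis can use the original minute-level MIMS data directly.

# Sketch: ten-minute binned averages; column name and grid values are
# assumptions based on the description in the text.
nhanes_df =
  nhanes_df %>%
  mutate(
    MIMS_ten =
      tf_smooth(MIMS, method = "rollmean", k = 10, align = "center"),
    MIMS_ten = tfd(MIMS_ten, arg = seq(5, 1435, by = 10)))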
FIGURE 5.5: Point estimators and 95% confidence intervals for the regression coefficients
for ten-minute regressions of mean MIMS on age and gender. Each regression is conducted
independently of the other regressions, and results are shown for the intercept, age, and gender
parameters, respectively.
Figures 5.5 and 5.6 display the same results as Figure 5.4, but using ten- and one-minute intervals, respectively, instead of one-hour intervals. As the bins get smaller, the point estimators become more variable, though the overall trends and magnitudes of the point estimators remain relatively stable. Such results are reassuring in practice, as they show consistency of results and provide a useful sensitivity analysis. The increased variability of the point estimators (wigglier curves) and the larger confidence intervals (shaded areas around the curves) make the graphs less appealing as the resolution at which the analysis is conducted increases.
These graphs indicate some of the difficulties inherent in binning-based analyses because
they (1) do not leverage the temporal structure directly in the estimation process; (2) do
not induce smoothness of effects across time; and (3) do not account for potentially large
within-subject correlations of residuals. Another, more subtle, issue is that we implicitly
rely on curves being observed densely over the same grid, or, less stringently, that a rolling
mean is a plausible way to generate binned averages. This approach may not work when data
are sparse or irregular across subjects. Various approaches to function-on-scalar regression
attempt to solve some of these issues, and can work more or less well depending on the
characteristics of the data to which they are applied.
FIGURE 5.6: Point estimators and 95% confidence intervals for the regression coefficients
for one-minute regressions of mean MIMS on age and gender. Each regression is conducted
independently of the other regressions, and results are shown for the intercept, age, and gender
parameters, respectively.
of interest are age in years (Z_i1) and a variable indicating whether participant i is male (Z_i2 = 0) or female (Z_i2 = 1). The linear function-on-scalar regression for this setting is

$$W_i(s) = \beta_0(s) + Z_{i1}\beta_1(s) + Z_{i2}\beta_2(s) + \epsilon_i(s), \qquad (5.1)$$

with coefficients β_q : S → R, q ∈ {0, 1, 2}, that are functions measured over the same domain as the functional response W_i(·). Scalar covariates in this model can exhibit the same
degree of complexity as in non-functional regressions, allowing any number of continuous
and categorical predictors of interest.
Coefficient functions encode the varying association between the response and predictors,
and are interpretable in ways that parallel non-functional regression models. In particular,
β0 (s) is the expected response for Zi1 = Zi2 = 0; β1 (s) is the expected change in the
response for each one unit change in Zi1 while holding Zi2 constant; and so on. These are
often interpreted at specific values of s ∈ S to gain intuition for the associations of interest.
In the NHANES data considered so far, for example, coefficient functions can be used to
compare the effect of increasing age on morning and evening physical activity, keeping
gender fixed.
The linear FoSR model addresses the concerns our exploratory analysis raised. Because
coefficients are functions observed on S, they can be estimated using techniques that ex-
plicitly allow for smoothness across the functional domain. This smoothness, along with
appropriate error correlation structures, provides an avenue for statistical inference under
a clearly defined set of assumptions. Considering coefficients as functions also opens the
possibility for specifying complex data generating mechanisms, such as responses that are
observed on grids that are sparse or irregular across subjects.
In matrix form, evaluating model (5.1) on the grid s_1, . . . , s_p gives

$$E[W_i] = Z_i \begin{bmatrix} \beta_0(s_1) & \cdots & \beta_0(s_p) \\ \beta_1(s_1) & \cdots & \beta_1(s_p) \\ \beta_2(s_1) & \cdots & \beta_2(s_p) \end{bmatrix},$$

where Z_i = [1, Z_i1, Z_i2] is the row vector containing the scalar terms that defines the regression model. This expression is useful because it connects a 1 × p response vector to a recognizable row in a standard regression design matrix and the matrix of functional coefficients.
Estimation of coefficients will rely on approaches that have been used elsewhere, with nuances specific to the FoSR setting. As a starting point, we will expand each coefficient function as

$$\beta_q(s) = \sum_{k=1}^{K} \beta_{qk} B_k(s)$$

using the basis B_1(s), . . . , B_K(s). Here we have used the same basis B_1(s), . . . , B_K(s) for all three coefficients, though this is not necessary in specific applications. We leave the problem of dealing with different bases and the associated notational complexity as an exercise. While many choices are possible, we will use a spline expansion.
Conveniently, one can concisely combine and rewrite the previous expressions. Let B(s_j) = [B_1(s_j), . . . , B_K(s_j)] be the 1 × K row vector containing the basis functions evaluated at s_j and B(s) be the p × K matrix with jth row equal to B(s_j). Further, let β_q = [β_{q1}, . . . , β_{qK}]^t be the K × 1 vector of basis coefficients for function q and β = [β_0^t, β_1^t, β_2^t]^t be the (3K) × 1 dimensional vector constructed by stacking the vectors of basis coefficients. For the p × 1 response vector W_i, we have

$$E[W_i] = [B(s) \otimes Z_i]\beta$$

and, stacking over study participants,

$$E[W] = [Z \otimes B(s)]\beta.$$
predictors, and obtaining the response vector W unnests the tf vector and then extracts the
observed responses for each subject.
Z_des =
  model.matrix(
    SEQN ~ gender + age,
    data = nhanes_df
  )

W =
  nhanes_df %>%
  tf_unnest(MIMS) %>%
  pull(MIMS_value)

basis =
  splines::bs(epoch_arg, df = 30, intercept = TRUE)
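The construction of the Kronecker product design matrix used in the OLS step below is not shown in the excerpt; a minimal sketch, assuming that the rows of W are ordered by subject and then by epoch, is:

# Sketch: build the Z (Kronecker) B(s) design matrix used in the OLS fit
# below; this construction is an assumption based on the surrounding text.
Z_kron_B = kronecker(Z_des, basis)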
spline_coefs = solve(t(Z_kron_B) %*% Z_kron_B) %*% t(Z_kron_B) %*% W

OLS_coef_df =
  tibble(
    method = "OLS",
    term = colnames(Z_des),
    coef =
      tfd(
        t(basis %*% matrix(spline_coefs, nrow = 30)),
        arg = epoch_arg)
  )
Figure 5.7 displays the estimators of the time-dependent intercept, age, and gender
effects in model (5.1) obtained by expansion of these coefficients in a B-spline basis with 30
degrees of freedom and OLS regression. The outcome is physical activity at every minute.
FIGURE 5.7: Coefficient estimates for the intercept, age and gender effects in model (5.1)
obtained by expansion of these coefficients in a B-spline basis with 30 degrees of freedom
and OLS regression.
These results can be compared to those in Figures 5.5 and 5.6. Comparing the functional estimates to those obtained from epoch-specific regressions begins to indicate the benefits of a functional perspective. Coefficient functions have a smoothness determined by the underlying spline expansion, and estimates borrow information directly across adjacent time points.
Our specification relied on having all observations on the same grid, but this assumption
can be relaxed: if data contain subject-specific grids si , one can replace the matrix [Z⊗B(s)]
with one constructed by row-stacking matrices [Zi ⊗ B(si )] (note that it can take some care
to ensure the basis is constant across subjects, for example by explicitly defining knot
points).
An inspection of results in Figure 5.7, which improves on exploratory analyses by imple-
menting an explicitly functional regression approach, nonetheless indicates that this fit may
exhibit undersmoothing of functional coefficients. In the next section, we describe how to
induce an appropriate degree of smoothness by starting with a large number of basis func-
tions and then applying smoothing penalties. As we discuss how to incorporate smoothness
penalties and account for complex error correlations, it is helpful to view these as variations
on and refinements of a regression framework that uses familiar tools like OLS.
smoothing [116, 258, 322] because it balances model complexity and computational feasibility, and allows inference via the corresponding mixed effects models [258, 322]. A major emphasis
of this book is to show how the same ideas can be adapted and extended to functional data
regression.
In Section 5.2.1.1 we used OLS to conduct estimation. These estimators correspond to maximum likelihood estimation in a model that assumes that the residual vectors ε_i = [ε_i(s_1), . . . , ε_i(s_p)]^t have mutually independent N(0, σ_ε²) entries. Here the variance σ_ε² is assumed to be constant over time points, s ∈ S, and subjects, i. More precisely, maximizing the (log) likelihood L(β; W) induced by this assumption with respect to the spline coefficients is equivalent to the OLS approach. Inducing smoothness on the spline coefficients can be done using penalties on the amount of variation of β_q(s). A common measure of this variation is the second derivative penalty P(β_q) = ∫_S {β_q''(s)}² ds, but other penalties can also be used. It can be shown that this penalty has the quadratic form P(β_q) = β_q^t D_q β_q, where D_q is a known positive semi-definite matrix that provides the structure of the penalty. Many other penalties have a similar quadratic form, but with a different penalty structure matrix D_q. In this book we use only quadratic penalties.
The penalized log likelihood can then be written as

$$L(\beta, \lambda; W) = L(\beta; W) + \sum_{q=0}^{Q} \lambda_q P(\beta_q) = L(\beta; W) + \sum_{q=0}^{Q} \lambda_q \beta_q^t D_q \beta_q\,, \qquad (5.2)$$
where the tuning parameters λq , for q between 0 and Q = 2, control the balance be-
tween goodness-of-fit and complexity of the coefficient functions βq (s). As discussed in Sec-
tion 2.3.2, the smoothing parameters can be estimated using a variety of criteria, though
here we will use the penalized likelihood approach introduced in Section 2.3.3.
For Gaussian outcome data we can reparameterize λ_q = σ_ε²/σ_q², where the σ_q² ≥ 0 are non-negative parameters. With this notation, and after dividing the criterion in equation (5.2) by σ_ε², we obtain

$$-\frac{L(\beta; W)}{2\sigma_\epsilon^2} - \sum_{q=0}^{Q} \frac{\beta_q^t D_q \beta_q}{2\sigma_q^2}\,. \qquad (5.3)$$
Careful inspection of this approach indicates that the model can be viewed as a regression whose structure is defined by the conditional log-likelihood −L(β; W)/(2σ_ε²), where the spline coefficients are treated as random effects with possibly rank-deficient multivariate Normal distributions. More precisely, the model can be viewed as the mixed effects model
$$[W_i \mid \beta, \sigma_\epsilon^2] \sim N\!\left([B(s) \otimes Z_i]\beta,\; \sigma_\epsilon^2 I_p\right), \quad \text{for } i = 1, \ldots, n,$$

$$[\beta_q \mid \sigma_q^2] = \frac{\det(D_q)^{1/2}}{(2\pi)^{K_q/2}\,\sigma_q^{K_q}} \exp\!\left(-\frac{\beta_q^t D_q \beta_q}{2\sigma_q^2}\right), \quad \text{for } q = 0, \ldots, Q, \qquad (5.4)$$
where all conditional distributions are assumed to be mutually independent for i = 1, . . . , n and q = 0, . . . , Q. The crucial point is to notice that the likelihood of model (5.4) is equivalent to criterion (5.3) up to a constant that depends only on σ_q², q = 0, . . . , Q, and σ_ε². The advantage of model (5.4) is that it provides a natural way for estimating σ_q² and σ_ε² based on an explicit likelihood of a model. Moreover, note that the outcome likelihood [W_i | β, σ_ε²] does not need to be Gaussian. Indeed, it was only in the last step that we made the connection to the Normal likelihood. In general, this is not necessary and the same principles work for exponential family regression. The only difference is that we reparameterize λ_q = 1/σ_q² and make sure that the log-likelihood −L(β; W) is correctly specified for the outcome family
distribution. This model specification will also allow us to include error correlation structures via more complex specifications of the error ε_i(s). Finally, the prior distributions [β_q | σ_q²] are normal because we have chosen to work with quadratic penalties. This implies that we work only with Gaussian random effects. This assumption can be further relaxed depending on the requirements of the problem, what is computationally feasible, and how ambitious the modeling is.
The idea is remarkably simple and requires only careful inspection of equation (5.4) and
a willingness to view spline coefficients as random variables and penalized terms as prior
distributions. But once seen, it cannot be unseen. This approach has extraordinary impli-
cations because with this structure one can simply use existing mixed effects model estimation, inference, and software. This idea is not new in statistics and has been used exten-
sively in non-parametric smoothing. What is powerful is the fact that it extends seamlessly
to functional data, which allows us to use and compare powerful software such as mgcv,
nlme, Rstan [35, 281, 282], WinBUGS [48, 187], or JAGS [232]. The choice of frequentist or
Bayesian approaches becomes a matter of personal preference.
We will use the gam() function in the well-developed mgcv package to fit the FoSR
model with smoothness penalties. First, we explicitly create the binary indicator variable
for gender, and organize data into long format by unnesting the MIMS data stored as a
tf object. The resulting data frame has a row for each subject and epoch, containing the
MIMS outcome in that epoch as well as the scalar covariates of interest. This is analogous
to the creation of the response vector for model fitting using OLS, and begins to organize
covariates for inclusion in a design matrix.
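The data preparation code is not shown here; a minimal sketch is given below, with the data frame name nhanes_for_gam taken from the gam() call that follows and the remaining details assumed.

# Sketch of the data preparation described above; the gender recoding and
# column names are assumptions.
nhanes_for_gam =
  nhanes_df %>%
  mutate(gender = as.numeric(gender == "Female")) %>%
  select(SEQN, age, gender, MIMS) %>%
  tf_unnest(MIMS) %>%
  rename(epoch = MIMS_arg, MIMS = MIMS_value)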
With data in this form, we can fit the FoSR model using mgcv::gam() as follows.
gam_fit =
  gam(MIMS ~ s(epoch) + s(epoch, by = gender) + s(epoch, by = age),
      data = nhanes_for_gam)
Conceptually, this specification indicates that the expected MIMS value is the combination of three smooth functions of time (epoch): an intercept function, and the interactions (or products) of coefficient functions and the scalar covariates gender and age. Specifically, the first model term s(epoch) indicates the evaluation of a spline expansion over the values contained in the epoch column of the data frame nhanes_gam_df. The second and third terms add by = gender and by = age, respectively, which also indicate spline expansions over the epoch column but multiply the result by the corresponding scalar covariates. This process is analogous to the creation of the design matrix Z ⊗ B(s), although the s() function in gam() allows users to flexibly specify additional options for each basis expansion.
In contrast to the OLS estimation in Section 5.2.1.1, smoothness is induced in parame-
ter estimates through explicit penalization. By default, gam() uses thin-plate splines with
second derivative penalties, and selects tuning parameters for each coefficient using GCV
or REML [258, 322]. The results contained in gam fit are not directly comparable to those
obtained from OLS, and extracting coefficient functions requires some additional work. The
predict.gam() function can be used to return each element of the linear predictor for
a given data frame. In this case, we show how to return the smooth functions of epoch
corresponding to the intercept and coefficient functions. We therefore create a data frame
containing an epoch column consisting of the unique evaluation points of the observed func-
tions (i.e., minutes as stored in epoch_arg); a column gender, set to 1 for all epochs; and
a column age, also set to 1 for all epochs.
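A minimal sketch of this prediction step is given below; the name pred_df is hypothetical, while epoch_arg and gam_pred_obj follow the text.

# Sketch of the prediction step; type = "terms" returns one column per smooth.
pred_df =
  tibble(
    epoch = epoch_arg,
    gender = 1,
    age = 1)

gam_pred_obj =
  predict(gam_fit, newdata = pred_df, type = "terms")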
The result contained in gam_pred_obj is a 1440 × 3 matrix, with columns corresponding to β_0(s), 1 · β_1(s), and 1 · β_2(s), respectively. We convert these to tfd objects in the code below. Note that gam() includes the overall intercept as a (scalar) fixed effect, which must be added to the intercept function. With data structured in this way, we can then plot coefficient functions using tools seen previously.
gam_coef_df =
  tibble(
    method = "GAM",
    term = c("(Intercept)", "genderFemale", "age"),
    coef =
      tfd(t(gam_pred_obj), arg = epoch_arg)) %>%
  mutate(coef = coef + c(coef(gam_fit)[1], 0, 0))
Figure 5.8 compares the pointwise (epoch-level), B-spline regression spline, and penalized B-spline smoothing estimators of the intercept, age, and gender effects in the FoSR model (5.1). Each approach – separate epoch-level regressions, FoSR using OLS to estimate spline coefficients, and FoSR implemented with smoothness penalties in mgcv::gam() – yields qualitatively similar results regarding the effect of age and gender on diurnal MIMS trajectories. This suggests that all approaches can be used, at least in exploratory analyses or to understand general patterns. That said, there are also obvious differences. The epoch-level regressions do not borrow information across adjacent time points, and the OLS fit is sensitive to the dimension of the basis expansion in the model specification; both are wigglier, and perhaps less plausible, than the method that includes smoothness penalties. As a result, methods that explicitly borrow information across time and implement smoothness penalties with data-driven tuning parameters are often preferred for formal analyses.
FIGURE 5.8: Comparison of the pointwise (yellow), B-spline regression spline with 30 degrees of freedom (green), and penalized B-spline smoothing (purple) estimators of the time-dependent intercept, age, and gender effects in the FoSR model (5.1). The outcome is physical activity (MIMS) measured at the minute level.
FIGURE 5.9: Estimated functional residuals obtained after fitting model (5.1) using pe-
nalized spline smoothing of time-varying coefficients under the independence of residuals
assumption.
Most methods would focus on modeling the residuals ε_i(s) = X_i(s) + e_i(s), where X_i(s) follows a mean zero Gaussian Process (GP) with covariance function Σ [18, 69] and the e_i(s) are independent N(0, σ_e²) random errors. There are many strategies for jointly modeling the nonparametric mean and error structure of this model, including (1) joint modeling based on an FPCA decomposition of X_i(s); (2) functional additive mixed models (FAMM) using spline expansions of the error term X_i(·) [263, 260]; and (3) Bayesian posterior simulations for either type of expansion, which is related to Generalized Multilevel Function-on-Scalar Regression and Principal Component Analysis [106]. We first describe these approaches, while in Section 5.3 we introduce a fast, scalable alternative based on pointwise regressions for estimation and the bootstrap of study participants for inference, following the philosophy of [57].
Recall that the current model for our example with two scalar covariates Z_i1 and Z_i2 can be written as

$$W_i(s) = \beta_0(s) + Z_{i1}\beta_1(s) + Z_{i2}\beta_2(s) + X_i(s) + e_i(s). \qquad (5.5)$$
Any joint model is based on the decomposition of the structured error Xi (s) = B(s)ξi ,
where B(s) is a 1 × K dimensional vector of basis functions evaluated at s and ξi is a K × 1
dimensional vector of study participant-specific coefficients. The main difference between
the FPCA and FAMM approaches is that the former uses a data-driven FPCA basis and
the latter uses a pre-specified spline basis.
We now describe the approaches and their implementation in R. In keeping with the
philosophy of this book, models are fit using user-friendly functions that mask to some
degree the underlying complexity. However, we emphasize that these model fitting strategies
are grounded in familiar regression techniques.
Replacing X_i(s) by its FPCA expansion, the joint model becomes

$$W_i(s) = \beta_0(s) + Z_{i1}\beta_1(s) + Z_{i2}\beta_2(s) + \sum_{k=1}^{K} \xi_{ik}\phi_k(s) + e_i(s), \qquad (5.6)$$

where the scores ξ_ik ∼ N(0, λ_k) and the errors e_i(s) ∼ N(0, σ_e²) are mutually independent. Here we used the finite sum Σ_{k=1}^{K} ξ_ik φ_k(s) instead of the infinite sum Σ_{k=1}^{∞} ξ_ik φ_k(s) for practical purposes. The assumption is that the first K eigenfunctions explain most of the variation in X_i(s) and what is left unexplained is absorbed in the error e_i(s).
From our notation, ε_i(s) = X_i(s) + e_i(s) = W_i(s) − Σ_{q=0}^{Q} Z_{iq} β_q(s), with Z_{i0} = 1. Therefore, ε_i(s) is a zero mean Gaussian Process with covariance operator equal to Σ + σ_e² I, where I is the identity covariance operator. We have already shown that one could estimate the mean structure of model (5.1) using any number of techniques. This allows the estimation of the residuals ε̂_i(s) = W_i(s) − Σ_{q=0}^{Q} Z_{iq} β̂_q(s). These estimated residuals can then be thought of as functional data and decomposed using the FPCA techniques discussed in Chapter 3.
Thus, our approach is to (1) obtain β̂_q(s), the penalized spline estimators of β_q(s) for q = 0, . . . , Q, under the assumption of independent residuals; (2) obtain ε̂_i(s) = W_i(s) − Σ_{q=0}^{Q} Z_{iq} β̂_q(s); (3) estimate the eigenfunctions φ̂_k(s) by applying FPCA to the ε̂_i(s); and (4) fit the joint model (5.6) with the φ̂_k(s) plugged in instead of φ_k(s) and the β_q(s) functions modeled as penalized splines.
It is important to note that in this model there are two types of random coefficients or
random effects. The first type are the spline coefficients, which are treated as random to
ensure the smoothness of the βq (s) functions. The second type are the mutually indepen-
dent scores ξik ∼ N (0, λk ) corresponding to the orthonormal eigenfunctions describing the
residual correlation. These random coefficients play different roles in the model, but, from
a purely computational perspective, they can all be treated as random coefficients. This
makes model (5.6) a mixed effects model.
Before conducting this analysis, we first build intuition and make connections to well-known concepts in mixed effects modeling. Consider, for example, the case when K = 1 and φ_1(s) = 1 for all s ∈ S. In practice we will never have the luxury of this assumption, but let us indulge in simplification. With this assumption the model becomes

$$W_i(s) = \beta_0(s) + Z_{i1}\beta_1(s) + Z_{i2}\beta_2(s) + \xi_{i1} + e_i(s),$$

where the scores ξ_i1 ∼ N(0, λ_1) and the errors e_i(s) ∼ N(0, σ_e²) are mutually independent.
This model adds a random study participant-specific intercept to the population mean
function. It is a particular case of model (5.1), but allows for the dependence of residuals
within study participants, i. This should be familiar from traditional regression strategies
for longitudinal data. Indeed, one of the first steps to account for the dependence of residuals
is to add a random intercept. It also provides a useful contrast between longitudinal data
analysis and functional data analysis: the former typically makes assumptions that limit the
flexibility of subject-level estimates over the observation interval S, while the latter uses
data-driven approaches to add flexibility when appropriate.
To fit the random intercept model, we adapt our previous implementation for penalized spline estimation in a number of ways. Recall that nhanes_gam_df contains a long-form data frame with rows for each subject and epoch, and columns containing MIMS and scalar covariates. This data frame also contains a column of subject IDs, SEQN; importantly, this is encoded as a factor variable that can be used to define subjects in the random effects model.
In the R code, the terms corresponding to fixed effects are unchanged, but we add a term
s(SEQN) with the argument bs = "re". This creates a “smooth” term with a random ef-
fects “basis” – essentially taking advantage of the noted connection between semiparametric
regression and random effects estimation to obtain subject-level random effects estimates
and the corresponding variance component. Finally, we note that we use bam instead of gam,
and add arguments method = "fREML" and discrete = TRUE. These changes substantially
decrease computation times.
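The code for this random intercept fit is not included in the excerpt; a minimal sketch, with the object name nhanes_ri_fit and the data frame nhanes_gam_df assumed, is:

# Sketch of the random intercept fit described above; names are assumptions.
nhanes_ri_fit =
  bam(MIMS ~ s(epoch) + s(epoch, by = gender) + s(epoch, by = age) +
        s(SEQN, bs = "re"),
      method = "fREML", discrete = TRUE, data = nhanes_gam_df)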
nhanes_fpca_df =
  nhanes_for_gam %>%
  mutate(
    fitted = fitted(gam_fit),
    resid = MIMS - fitted) %>%
  select(SEQN, epoch, resid) %>%
  tf_nest(resid, .id = SEQN, .arg = epoch)
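The FPCA step itself is not shown; a minimal sketch producing the object nhanes_resid_fpca referenced below is given here, where the use of refund::fpca.face() is an assumption motivated by the later mention of FACE.

# Sketch of the FPCA decomposition of the estimated residuals; the conversion
# to a matrix and the fpca.face() call are assumptions.
resid_mat = as.matrix(nhanes_fpca_df$resid)
nhanes_resid_fpca = refund::fpca.face(resid_mat, argvals = epoch_arg)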
The last step in our approach is to treat the resulting eigenfunctions φ̂_k(s) as "known" and replace the φ_k(s) functions in model (5.6). These estimated eigenfunctions are stored in nhanes_resid_fpca. We will use a variation on the random intercept implementation to do this. However, in this context the ξ_ik are uncorrelated random slopes on the "covariates" φ̂_k(s), which are the FPCs evaluated over the functional domain. Put differently, we want to scale one component in our model by another; to do this, we will again make use of the by argument in the s() function.
The code chunk below defines the data frame necessary to implement this strategy. It repeats code seen before to convert the nhanes_df data frame to a format needed by mgcv::gam(), by creating an indicator variable for gender and unnesting the MIMS data stored as a tf object. However, we also add a column fpc that contains the first FPC estimated above. The FPCs are the same for each participant and are treated as a tf object. We then unnest both MIMS and fpc to produce a long-format data frame with a row for each subject and epoch. These data contain just the first principal component as the random slope covariate, but one can easily modify the code to add more PCs.
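The book builds this data frame by nesting the FPC as a tf column and unnesting it together with MIMS; an equivalent construction is sketched below under the assumption that nhanes_gam_df and nhanes_resid_fpca exist as described above, with the name nhanes_gam_dfm taken from its later use in the text.

# Sketch: add the first estimated FPC as an epoch-level covariate; the way the
# FPC is extracted from nhanes_resid_fpca is an assumption.
fpc_df =
  tibble(
    epoch = epoch_arg,
    fpc = nhanes_resid_fpca$efunctions[, 1])

nhanes_gam_dfm =
  nhanes_gam_df %>%
  left_join(fpc_df, by = "epoch")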
With these data organized appropriately, we can estimate coefficient functions and subject-level FPC scores using a small modification to our previous random intercept approach. We again estimate subject-level effects using s(SEQN) and a random effect "basis" that adds a random intercept for each participant. Including by = fpc scales the random effect basis by the FPC value in each epoch, effectively creating the term ξ_i1 φ_1(s) by treating φ̂_1(s) as known. The code can easily be modified to incorporate additional principal components.
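A minimal sketch of this fit is shown below; the object name nhanes_gamm_fpc matches its later use in the text, while the remaining settings are assumptions.

# Sketch of the FPC-based fit described above; settings are assumptions.
nhanes_gamm_fpc =
  bam(MIMS ~ s(epoch) + s(epoch, by = gender) + s(epoch, by = age) +
        s(SEQN, by = fpc, bs = "re"),
      method = "fREML", discrete = TRUE, data = nhanes_gam_dfm)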
We now show how to extract the quantities necessary to conduct estimation and inference for the coefficient functions β_q(s). Our approach to constructing pointwise confidence intervals is based on the spline-based estimation approaches considered so far, as well as the assumption that the spline coefficient estimates have an approximately multivariate Normal distribution. Treating tuning parameters as fixed, for any s ∈ S the variance of β̂_q(s) is given by

$$\text{Var}\{\hat{\beta}_q(s)\} = B(s)\,\text{Var}(\hat{\beta}_q)\,B(s)^t,$$

where β̂_q is the vector of estimated spline coefficients. From this, one can obtain standard errors and construct a confidence interval for β_q(s) using the assumption of normality.
A critical component is the covariance of the estimated spline coefficients, Var(β̂_q), which is heavily dependent on the modeling assumptions used to estimate the spline coefficients. We have fit three FoSR models using penalized splines for the coefficient functions, with different assumptions about the residuals: that residuals are uncorrelated within a subject; that residual correlation can be modeled using a random intercept; and that residual correlation can be accounted for using FPCA. We now compare the estimated coefficient functions and pointwise confidence intervals obtained from these methods.
To extract coefficient functions and their standard errors from the objects produced by mgcv::gam(), we use the predict() function. This function takes an input data frame that has all covariates used in the model, including the FPC; here we will use the first 1440 rows of the nhanes_gam_dfm data frame, which has all epoch-level observations for a single subject. We set gender and age to 1 as before, so the terms produced by predict() will correspond to the coefficient functions. When calling predict(), we now set the argument se.fit = TRUE, so that both the estimated coefficients and their standard errors are returned.
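A minimal sketch of this step is shown below; the name pred_dfm is hypothetical, while nhanes_fpc_pred_obj follows its use in the code that follows.

# Sketch of the predict() call described above; the construction of the
# single-subject prediction data frame is an assumption.
pred_dfm =
  nhanes_gam_dfm %>%
  slice(1:1440) %>%
  mutate(gender = 1, age = 1)

nhanes_fpc_pred_obj =
  predict(nhanes_gamm_fpc, newdata = pred_dfm,
          type = "terms", se.fit = TRUE)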
coef_df =
  tibble(
    term = c("(Intercept)", "genderFemale", "age"),
    coef = tfd(t(nhanes_fpc_pred_obj$fit[, 1:3]), arg = epoch_arg),
    se = tfd(t(nhanes_fpc_pred_obj$se.fit[, 1:3]), arg = epoch_arg)) %>%
  mutate(coef = coef + c(coef(nhanes_gamm_fpc)[1], 0, 0)) %>%
  mutate(
    ub = coef + 1.96 * se,
    lb = coef - 1.96 * se)
FIGURE 5.10: Point estimators and 95% confidence intervals for the regression coefficients
of MIMS on age and gender. The models assume independent errors, parametric random
intercepts, and use FPCA to decompose the within-curve error structures.
Although we do not show all steps here, the same approach can be used to extract coefficient functions and confidence intervals for the three approaches to accounting for residual correlation. The results are combined into comparison_plot_df and plotted in Figure 5.10, again using geom_errorband() to show upper and lower confidence limits.
The coefficient functions obtained by all methods are similar to each other and to those
based on epoch-level regressions. The confidence bands, meanwhile, differ substantially in
a way that is intuitive based on the assumed error structures. Assuming independence
fails to capture any of the correlation that exists within subjects, and therefore has overly
narrow confidence bands. Using a random intercept accounts for some of the true correlation
but makes restrictive parametric assumptions on the correlation structure. Because this
approach effectively induces uniform correlation over the domain, the resulting intervals
are wider than those obtained under the model assuming independence but have a roughly
fixed width over the day. Finally, modeling residual curves using FPCA produces intervals
that are narrower in the nighttime and wider in the daytime, which more accurately reflects
the variability across subjects in this dataset. This model suggests a significant decrease in
MIMS as age increases over much of the day, and a significant increase in MIMS comparing
women to men in the morning and afternoon.
In some ways, it is unsurprising that the coefficient function estimates produced under
different assumptions are similar. After all, each is a roughly unbiased estimator for the
fixed effects in the model, and differences in the error structure are primarily intended to
inform inference. But the complexity of the underlying maximization problem can produce
counter-intuitive results in some cases. In this analysis, the results of the FPCA model
fitting are sensitive to the degree of smoothness in the FPC. When the less-smooth FPCs
produced by the default FACE settings were used, the coefficient function estimates were
somewhat attenuated. At the same time, the random effects containing FPC scores were
dependent on the scalar covariates in a way that exactly offset this attenuation. This does
not appear to be an issue with the gam() implementation, because “by hand” model fitting
showed the same sensitivity to smoothness in the FPCs. Instead, we believe this issue stems
from subtle identifiability issues and the underlying complexity of the penalized likelihood.
nhanes_famm_df =
  nhanes_df %>%
  mutate(
    MIMS_hour_tf =
      tf_smooth(MIMS_tf, method = "rollmean", k = 60, align = "center"),
    MIMS_hour_tf = tfd(MIMS_hour_tf, arg = seq(.5, 23.5, by = 1))) %>%
  tf_unnest(MIMS_hour_tf) %>%
  rename(epoch = MIMS_hour_tf_arg, MIMS_hour = MIMS_hour_tf_value) %>%
  pivot_wider(names_from = epoch, values_from = MIMS_hour)

# Construct the response matrix from the widened hourly columns; this
# intermediate step is implied by the mutate() call below.
MIMS_hour_mat =
  nhanes_famm_df %>%
  select(-SEQN, -gender, -age) %>%
  as.matrix()

nhanes_famm_df =
  nhanes_famm_df %>%
  select(SEQN, gender, age) %>%
  mutate(MIMS_hour_mat = I(MIMS_hour_mat))
We now fit the function-on-scalar model using the pffr() function to analyze the association of age and gender with physical activity, where each residual X_i(s) is modeled using penalized splines. The syntax is shown below.
nhanes_famm =
  nhanes_famm_df %>%
  pffr(MIMS_hour_mat ~ age + gender + s(SEQN, bs = "re"),
       data = ., algorithm = "bam", discrete = TRUE,
       bs.yindex = list(bs = "ps", k = 15, m = c(2, 1)))
Notice that the fixed effects are now represented differently. The functional intercept β_0(s) is specified automatically in the pffr() syntax. In addition, the functional fixed effects β_q(s)Z_iq are specified by indicating the covariate Z_iq only. For example, recall that in mgcv::bam() the age effect was specified as s(epoch, by = age); in refund::pffr() the same term is specified simply as age. To specify the functional random effects X_i(s), the syntax in refund::pffr() is s(SEQN, bs = "re"), which indicates that a subject-specific penalized spline is used. Although this syntax is similar to the one used in the function mgcv::bam(), in refund::pffr() it actually specifies a functional random intercept instead of a scalar random intercept for each subject. The two approaches have very similar syntax, which could lead to confusion, but they represent different models for the residual correlation.
The characteristics of the penalized splines used for fitting the functional residuals X_i(s) are indicated in the bs.yindex argument. This is a list that specifies the penalized spline structure. In our case, bs = "ps" indicates the P-splines proposed by [71], k = 15 indicates the number of basis functions, and m = c(2, 1) specifies a second-order penalized spline basis (quadratic spline) with a first-order difference penalty. These settings were modified from the default number of knots, k = 5, which is insufficient in our application. In this example, increasing the number of basis functions to 15 results in significantly longer computation times due to the high time complexity of pffr(). Specifically, the fit takes about 70 seconds to complete on a local laptop using 5 basis functions, while it takes about 27 minutes on the same laptop when specifying 15 basis functions. Once specified, the spline penalty and basis functions are the same for all X_i(s). The spline structure for the fixed functional intercept was left at its default and could be changed using the bs.int argument.
FIGURE 5.11: Point estimators and 95% confidence intervals for the regression coefficients
for function-on-scalar regressions of hour-level MIMS on age and gender using pffr(), where
each residual is modeled using penalized splines. Results are shown for the intercept, age,
and gender parameters, respectively.
Here we will not provide the details of Bayesian implementations, but we will show the
general ideas for simulating from the complex posterior distribution. Assume that we have
prior distributions for all parameters in model (5.5). Then the general structure of the full
conditionals of interest is
1. [β_q(s) | others] for q = 0, . . . , Q;
2. [X_i(s) | others] for i = 1, . . . , n;
3. [σ_e² | others].
The notation here is focused on the concepts, as the detailed description of each conditional
distribution is notationally burdensome. For example, the full conditional [βq (s)|others]
refers to the full conditional distribution of the spline basis coefficients used for the expan-
sion of βq (s) and of the smoothing parameter associated with this function. Recall that when
sampling from [β_q(s) | others], all other parameters, including X_i(s) and σ_e², are fixed. Therefore, these full conditionals are relatively simple and well understood. For example, with standard choices of prior distributions, the full conditional of the spline coefficients of β_q(s) is a multivariate Normal distribution, while that of the smoothing parameter is an inverse Gamma distribution [51, 258]. Thus, this step can be conducted using direct sampling from the
full conditional distribution without the need for a Rosenbluth-Metropolis-Hastings (RMH)
step. The full conditionals for βq (s) contain information from all study participants, as they
appear in the likelihood for every subject in the study.
At every step of the simulation, the full conditionals [Xi (s)|others] depend only on the
likelihood for study participant i. The information from the other study participants is
encoded in the population level parameters βq (s) and σe2 , which are fixed because they are
conditioned on at this step. If Xi (s) are expanded into a basis (e.g., FPCA or splines),
the basis coefficients have multivariate Normal distributions if standard Normal priors are
used. The smoothing parameters are updated depending on the structure of the penalties.
For example, for the FPCA basis the scores on the kth component are assumed to follow
a N (0, λk ) distribution. It can be shown that with an inverse-Gamma prior on σk2 = λk ,
the full conditional for [σk2 |others] is an inverse Gamma. Similarly, if we use a spline basis
expansion for Xi (s), the variance parameter that controls the amount of smoothing has an
inverse-Gamma full conditional distribution if an inverse-Gamma prior is used. The specific
derivation looks different if we use one smoothing parameter per function Xi (s) or one
smoothing parameter for all functions. We leave the details to the reader. Irrespective of
the modeling structure (FPCA or splines), the number of full conditionals [Xi (s)|others]
increases with the number of subjects, though this increase is linear in the number of
subjects.
Finally, with an inverse-Gamma prior on σ_e², it can be shown that [σ_e² | others] is an
inverse-Gamma, which makes the last step of the algorithm relatively straightforward.
Therefore, in the case of Gaussian FoSR with functional residuals modeled as a basis
expansion, all full conditionals are either multivariate normals or inverse-Gamma. This
allows the use of Gibbs sampling [37, 95] without the RMH [42, 113, 200, 256] step, which
tends to be more stable and easier to implement.
When the functional data are not Gaussian, many of the full conditionals will require
an RMH step, which substantially increases computational times and requires tuning of the
proposal distributions. This can be done, but requires extra care when implementing the
software.
The Bayesian perspective is very useful and provides extraordinary flexibility. It has several advantages, including: (1) it provides a joint-modeling approach that more fully accounts for the uncertainty in model parameters; (2) it introduces a more unified approach to inference, where all parameters are random variables and the difference between "random" and "fixed" effects is modeled via distributional assumptions; (3) it simulates the full joint distribution of all model parameters given the data; (4) it can produce predictions and uncertainty quantification for missing data within and outside the observed domain of the functions; and (5) it provides an inferential framework for more complex analyses that cannot currently be handled by existing non-Bayesian software. However, it also has several limitations, including: (a) deciding what priors to choose and what priors are non-informative is very difficult in highly complex models; (b) some priors, such as inverse Gamma priors on variance components, do not allow the variance to be zero, which would correspond to simpler parametric models; (c) computations tend to take longer, though they may still be scalable with enough care; (d) small changes in models still require substantial changes in implementations; and (e) as implementations are slow, realistic simulation analysis of software performance and accuracy is often not conducted.
min_regressions =
  nhanes_df %>%
  select(SEQN, age, gender, MIMS_tf) %>%
  tf_unnest(MIMS_tf) %>%
  rename(epoch = MIMS_tf_arg, MIMS = MIMS_tf_value) %>%
  nest(data = -epoch) %>%
  mutate(
    model = map(.x = data, ~ lm(MIMS ~ age + gender, data = .x)),
    result = map(model, broom::tidy)
  ) %>%
  select(epoch, result) %>%
  unnest(result)
The next step in this analysis is to smooth the regression coefficients across epochs. Many techniques are available for this; the approach implemented below organizes epoch-level regression coefficients as tf objects in a column called coef, and then smooths the results using a lowess smoother via the tf_smooth() function. The resulting data frame has three columns: term (taking values (Intercept), age, and genderFemale), coef (containing the unsmoothed results of the epoch-level regressions), and smooth_coef (containing the smoothed versions of the values in coef). Figure 5.12 below contains a panel for each term, and shows epoch-level and smoothed coefficients. Note the similarity between the smoothed coefficients and those obtained by "functional" approaches, including penalized splines; this suggests that this technique provides plausible results, even though smoothing is conducted while ignoring the correlation of functional residuals.
ui_coef_df =
  min_regressions %>%
  select(epoch, term, coef = estimate) %>%
  tf_nest(coef, .id = term, .arg = epoch) %>%
  mutate(smooth_coef = tf_smooth(coef, method = "lowess"))
A main focus of Section 5.2.2 was to model error structures and thereby obtain accurate
inference. The scalable approach we suggest in this section models each epoch separately,
but the residual correlation is implicit: regression coefficients across epochs are related
through the residual covariance. This fact, and the scalability of the estimation algorithm,
suggests that bootstrapping is a plausible inferential strategy in this setting. In particular,
we suggest the following: resample participants, including their full response functions, with replacement to create bootstrap samples; fit epoch-level regressions for each bootstrap sample and smooth the results; and construct confidence intervals based on the results. This resampling strategy preserves the within-subject correlation structure of the full data without making additional assumptions on the form of that structure, and accounts for missing data at the subject level using standard mixed effects approaches. From a computational perspective, the bootstrap always increases computation time because one needs to refit the same model multiple times. However, pointwise regression and smoothing is a simple and relatively fast procedure, which makes the entire process much more computationally scalable than the joint modeling approaches described in Section 5.2.2. Moreover, the approach is
easy to streamline and parallelize, which can further improve computational times.
Our implementation of this analysis relies on a helper function nhanes_boot_fui(), which
has arguments seed and df. This function contains the following steps: first, it sets the seed
to ensure reproducibility and then creates a bootstrap sample from the provided data frame;
second, it creates the hat matrix that is shared across all epoch-level regressions; third, it
estimates epoch-level coefficients by multiplying the hat matrix by the response vector at
each epoch; and fourth, it smooths these coefficients and returns the results. Because these
steps are relatively straightforward, we defer the function to an online supplement. The code
chunk below uses map() and nhanes_boot_fui() to obtain results across 250 bootstrap samples.
In practice one may need to run more bootstrap iterations, but this suffices for illustration.
After unnesting, we have smooth coefficients for each iteration and can plot the full-sample
estimates with the results for each bootstrap resample in the background.
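Because nhanes_boot_fui() is deferred to the online supplement, a minimal sketch of what such a helper might look like is given below; the function body is an assumption, not the authors' implementation, and it assumes a wide data frame with one row per participant, scalar covariates age and gender, and a matrix column MIMS_mat of epoch-level responses.

# Sketch of a helper in the spirit of nhanes_boot_fui(); all details of the
# body, including the MIMS_mat column, are assumptions.
nhanes_boot_fui = function(seed, df) {
  set.seed(seed)
  boot_df = df[sample(nrow(df), replace = TRUE), ]

  # hat matrix shared across all epoch-level regressions
  X = model.matrix(~ age + gender, data = boot_df)
  H = solve(t(X) %*% X) %*% t(X)

  # epoch-level coefficients: one row per term, one column per epoch
  coefs = H %*% boot_df$MIMS_mat

  # smooth each coefficient function across epochs and return
  t(apply(coefs, 1, function(b) lowess(b)$y))
}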
FIGURE 5.13: Effects of age and gender on physical activity in the NHANES data using a
fast, scalable approach for estimation and bootstrapping for inference.
6
Function-on-Function Regression
We now consider the case when the response is a function and at least one predictor is a
function. This is called function-on-function regression (FoFR), which generalizes the scalar-
on-function regression (SoFR) in Chapter 4 and the function-on-scalar regression (FoSR)
in Chapter 5. This generalization provides the flexibility to model the association between
functions observed on the same study participant. We will focus on the linear FoFR and
provide details about possible extensions. FoFR is known under different names and was first
popularized by [242, 245] who introduced it as a functional linear model with a functional
response and functional covariates; see Chapter 16 in [245]. Here we prefer the more precise
FoFR nomenclature introduced by [251, 253], which refers directly to the type of outcome
and predictor. It is likely that this area of research was formalized by the paper [241], where
the model was referred to as the “linear model” and was applied to a study of association
between precipitation and temperature at 35 Canadian weather stations. FoFR may have
had other points of origin, as well, but we were not able to find them. The history of the
origins of ideas is fascinating and additional information may become available. There is
always the second edition, if we remain healthy and interested in FDA.
FoFR has been the subject of intense methodological research, at least in statistics. In partic-
ular, it has been applied to multiple areas of scientific research including biliary cirrhosis
and association between systolic blood pressure and body mass index [149, 335], medfly
mortality [122, 326], forecasting pollen concentrations [300], weather data [16, 195], on-
line virtual stock markets [78], evolutionary biology [308], traffic prediction [46], jet engine
temperature [125], daily stock prices [249], bike sharing [148], lip movement [190], environ-
mental exposures [115], and electroencephalography [201], to name a few. Just as discussed
in Chapters 4 and 5, these papers are referenced here for their specific area of applica-
tion, but each contains substantial methodological developments that could be explored in
detail.
Here we will focus on methods closely related to the penalized function-on-function re-
gression framework introduced and extended by [110, 128, 263, 260, 262] and implemented
in the rich and flexible function refund::pffr. The software and methods allow multiple
functional predictors, smooth effects of scalar predictors, functional responses and/or co-
variates observed on possibly different non-equidistant or sparse grids, and provide inference
for model parameters. Regression methods are based on the general philosophy described
in this book: (1) expand all model parameters in a rich spline basis and use a quadratic
penalty to smooth these parameters; (2) use the connection to linear mixed effects models
(LME) for estimation and inference; (3) use existing powerful software designed for non-
parametric smoothing to fit nonparametric functional regression models; and (4) produce
and maintain user friendly software. Developing such software is not straightforward and
simplicity of use should be viewed as a feature of the approach. This simplicity is the result
of serious methodological development as well as trial and error of a variety of approaches.
This philosophy and general approach can be traced to [48, 102], who developed the meth-
ods for SoFR, but the same ideas can be generalized and extended to many other models. It
is this philosophy that allows for generalization of knowledge and seamless implementation
of methods.
In this chapter we will focus on the intuition and methods behind FoFR as implemented in refund::pffr. As we mentioned, there are many other powerful methods that could be considered, but, for practical purposes, we focus on pffr and the mixed effects representation of nonparametric functional regression. To start, we first describe two motivating examples: (1) quantifying the association between weekly excess mortality in the US before and after a particular date (e.g., week 20); and (2) predicting future growth measurements in children from measurements up to a particular time point (e.g., day 100 after birth).
6.1 Examples
6.1.1 Association between Patterns of Excess Mortality
Consider the problem of studying the association between patterns of US weekly excess
mortality in 52 US states and territories before and after a particular time during 2020.
Figure 6.1 provides an illustration of this problem, where the "future" trajectories are regressed on the "past" trajectories. The boundary between "past" and "future" is indicated by the blue vertical line and separates data before May 23, 2020 from data on and after May 23, 2020. Data for all states and territories are displayed as light gray lines, while sev-
eral states are highlighted using color: New Jersey (green), Louisiana (red), and California
(plum). For these states, data from the “past” is shown as solid lines, while data from the
“future” is shown as dots. The general problem is to identify patterns of associations across
US states between the trajectories before and after May 23, 2020. Of course, this particular
separation between “past” and “future” is arbitrary and other cutoffs could be considered.
In this chapter we will discuss several ways to conceptualize and quantify such associations.
that r = 1 and S = U1 = . . . = UR = [0, 1]. However, methods apply more generally with
more complex notation and integral operations. For this case, the linear function-on-function
regression (FoFR) model was first proposed by [241] and has the following form
$$W_i(s) = f_0(s) + \int_U X_i(u)\beta(s, u)\,du + \epsilon_i(s)\,, \qquad (6.1)$$

where the ε_i(s) ∼ N(0, σ_ε²) are independent random noise variables. While this assumption is
often made implicitly or explicitly in FoFR, it is a very strong assumption as the residuals
from such a regression often have substantial residual correlations. However, model (6.1) is
a good starting point for estimation, while acknowledging that inference requires additional
considerations.
There is a large literature addressing this problem; see, for example, [1, 16, 122, 123,
125, 234, 237, 249, 326, 335]. While some of these approaches have been implemented, their
use in applications should be considered on a case-by-case basis.
For presentation purposes we assume that f0 (s) = 0, even though f0 (s) and β(s, u)
can be estimated simultaneously. Before diving deeper into methods for estimation, it is
worth taking a step back to build some intuition about the interpretation of the model.
To start, let us fix the location of the outcome function, s. Then model (6.1) is simply a
scalar-on-function regression (SoFR), as discussed in Chapter 4. Therefore, one approach
where B_k(s, u) is a basis in R². For presentation purposes we use the tensor product of splines, where k is an indexing of the pair (k_1, k_2) and B_k(s, u) = B_{k_1,k_2}(s, u) = B_{k_1,1}(s) B_{k_2,2}(u) for k_1 = 1, . . . , K_1 and k_2 = 1, . . . , K_2, where K_1 and K_2 are the number of bases in the first and second dimension, respectively. With this notation, the total number of basis functions is K = K_1 K_2 and

$$\beta(s, u) = \sum_{k_1=1}^{K_1} \sum_{k_2=1}^{K_2} \beta_{k_1 k_2} B_{k_1,1}(s) B_{k_2,2}(u)\,.$$
where D_{i,k_2} = ∫_U X_i(u) B_{k_2,2}(u) du is a random variable that depends on i (the study participant) and k_2 (the index of the basis function in the second dimension), but not on u (the argument of the function X_i(u)). In practice, the function X_i(u) is not observed at every u and Riemann sum approximations to the integrals ∫_U X_i(u) B_{k_2,2}(u) du will be used instead. Similarly, the function W_i(s) is not observed at every point s and is observed on a grid s_1, . . . , s_p instead. Denote C_{j,k_1} = B_{k_1,1}(s_j). With this notation, model (6.1) becomes the following standard regression model
$$W_i(s_j) = \sum_{k_1=1}^{K_1} \sum_{k_2=1}^{K_2} \beta_{k_1 k_2} C_{j,k_1} D_{i,k_2} + \epsilon_i(s_j)\,, \qquad (6.2)$$
where A_{ji,k_1,k_2} = C_{j,k_1} D_{i,k_2} are the predictors of W_i(s_j) and the regression parameters are β = (β_{11}, β_{12}, . . . , β_{K_1 K_2})^t. Denote by W the (np) × 1 vector obtained by stacking the W_i(s_j) and by X the corresponding design matrix with entries A_{ji,k_1,k_2}. With this notation, model (6.2) can be written as

$$W = X\beta + \epsilon\,, \qquad (6.3)$$

where ε is the (np) × 1 dimensional vector with entries ε_i(s_j) ordered the same way as W_i(s_j) in the
vector W. Therefore, fitting a parametric FoFR model is equivalent to fitting a standard
regression model. Fitting a nonparametric smoothing model adds quadratic penalties on
the β parameters, as described in Section 2.3.2. If Dλ is a penalty matrix that depends on
the vector of smoothing parameters λ then a penalized spline criterion would be of the type
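The criterion itself is not reproduced here; under the setup above, a standard penalized least squares form consistent with the definitions of W, X, β, and D_λ (a reconstruction, not the book's exact display) is
\[
\|W - X\beta\|^2 + \beta^t D_\lambda \beta \;,
\]
which is minimized over β for a fixed value of λ.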
As discussed in Section 2.3.3, this criterion is equivalent to fitting a linear mixed effects
model, where λ are ratios of variance components. Therefore, the nonparametric model
can be estimated using mixed effects methods, which also produce inference for all model
parameters. By now we have repeated this familiar tune: (1) functional regression can be
viewed as a standard regression; (2) smooth FoFR can be viewed as a bivariate penalized
regression with specific design matrices; and (3) smooth FoFR estimation and inference can
be conducted using a specific mixed effects model that can be fit using existing software.
Each step is relatively easy to understand, but when considered together they provide a
recipe for using existing software directly for estimation and inference for FoFR.
In this case we have considered the tensor product of splines, but the algebra works
similarly for any type of bivariate splines, including thin-plate splines. Each choice of basis
can be used with its standard quadratic penalties; see, for example, the construction of tensor product bases using the te, ti, and t2 options in the mgcv package. The main message here is that FoFR can be reduced to a nonparametric bivariate regression with
a specific regression design matrix and appropriate penalties. Once this is done, software
implementation requires only careful accounting of parameters and model structure.
While we have considered the case when the outcome functions Wi (s) are continuous
and Gaussian, this assumption is not necessary. The same exact approach works with non-
Gaussian data including binary and count observations. The method could also be extended
to any other bases, as long as they involve a quadratic penalty on the model coefficients.
While non-quadratic penalties are possible, they are not covered in this book.
In Section 6.3 we show how to use pffr to fit such models directly and how to expand
models to multiple functional predictors as well as nonparametric time-varying effects of
covariates. We also show how to change the approach to account for sparse FoFR in Sec-
tion 6.3 using mgcv. Recall that pffr is based on mgcv, though making the connection and
showing exactly how to do that was a crucial contribution [128, 260, 263]. Indeed, other
packages for FoFR exist, including the linmod and fRegress functions in the fda R package [246] and
PACE [334] in MATLAB. However, linmod and PACE cannot currently handle multiple func-
tional predictors or linear or non-linear effects of scalar covariates. The fRegress function
is restricted to concurrent associations with the predictor and outcome functions being re-
quired to be observed on the same domain. Another package that could be considered is fda.usc [80]. Here we focus on the refund::pffr and mgcv::gam functions, which provide a highly flexible estimation and inferential approach for a wide variety of FoFR models. We encourage readers to try multiple approaches and decide for themselves what works, and when, for a particular application.
where the square root operator is applied separately for each vector and the diag operator is
the diagonal operator for a matrix. As discussed in Section 2.4, correlation and multiplicity
adjusted (CMA) confidence intervals can be obtained under the normality assumption of β̂ or using simulations from the max deviation statistic under the assumption of a spherically symmetric distribution of W. Pointwise and global CMA p-values can then be obtained by searching for the supremum value of α where the corresponding 1 − α level confidence intervals do not cross zero. We will address this in depth in Section 6.5.2.
To construct 95% prediction intervals for individual observations one needs to account for the residual variance, σ²_ε. Thus, the pointwise 95% prediction intervals have the following structure
\[
\hat{W}_i \pm 1.96 \sqrt{\mathrm{diag}\{V(\hat{W}_i) + \hat{\sigma}^2_\epsilon I_p\}} \;,
\]
where σ̂²_ε is an estimator of the residual variance, σ²_ε. Here V(Ŵ_i) characterizes the uncertainty of the estimated prediction and σ̂²_ε I_p characterizes the uncertainty of the observations
around this prediction. The prediction intervals for individual observations tend to be much
larger than the corresponding confidence intervals. The problem of building correlation and multiplicity adjusted (CMA) prediction intervals is, as of yet, an open methodological problem.
Confidence and prediction intervals can be obtained at any other set of points s_0, not just at s_j ∈ S. This requires recalculating the design matrix that corresponds to these points, say X_0, and conducting the same inference for Ŵ_0 = X_0β̂ instead of Ŵ_i = X_iβ̂. The index 0 here is a generic notation for data at a new point, while the index i is the notation for data at an observed point i within the sample.
Once predictions are calculated it is easy to compute the estimated residuals ε̂_i(s_j) = W_i(s_j) − Ŵ_i(s_j) for every i and j. These residuals can be used to investigate whether the
assumptions about these residuals hold in a particular model and data set. Simple tests for
normality and zero serial correlations can be applied to these estimated residuals to point
out potential problems with the model and suggest modeling alternatives. Large residuals
can also be investigated to identify portions of functions that are particularly difficult to
fit. See Section 6.3.1 for an example of residual analysis using the refund::pffr function.
If model assumptions are violated, questions can be raised about the validity and perfor-
mance of confidence intervals. In such situations, we recommend using the point estimators and supplementing the confidence intervals obtained from the model with confidence intervals obtained from a nonparametric bootstrap of study participants.
In this section we use model (6.1) to predict weekly excess mortality in US states after the week starting on May 23, 2020, using the excess mortality observed before that week (first 20 weeks of the year). This problem was first described
in Section 6.1.1 and illustrated in Figure 6.1.
In this case, the outcome functions are Wi : {21, . . . , 52} → R, where Wi (sj ) is the
excess mortality for state or territory i in week sj ∈ {21, . . . , 52}. The predictor is Xi :
{1, . . . , 20} → R, where Xi (u) is the excess mortality for state or territory i in week u ∈
{1, . . . , 20}. Here 21 corresponds to the week of May 23, 2020.
We now show how to implement this approach using pffr. Assume that the data are
stored in the matrix Wd, which is 52 × 52 dimensional because there are 52 states and ter-
ritories in the data and 52 weeks in 2020. Each row corresponds to one state and each
column corresponds to one week in 2020. The code below shows how data are separated into functional outcomes, W_i(·), and predictors, X_i(·).
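The original code chunk is not reproduced here; a minimal sketch consistent with the description above is shown below. The index vectors s and t and the pffr() call mirror the m2 fit shown later in this section; the object names are assumptions.

#Separate the 52 x 52 matrix Wd into functional predictors and outcomes
X <- Wd[, 1:20]    #weekly excess mortality in weeks 1-20 (predictors)
W <- Wd[, 21:52]   #weekly excess mortality in weeks 21-52 (outcomes)
s <- 1:20          #argument of the predictor functions
t <- 21:52         #argument of the outcome functions
#Fit the FoFR model (6.1); pffr includes a functional intercept by default
library(refund)
m1 <- pffr(W ~ ff(X, xind = s), yind = t)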
This model is the FoFR model (6.1) with a domain-varying intercept, f0 (·), and a bi-
variate smooth functional effect, β(·, ·). Both of these effects are modeled nonparametrically
using penalized splines. We show now how to extract the estimators of these functions from
the pffr-object m1, which is a gam-object with some additional information.
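A possible way to extract the estimated functional intercept and its standard error is sketched below; the element and column names inside the coef() output are assumptions that may differ across refund versions.

#Extract all estimated coefficient functions from the pffr fit
m1_coef <- coef(m1)
#Assumed: the first smooth term is the functional intercept f0(s)
f0_df <- m1_coef$smterms[[1]]$coef    #data frame with grid, value, and se columns
plot(f0_df[[1]], f0_df$value, type = "l", xlab = "Week", ylab = "Excess mortality")
lines(f0_df[[1]], f0_df$value - 1.96 * f0_df$se, lty = 2)
lines(f0_df[[1]], f0_df$value + 1.96 * f0_df$se, lty = 2)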
FIGURE 6.3: Intercept and 95% confidence interval for the FoFR regression implemented
using pffr. The regression is of the excess mortality in 50 US states and 2 territories in the
last 32 weeks of 2020 on the excess mortality in the first 20 weeks of 2020.
The estimated intercept is close to zero immediately after the week of May 23, 2020, with some states having negative excess mortality. However, towards the end of
the period almost all states have positive excess mortality with the mean around 80 for
every one million individuals in the US. In fact, the nonparametric estimator of the mean
displayed in Figure 6.3 is similar to the observed data for the state of California (plum)
displayed in Figure 6.1.
We now show how to extract the surface β(s, u) : [0, 20] × [20, 52] → R. The code below
extracts the surface as well as the s and u coordinates of where the surface is estimated.
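A possible sketch of this extraction is shown below; the positions and column names inside coef(m1) are assumptions.

#Assumed: the second smooth term is the bivariate coefficient beta(s, u)
beta_df <- coef(m1)$smterms[[2]]$coef   #long-format data frame with value and se
u_coord <- unique(beta_df[[1]])         #coordinates on the predictor domain
s_coord <- unique(beta_df[[2]])         #coordinates on the outcome domain
#Surface stored as a matrix for plotting, e.g., with fields::image.plot
smcoef  <- matrix(beta_df$value, nrow = length(u_coord), ncol = length(s_coord))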
The estimated coefficient is now stored in the variable smcoef, which is then plotted in
Figure 6.4. The x-axis corresponds to the first 20 weeks of the year, while the y-axis
corresponds to the last 32 weeks of the year. For interpretation purposes, recall that the
excess mortality in the first 10-12 weeks of the year was probably not affected by COVID-19.
Strong effects start to be noticed somewhere between week 13 and week 20 of the year.
A close inspection of Figure 6.4 indicates that the strongest effects appear in the right
bottom corner of the graph (note the darker shades of red). This is to be expected, as these
are the associations between the observations in week 20 and observations immediately after
week 20. This means that states that had a high excess mortality in weeks 18-20 had a high
FIGURE 6.4: Smooth estimator of the association between US excess mortality in the first
20 weeks and last 32 weeks of 2020 using FoFR implemented via pffr.
excess mortality in weeks 21-24. Similarly, states that had a low excess mortality in weeks
18-20 had a low excess mortality in weeks 21-24. The upper-right corner indicates that
states with a high/low excess mortality in weeks 18-20 had a high/low excess mortality at
the end of the year. There are also two vertical bands of darker blue (negative coefficients)
corresponding to weeks 2-7 and weeks 12-17. The darker blue band for weeks 12-17 indicates
that states that had a higher than average initial excess mortality after COVID-19 started
to affect excess mortality, tended to have lower than average excess mortality for the rest
of the year. Similarly, states that had a lower initial excess mortality between weeks 12-17
tended to have a higher than average excess mortality for the rest of the year.
The darker blue in the left-upper corner of the graph indicates that states that had a
higher than average mortality between weeks 3-7 had a lower excess mortality towards the
end of the year. This finding is a little surprising and may require additional investigation.
However, it is unlikely that excess mortality in the first 3 to 5 weeks was affected by COVID-
19. This indicates that these associations may be due to other factors, such as the demography
or geography of different states or territories. A light blue vertical band in the middle of
the plot corresponds to very small or zero effects indicating that excess mortality between
weeks 8 to 12 is not strongly associated with excess mortality after week 20.
Our next steps are to extract the fitted values Ŵ_i(s_j) = f̂_0(s_j) + ∫_U X_i(u)β̂(s_j, u)du and the residuals ε̂_i(s_j) = W_i(s_j) − Ŵ_i(s_j). Both of these results are stored as 52 × 32 dimensional matrices, where every row corresponds to a state, i, and every column corresponds to a week, j, between 21 and 52.
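A possible sketch of this step, assuming that predict.pffr returns fitted values in the same wide (52 × 32) format as the outcome matrix, is:

#Fitted values and residuals in wide format (rows = states, columns = weeks 21-52)
fitted_values   <- predict(m1)
residual_values <- Wd[, 21:52] - fitted_values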
FIGURE 6.5: Predicted values and residuals for the FoFR regression of US excess mortality
in the last 32 weeks on the first 20 weeks of 2020 using pffr.
Figure 6.5 displays the fitted values in the top panel and the residuals in the bottom
panel. The x-axis in both plots represents a week number starting from May 23, 2020
(indicated as week 21 of the year). The y-axis corresponds to a state or territory number.
The fitted values (top plot) indicate that for most states there is a substantial increase
from week 21 to week 52 of the year; note the shades of blue changing to shades of red as
time passes (moving from left to right in the fitted values panel). This is consistent with
the estimated intercept f0 (sj ) displayed in Figure 6.3. The residual plot is also useful to
check whether the assumptions about residuals i (sj ) are reasonable. First, it seems that
residuals are reasonably well centered around 0, though some residual correlations seem to
persist after model fitting. This can be seen as persistent shades of a color for individual
states (along the rows of the image). Also, some of the residuals seem to be particularly large, in the range of −100 to 100, which are very large values for weekly excess mortality
per one million individuals in the US.
FIGURE 6.6: Observed and predicted weekly excess mortality for four states: New Jersey,
Louisiana, California, and Maryland. Observed data are shown as lines for the first 20 weeks
and as dots for the last 32 weeks of 2020. The blue vertical line indicates the threshold
between “past” and “future.” The dark red line indicates the prediction of the data based
on model (6.1) implemented in pffr.
Figure 6.6 displays prediction and actual data for four states: New Jersey, Louisiana,
California, and Maryland. Data used for prediction is shown as a continuous line before
May 23, 2020 while the predicted data are shown as dots of the same color after May 23,
2020. The model prediction is indicated as a dark red solid line starting from May 23, 2020.
For these four states the predictions and the real data look reasonably close and are on
the same scale. For New Jersey the predictions immediately after May 23, 2020 tend to be
pretty close to the actual data and capture the down-trend in excess mortality. The model
correctly captures the increase in mortality towards the end of the year, though the observed
excess mortality tends to be consistently higher than the predicted excess mortality. For
Louisiana the prediction model considerably underestimates the observed data during July
and August and slightly overestimates it during October, November and December. In
general, when predictions do not match the observations they tend to under or overshoot
for at least a month or so. This raises questions about the potential residual correlations.
Moreover, the second panel in Figure 6.5 indicated that some of the residuals are very
large with differences as large as 100 weekly excess deaths per one million individuals. Here
we take a closer look at the estimated residuals ε̂_i(s_j) = W_i(s_j) − Ŵ_i(s_j), which are stored in the residual_values variable.
these residuals as well as some of the larger residuals irrespective of whether or not their
distribution is normal.
Figure 6.7 provides a more in-depth analysis of the residuals. The top-left panel provides
a QQ-plot of the residuals relative to a normal distribution. This indicates a reasonably
symmetric distribution with much heavier tails than a normal. We now investigate where
the large residuals originate and identify that the large negative residuals can be traced to
the state of North Carolina. The top-right panel displays the data for North Carolina and
provides an insight into exactly why these residuals occurred. Note that the weekly excess
mortality in the state of North Carolina remained close to zero in the last part of the year
and even dipped substantially below zero in the last month of the year. Based on the
data from the other states, the predicted values went steadily up during the last part of the
year. Many of the large positive residuals can be traced to the states of North and South
Dakota (the two panels in the second row). In contrast to North Carolina, both these states
had a strong surge in the weekly excess mortality in the last part of the year. These surges
were much larger than what was anticipated by the model.
The results displayed both in Figures 6.6 and 6.7 show that under- and over-prediction
tend to happen for several weeks and even months, which raises questions about whether
residuals can be assumed to be independent. Figure 6.8 displays the estimated residual
correlations as a function of week number from May 23, 2020 (first week when model
predictions are conducted). This is obtained by using the cor function applied to the variable residual_values, which has 52 rows corresponding to states and territories and 32
columns corresponding to weeks after May 23, 2020. As anticipated from the visual in-
spection of results, strong positive and negative correlations can be observed. Residuals
for weeks 25 to 35 tend to be positively associated with residuals for weeks 25 to 35 and
negatively associated with residuals for weeks 40 to 50. This indicates that if the model
under/over predicts for one week between weeks 25 and 35, then it will tend to under/over
predict for multiple weeks in this period and it will tend to over/under predict for weeks
40 to 50.
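A minimal sketch of this computation, using residual_values as constructed above, is:

#Correlation of residuals across weeks (32 x 32 matrix)
res_cor <- cor(residual_values)
fields::image.plot(21:52, 21:52, res_cor, xlab = "Week", ylab = "Week")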
So, we have found that residuals have a relatively symmetric distribution, with much
heavier tails than the normal distribution, large absolute deviations that correspond to
particular states, and very strong local correlations that persist for a few months. Essentially,
almost all model assumptions are rejected, which raises questions about the validity of
inference for these models. This is an open problem in functional data analysis that is
rarely acknowledged. One potential solution could be to consider the bootstrap of sampling
units (in our case states and territories), but it is not clear whether this is a good approach
in general. Another potential solution could be to model the residuals, but this would
substantially increase computational complexity.
FIGURE 6.7: Residual analysis and lack of fit exploration for weekly excess mortality anal-
ysis using pffr. Top-left panel: QQ-plot analysis of the estimated residuals for model (6.1)
applied to weekly excess mortality in the US. Other panels: observed and predicted data
for three states (North Carolina, North Dakota, South Dakota) with the largest residuals.
We now investigate the time-varying effect of state population size on the observed weekly excess mortality.
Model (6.1) can easily be extended to the model
\[
W_i(s) = f_0(s) + P_i f_1(s) + \int_U X_i(u)\,\beta(s, u)\,du + \epsilon_i(s) \;, \qquad (6.5)
\]
where Pi is the population size of state or territory i. Here f1 (·) is the time-varying effect
of the population size. This effect is modeled using penalized splines, like all other effects,
using a similar basis expansion and corresponding penalties.
The function pffr seamlessly incorporates such effects using the following code struc-
ture (note the minimal change to code):
FIGURE 6.8: Residual correlations for the model (6.1) applied to weekly US excess mortality
data using pffr.
#Fit pffr with one functional predictor and one time-varying effect
m2 <- pffr(W ~ ff(X, xind = s) + pop_state_n, yind = t)
This code can now be used to extract estimators of interest from the model fit m2 using approaches similar to those shown for the model fit m1. Below we show how to extract the time-
varying effect and standard error for the population size variable.
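A possible sketch of this extraction is shown below; the position of the population size term inside coef(m2) and its column names are assumptions.

m2_coef <- coef(m2)
#Assumed: the time-varying effect of pop_state_n is the second smooth term
pop_df          <- m2_coef$smterms[[2]]$coef
pop_size_effect <- pop_df$value          #point estimator of f1(s)
pop_size_se     <- pop_df$se             #pointwise standard errors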
The point estimator for f_1(s) is contained in the variable pop_size_effect and its standard error is contained in the variable pop_size_se. Figure 6.9 displays the point estimators
and 95% pointwise confidence intervals for f1 (s) as a function of time from the week of May
23, 2020 (week 21 of the year). The effect corresponds to one million individuals. The shape
of the effect indicates that states with a larger population had larger excess mortality in July
FIGURE 6.9: Time-dependent effect of population size on US weekly excess mortality after
May 23, 2020 using pffr. This is the effect for every one million additional individuals.
and August and lower excess mortality in November and December. However, the effects
for both periods are relatively small compared to some of the other effects. Of course, one
could explore other characteristics of states including geography, implementation of mitiga-
tion policies, or population density. Such variables can be added either as time-varying or
fixed predictors in pffr; see the help file for the pffr function for additional details.
Note that we have conducted all analyses for a particular time point, May 23, 2020.
However, this choice is arbitrary and the same analyses can be conducted at any other time
point during the year. An analysis conducted at all time points provides a dynamic FoFR
regression [43, 46, 101, 126, 127].
There are two key differences between the CONTENT data and the US COVID-19 excess mortality data. First, the CONTENT data collected for each individual are sparse both before
and after the time when prediction is conducted. Second, instead of only including one
functional predictor, we are now including the z-score of length and weight in the first
100 days, both of which are functional predictors. There are many ways to handle sparse
functional data. Here we follow a simple technique that combines smoothing the data before
and after the time when prediction is conducted using face::face.sparse and then using
refund::pffr to conduct regression of these smooth estimates evaluated on a regular grid.
This approach is computationally fast, though more research may be necessary to address
estimation accuracy especially in areas with very sparse observations.
Conceptually, the outcome functions are Wi : {101, 103, . . . , 701} → R, where Wi (sj ) is
the z-score for length (zlen) for child i on day sj . The predictors are Xi1 : {1, . . . , 100} → R
and Xi2 : {1, . . . , 100} → R, where Xi1 (u) and Xi2 (u) are the z-score for length (zlen) and
z-score for weight (zwei) for child i on day u, respectively. We are interested in conducting
the following function-on-function regression (FoFR)
\[
W_i(s_j) = f_0(s_j) + \int_U X_{i1}(u)\,\beta_1(s_j, u)\,du + \int_U X_{i2}(u)\,\beta_2(s_j, u)\,du + \epsilon_i(s_j) \;, \qquad (6.6)
\]
where ε_i(s_j) ∼ N(0, σ²_ε) are independent random noise variables. Of course, we do not
observe Wi (sj ), Xi1 (u), Xi2 (u) and we will use estimates of these functions obtained from
sparse noisy data. An important point in our example is that only information from the past
(observations before day 100 from birth) is used to reconstruct the smooth functions in the
past (Xi1 (u) and Xi2 (u) for u < 100) on an equally spaced grid. Similarly, only information
from the future (observations after day 100 after birth) is used to reconstruct the smooth
functions in the future (Wi (sj ), sj ≥ 100) on an equally spaced grid. Once the functions
are estimated on equally spaced grids, the same fitting approach described in Section 6.2
can be used to fit model (6.6).
Below we show how to reorganize the CONTENT data into the required format
for conducting smoothing using face::face.sparse. The variables content_zlen_old, content_zwei_old, and content_zlen_new contain the CONTENT data corresponding to zlen and zwei data before day 100 and zlen data after day 100, respectively. Each one of these sparse data sets is then smoothed using sparse functional smoothing implemented in face.sparse and the results are stored in fpca_zlen_old, fpca_zwei_old, and fpca_zlen_new. These variables contain the scores and the estimated eigenvalues necessary
to reconstruct the smooth trajectories Xi1 (u), Xi2 (u), and Wi (s) in this order.
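A minimal sketch of the smoothing step is shown below; it assumes that each content_* object is a data frame with the columns y, argvals, and subj required by face::face.sparse, and that the estimation grids match the description above.

library(face)
#Sparse FPCA for the predictors (first 100 days) and the outcome (days 101-701)
fpca_zlen_old <- face.sparse(content_zlen_old, argvals.new = 1:100)
fpca_zwei_old <- face.sparse(content_zwei_old, argvals.new = 1:100)
fpca_zlen_new <- face.sparse(content_zlen_new, argvals.new = seq(101, 701, by = 2))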
After performing sparse FPCA, the estimated z-scores of length and weight for each day of the first 100 days are stored in zlen_old_it and zwei_old_it, respectively, both of which are matrices with 100 columns (one column for each day from birth). For the outcome, we obtain the estimated z-scores of length after day 100 every other day, that is, days 101, 103, . . . , 701. Therefore, the outcome is stored in zlen_new_it, a matrix with 301
columns. All these matrices have 197 rows, each row corresponding to one child. At this
point we have everything we need to run a standard FoFR model with outcome and pre-
dictor functions observed on an equally spaced grid of points.
The syntax to implement an FoFR with two functional predictors is similar to the one
used for a single predictor, as introduced at the beginning of Section 6.3. Specifically, the
ff function is used to specify an FoFR term. Note that the pffr function includes a func-
tional intercept by default. After fitting the model, the coefficients can be extracted from
the fitted object using the coef() function. Notice that here we only show the code to fit
FoFR and extract coefficients, since the remaining steps are similar to those described in
the COVID-19 example.
#Fit PFFR
m_content <- pffr(zlen_new_it ~ ff(zlen_old_it, xind = xind) +
                    ff(zwei_old_it, xind = xind), yind = yind)
#Extract coefficients
allcoef <- coef(m_content)
FIGURE 6.10: Smooth estimator of the association between the z-score of length on day 101 or later and the z-score of length (left panel) and weight (right panel) in the first 100
days in the CONTENT study using FoFR implemented via pffr.
The estimated coefficients are plotted using the image.plot function in the fields
package [217] and are shown in Figure 6.10. The left panel shows the estimated association
between the z-score for length after day 101 (y-axis) and the z-score of length before day 100 (x-axis). The right panel displays the estimated association between the z-score of
length after day 100 (y-axis) and the z-score of weight before day 100 (x-axis).
We first focus on the left panel, which displays the estimated coefficient β1 (s, u), which
quantifies the association between the z-score for length before and after day 100. The z-
score of length at around day 100 has the strongest association with the z-score of length
after 100 days, and this association is consistent across days after day 100. This is indicated
by the red vertical band between day 90 and 100 after birth. The shades of red get darker
closer to day 100 indicating that the length of the baby closer to the time when prediction
is made is a strong predictor of future length. The fact that the red vertical band extends to
day 90 seems to indicate that sustained high values of length before day 100 may improve
prediction above and beyond the length of the baby at day 100. This is not surprising, as
day 100 is the closest day to the days after day 100 and the estimated trajectories are quite
smooth in the neighborhood of an observation. It may be surprising that there is a large
vertical band of negative estimated coefficients around day 40. The coefficients are smaller
(−0.01 compared to 0.05) and some may not be statistically significant. But results seem to
imply that a baby who is tall at day 100 and tall at day 40 is predicted to be shorter than
a baby who is as tall at day 100 but shorter at day 40. This seems to also make sense as
this would capture faster growth trajectories. This type of analysis suggests the possibility
of a more in-depth analysis of individual trajectories.
The right panel displays the estimated coefficient β2 (s, u), which quantifies the associa-
tion between the z-score for weight before day 100 and the z-score for length after day 100.
This surface has many similarities to the surface for β1 (s, u). This indicates that individuals
who are heavier around day 100 tend to be longer in the future. Some of these associations may be due to the correlation between height and weight and indicate that babies who are
heavier at day 100 will tend to be taller later on even if they are as long at day 100. Just as
in the case of the β1 (s, u) surface, there is a blue vertical band around day 40 from birth,
where ε_i(s) are independent identically distributed N(0, σ²_ε) random variables with s ∈ S. In our example, the functional predictor X_i(u), u ∈ U = {1, . . . , 20}, and the response W_i(s), s ∈ S = {21, . . . , 52}, have time in weeks as their argument, though their domains do not overlap.
Recall that the refund::pffr approach to fitting FoFR models involves applying a
bivariate spline basis to β(s, u), then approximating the term ∫_U X_i(u)β(s, u)du numerically.
By default refund::pffr uses a tensor product smooth of marginal spline bases to estimate
β(s, u), though other bivariate bases could be used (e.g., thin plate regression splines). Below
we illustrate how this approximation is implemented.
\[
\begin{aligned}
W_i(s) &= f_0(s) + \int_U X_i(u)\,\beta(s, u)\,du + \epsilon_i(s) \\
&\approx f_0(s) + \sum_{l} q_l X_i(u_l)\,\beta(s, u_l) + \epsilon_i(s) && \text{Numeric approximation} \\
&= f_0(s) + \sum_{l} q_l X_i(u_l) \sum_{k_1=1}^{K_1} \sum_{k_2=1}^{K_2} \beta_{k_1 k_2} B_{k_1}(s) B_{k_2}(u_l) + \epsilon_i(s) \;. && \text{Spline basis}
\end{aligned}
\]
The second line is the numeric approximation to the integral, where ql are the Riemann sum
weights (length of the interval where the integral is approximated) and ul are the points
where the components of the Riemann sum are evaluated. The third line is obtained by
expanding β(s, ul ) in a tensor product of marginal spline bases.
To estimate this model using mgcv::gam we first create a long format data matrix with
the elements needed to fit the model. Specifically, each row will contain a single functional
response {Wi (s), s = 21, . . . , 52}, the entire functional predictor {Xi (u) : u = 1, . . . , 20}, as
well as matrices associated with the domain of the functional predictor u ∈ U = {1, . . . , 20},
of the functional response, s ∈ S, and the quadrature weights, ql , multiplied element-wise
by the functional predictor Xi (ul ) for numeric approximation.
An important insight is that the component Σ_{k1=1}^{K1} Σ_{k2=1}^{K2} β_{k1 k2} B_{k1}(s) B_{k2}(u_l) of the model is a standard bivariate spline and the quantities q_l X_i(u_l) can be viewed as additional
linear terms that multiply the linear spline decomposition. We will show how to construct
this structure in the mgcv package using the by= option. This is a crucial technical detail
that is used throughout this book to conduct functional regression using software originally
designed for semiparametric smoothing.
First, let W , X be n × |S| and n × |U | dimensional matrices containing the functional
response and predictor, respectively, with each row corresponding to a subject i = 1, . . . , n.
In the R code below the matrix W is identified as W and the matrix X is identified as X.
Next, let vec(A) denote the vectorization of a matrix A (stacking columns of the matrix).
We construct the n|S| × 1 dimensional vector wv = vec(W t ), which contains the stacked
outcome vectors in the order: study participants and then observations within study par-
ticipant. This vector is identified as wv.
We also construct 1|S| , which is a column vector of ones with length equal to the number
of observations of the functional outcome, |S|. Recall that in our example |S| = 32. The
next step is to construct the functional predictor matrix X v = X ⊗ 1|S| where ⊗ denotes
the Kronecker product. The matrix X is n × |U | dimensional and the matrix X v is n|S| ×
|U | dimensional. The matrix Xv is referred to as Xv in the code below. Next, let q =
[q1 , . . . , q|U | ]t be the |U | × 1 dimensional column vector of quadrature weights (ql in the
model above) and L = 1n|S| ⊗ qt be the n|S| × |U | dimensional matrix where each row is
equal with the vector q, which contains the quadrature weights. Here 1n|S| is the column
vector of ones of length n|S|. This matrix is referred to as L in the code below. Then we
construct the n|S| × |U| dimensional matrix XLv = Xv ⊙ L, where ⊙ denotes the element-wise product; this matrix is referred to as XvL in the code. It will be supplied to the by=
argument, as mentioned in the paragraph above. That is, the matrix XLv corresponds to
the ql Xi (ul ) components of the FoFR model.
To complete the data set up we need two additional matrices for the definition of the
bivariate spline basis and one vector of arguments for defining the population mean effect
f0 (s). Specifically, we construct the n|S| × |U | dimensional matrix U v = 1n|S| ⊗ Ut , where
Ut = (u1 , . . . , u|U | ) is the vector of arguments for the predictor function. The matrix U v has
n|S| rows, where each row consists of Ut , the 1 × |U | dimensional row vector containing the
domain of the functional predictor. It is referred to as Uv in the code below. The next step
is to build the n|S| × 1 dimensional vector sv = 1n ⊗ S, where 1n is the n × 1 dimensional
column vector of ones and S = (s1 , . . . , s|S| )t is the |S| × 1 column vector of arguments
for the outcome function. This vector is obtained by repeating the vector S denoted by sv
in the code below n times, where n is the number of functions. The vector is labeled "sv"
in the data frame. We also construct the n|S| × |U | dimensional matrix S v = sv ⊗ 1t|U |
with |U | identical columns, and each column being equal to sv , the n|S| × 1 dimensional
vector that contains the domain of the functional response repeated n times. This matrix
is referred to as Sv and is recorded as "Sv"=I(Sv) in the data frame.
The code below constructs these matrices and puts them in a data frame. To be consis-
tent with refund::pffr defaults we use Simpson’s rule for creating quadrature weights.
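A minimal sketch of this construction is shown below. For simplicity the quadrature weights here are simple Riemann weights (interval length one) rather than Simpson's rule, and the object names follow the description above; they are assumptions rather than the book's exact code.

n     <- nrow(W); S_len <- ncol(W); U_len <- ncol(X)
svec  <- 21:52                                  #arguments of the outcome functions
uvec  <- 1:20                                   #arguments of the predictor functions
wv    <- as.vector(t(W))                        #stacked outcomes: subject, then week
Xv    <- X[rep(1:n, each = S_len), ]            #n|S| x |U| matrix, X kron 1_{|S|}
q     <- rep(1, U_len)                          #Riemann quadrature weights
L     <- matrix(q, nrow = n * S_len, ncol = U_len, byrow = TRUE)
XvL   <- Xv * L                                 #element-wise product for the by= argument
Uv    <- matrix(uvec, nrow = n * S_len, ncol = U_len, byrow = TRUE)
sv    <- rep(svec, times = n)                   #outcome arguments repeated n times
Sv    <- matrix(sv, nrow = n * S_len, ncol = U_len)
df_gam <- data.frame("wv" = wv, "sv" = sv, "Xv" = I(Xv), "L" = I(L),
                     "XvL" = I(XvL), "Uv" = I(Uv), "Sv" = I(Sv))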
These objects are all stored in a single data frame for fitting the model directly using mgcv, with the matrices stored as objects of class AsIs via the I() function. Having set up the required data frame for model fitting, estimation is done using the following call to mgcv::gam().
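A possible sketch of this call, with argument values taken from the description that follows and object names taken from the data frame sketch above, is:

library(mgcv)
fit_mgcv <- gam(wv ~ s(sv, bs = "ps", k = 20, m = c(2, 1)) +
                  te(Uv, Sv, by = XvL, bs = "ps", k = c(5, 5), m = c(2, 1)),
                method = "REML", data = df_gam)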
We now describe the syntax and how it relates to the elements of our FoFR model.
Note that specific arguments, which are not the mgcv::gam() defaults, were chosen so that
the results are identical to the refund::pffr() fit. The syntax above specifies that the
functional response, w_v (wv), is the outcome variable. The two components of the linear predictor, f_0(s) and ∫_U X_i(u)β(u, s)du, are specified by the calls to the s() and te() functions, respectively.
First consider the specification of f_0(s). The syntax s(sv, bs = "ps", k = 20, m = c(2, 1)) adds to the linear predictor a smooth function of sv, f_0(s), which is modeled
using penalized B-splines (bs = "ps") with 20 (k = 20) knots. The B-splines are of order 2
and have a first-order difference penalty indicated by (m=c(2, 1)). This part of the function
is standard in nonparametric smoothing using mgcv.
Next, consider the specification of ∫_U X_i(u)β(u, s)du. The te() function specifies a ten-
sor product smooth of marginal bases, with the marginal bases defined by the first unnamed
arguments to the function call. In this case, we specify a bivariate smooth of Uv, the func-
tional domain of the predictor (U v ), and Sv, the functional domain of the outcome (S v ).
The marginal bases are penalized B-splines (bs="ps", argument is recycled for the second
basis), each with 5 knots (k=c(5, 5)). Of course, these arguments could be changed and
one could specify a different number of knots and type of spline for each direction in the
bivariate space. Again, the B-splines are of order 2 and have a first-order difference penalty
applied, indicated by m=c(2, 1). The summation over the domain of the functional predictor, multiplying the coefficient surface β(u, s) by the covariate and quadrature weights, Σ_l q_l X_i(u_l), is specified by the option by = XLmat. Smoothing parameter selection is done using restricted maximum likelihood (method = "REML") [121, 228, 317]. As this was implemented in mgcv, one can easily use any of the other criteria available for estimation of the smoothing parameters. Together these components add the desired term to the linear predictor, Σ_l q_l X_i(u_l) Σ_{k1=1}^{K1} Σ_{k2=1}^{K2} β_{k1 k2} B_{k1}(s) B_{k2}(u_l).
Extracting the estimated coefficients, f̂_0(s) and β̂(u, s), requires a bit more work than if
one used refund::pffr(), but the exercise is useful for understanding how to extract esti-
mated quantities from mgcv::gam(). We extract the estimated coefficients with estimated
standard errors separately. Recall that the predict method associated with objects obtained
from mgcv::gam() allows for predictions evaluated at: the link scale (type = "link", sum
of the linear predictor), the response scale (type = "response", inverse link function), the
design matrix (type = "lpmatrix"), or the individual components of the linear predictor
(type = "terms"). Here we will use primarily the output from type = "terms", but we
will also illustrate a use of type = "lpmatrix".
First consider f0 (s). We must first construct a dataframe which contains the domain of
the functional response at each point where we wish to evaluate. In our case, it makes sense
to evaluate at each week 21, . . . , 52, though due to the smoothness assumption we could
evaluate the function on a denser or coarser grid if desired. Accordingly, we must create a
dataframe where each row contains the domain of the response, s, at each point we wish
to make predictions on. Denote this vector as spred = [21, . . . , 52]t , which is denoted as S
in the code below. Because we only need the argument tvec to evaluate f0 (s), the other
arguments which must be supplied to predict.gam() (tmat, smat, XLmat) are arbitrary.
The code below does this
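A possible sketch of this step is shown below; the variable names follow the data frame sketch used earlier (the book's own code refers to tvec, tmat, smat, and XLmat), and the values supplied for the arguments other than sv are arbitrary placeholders.

S_pred <- 21:52
df_pred_f0 <- data.frame("sv"  = S_pred,
                         "Uv"  = I(matrix(1, nrow = length(S_pred), ncol = U_len)),
                         "Sv"  = I(matrix(S_pred, nrow = length(S_pred), ncol = U_len)),
                         "XvL" = I(matrix(1, nrow = length(S_pred), ncol = U_len)))
f0_hat <- predict(fit_mgcv, newdata = df_pred_f0, type = "terms", se.fit = TRUE)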
The object f0_hat is then a list with the first element containing a matrix with two columns, corresponding to the predictions for the f̂_0 and β̂ terms evaluated at the values contained in df_pred_f0. In this case, the matrix will have 32 rows. The second element of f0_hat
is a matrix with the same structure, but contains the estimated standard errors associated
with the predictions in the first element.
Next consider how to extract β̂(u, s). To get estimates for this bivariate surface, we need to create a data frame which has all pairwise combinations of u and s that we wish to obtain predictions on. We use an equally spaced grid of length 100 on the domain of both the functional predictor and the response. Specifically, consider the equally spaced grids of points u_pred on [1, 20] with |u_pred| = 100 and s_pred on [21, 52] with |s_pred| = 100. The matrix of all pairwise combinations is what we need to create. Specifically, we want to create the |u_pred||s_pred| × 2 dimensional matrix [1_{|s_pred|} ⊗ u_pred, s_pred ⊗ 1_{|u_pred|}], where 1_{|u_pred|} and 1_{|s_pred|} are both column vectors of ones of length |u_pred| and |s_pred|, respectively. In our case, |u_pred| = |s_pred| = 100, but these choices can vary by application.
This can be conveniently created using the expand.grid function, as we show in the code below when we build the df_pred_beta data frame. In addition, we need to consider the fact that predict.gam() evaluates the entire linear predictor component Σ_l q_l X_i(u_l) Σ_{k1=1}^{K1} Σ_{k2=1}^{K2} β_{k1 k2} B_{k1}(u_l) B_{k2}(s) = Σ_l q_l X_i(u_l) β(u_l, s). As such, to obtain β̂(u, s) we need q_l X_i(u_l) = 1 for u_l ∈ u_pred. This is handled by setting XvL=1 in the call to predict.gam(). What may look like a "computational trick" is actually a sine qua non procedure for extracting and conducting inference on the functional parameter. Since we do not care here what values f_0(s) is evaluated at, the choice is arbitrary. The code to obtain the predictions as described above is presented in the code chunk below.
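A possible sketch is shown below; the exact behavior of predict.gam() with matrix covariates in newdata can depend on the mgcv version, so the result should be checked against the refund::pffr() output as done in Figure 6.11.

u_pred  <- seq(1, 20, length.out = 100)
s_pred  <- seq(21, 52, length.out = 100)
grid_us <- expand.grid(u = u_pred, s = s_pred)       #all pairwise combinations
df_pred_beta <- data.frame("sv"  = grid_us$s,
                           "Uv"  = I(matrix(grid_us$u, ncol = 1)),
                           "Sv"  = I(matrix(grid_us$s, ncol = 1)),
                           "XvL" = I(matrix(1, nrow = nrow(grid_us), ncol = 1)))
beta_hat <- predict(fit_mgcv, newdata = df_pred_beta, type = "terms", se.fit = TRUE)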
The predictions obtained from predict.gam contain the estimated coefficient in long
format. Figure 6.11 presents the results obtained from calling mgcv::gam directly (left panel)
versus the results from refund::pffr() (right panel). The resulting coefficient surfaces are
identical, as expected given that we used the default settings in refund::pffr. Note that
there is a visual difference in the estimated surfaces due to the fact that the predict method
associated with refund::pffr() by default evaluates on a coarser grid than we used for the
mgcv::gam predictions (40 × 40 versus 100 × 100). All parameters estimated by the models
are identical.
FIGURE 6.11: Estimated β(u, s) in the FoFR model (6.1) for the regression of weekly all-
cause excess mortality in the US in the last 32 weeks on the first 20 weeks of 2020. Results
obtained using mgcv::gam() directly (left panel) and refund::pffr() (right panel).
using a simple call to the coef method associated with refund::pffr. Using the code provided in Section 6.4 we can obtain pointwise standard errors for β̂(s, t), which can be used for inference. Here we first discuss unadjusted pointwise inference and then discuss
correlation and multiplicity adjusted (CMA) inference for the coefficient surface. We will
follow closely the methodological approaches described in Section 2.4 and we will point out
specific methods and software adjustments required for FoFR inference.
The matrix of basis functions evaluated at the prediction grid is |u_pred||s_pred| × (K1 K2) dimensional, where each row corresponds to a point where prediction is conducted. Therefore, β̂(u_pred, s_pred) is a |u_pred||s_pred| × 1 dimensional vector, where each entry corresponds to an estimated functional parameter at a location (u, s) ∈ u_pred × s_pred. For simplicity of presentation, let A = [B_1(u) ⊗ B_2(s)]_{(u,s) ∈ u_pred × s_pred}. It follows that for the bivariate grid of points u_pred × s_pred, Var{β̂(u_pred, s_pred)} = A Var(β̂) A^t. The diagonal
of this covariance matrix is all that is needed for constructing unadjusted pointwise confi-
dence intervals. This is provided directly by coef.pffr and predict.gam (when selecting
type="terms"). We provide code below showing the implementation using refund::pffr.
The data frame beta_hat_pffr created by the code above can then be passed to
ggplot2::ggplot for plotting, or transformed to matrix format and plotted using, for
example, fields::image.plot [217]. Figure 6.12 plots the unadjusted pointwise p-values (left panel) along with lower and upper bounds for the 95% confidence intervals (middle and right panels, respectively). First consider the plot of p-values in the left panel of Figure 6.12.
Regions that are blue represent areas where the pointwise p-value is less than 0.05, with
white regions indicating no statistical significance at the level α = 0.05. The regions which
are statistically significant indicate that weeks ≈ 18-20 are associated with future death
rates in the subsequent weeks (≈ 18-25) and weeks toward the end of the year (≈ 50-52).
In addition, deaths between weeks 8-13 and 15-18 are significantly associated with future
deaths over the majority of the follow-up period. Interestingly, if we look at the estimated
coefficient surface presented in Figure 6.11, we can see that the direction of the effect is
reversed (positive for weeks 8-13, negative for weeks 15-18). This suggests a cyclic nature
to the death rate, consistent with our exploratory findings.
We urge caution in interpreting statistical significance in this data application due to
the autocorrelation of residuals highlighted in Section 6.3.1. This autocorrelation violates
key assumptions of the model and may impact inference.
FIGURE 6.12: Unadjusted pointwise inference of FoFR: estimated p-values (left panel) along with lower (middle panel) and upper (right panel) bounds for the 95% pointwise confidence intervals for β̂(u, s) from the FoFR model.
We can obtain the same results using the mgcv approach discussed in Section 6.4. Specif-
ically, the predict.gam() function returns pointwise standard errors which can be used to
construct pointwise 95% Wald confidence intervals. The code below constructs both point-
wise confidence intervals and p-values using mgcv.
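A possible sketch using the mgcv-based predictions obtained earlier (beta_hat from predict.gam() with type = "terms" and se.fit = TRUE) is shown below; the assumption is that the second column of the terms matrix corresponds to the te() term.

est     <- beta_hat$fit[, 2]
se      <- beta_hat$se.fit[, 2]
p_unadj <- 2 * pnorm(-abs(est / se))            #unadjusted pointwise p-values
ci_low  <- est - qnorm(0.975) * se              #lower 95% Wald bound
ci_high <- est + qnorm(0.975) * se              #upper 95% Wald bound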
We now discuss correlation and multiplicity adjusted (CMA) inference, including confidence intervals based on parameter simulations introduced in Section 2.4.2 and joint confidence intervals based on the max absolute statistic introduced in Section 2.4.3. We also discuss some potential pitfalls associated with using the PCA or SVD decomposition of the covariance operator.
for all values u ∈ u_pred and s ∈ s_pred. These confidence intervals can be inverted to form both pointwise and global CMA p-values. Consider first pointwise confidence intervals and fix (u, s). We can simply find the smallest value of α for which the above interval does not contain zero (the null hypothesis is rejected). We denote this probability by p_pCMA(u, s) and refer to it as the pointwise correlation and multiplicity adjusted (pointwise CMA) p-value. To calculate the global correlation and multiplicity adjusted (global CMA) p-value, we define the minimum α level at which at least one confidence interval β̂(u, s) ± q(C_β̂, 1 − α)SE{β̂(u, s)} for u ∈ u_pred and s ∈ s_pred does not contain zero. We denote this p-value by p_gCMA(u_pred, s_pred). As discussed in Section 2.4.4, it can be shown that
\[
p_{gCMA}(u_{pred}, s_{pred}) = \min\{p_{pCMA}(u, s) : u \in u_{pred},\; s \in s_{pred}\} \;.
\]
Calculating both the pointwise and global CMA adjusted p-values requires only one
iteration of the simulation b = 1, . . . , B, though a large number of simulations, B, may
be required to estimate extreme p-values. The entire procedure is fairly fast as it does not
involve model refitting, simply simulating from a multivariate normal of reasonable dimen-
sion, in our case K1 K2 dimensional. Below we show how to conduct these simulations and
calculate the CMA adjusted confidence intervals and global p-values. An essential step is to extract the covariance matrix of β̂ from the mgcv fit. This is accomplished in the expression Vtheta <- vcov(fit_mgcv)[inx_beta, inx_beta]. Recall that a similar approach was used in Chapter 4, which also depended intrinsically on extracting the covariance of the spline coefficients, Var(β̂). The first code chunk focuses on obtaining the CMA confidence intervals at a particular confidence level, in this case α = 0.05.
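A possible sketch of this first code chunk is shown below; Amat is obtained from the lpmatrix of the prediction data frame, and the column-selection pattern for the te() coefficients is an assumption.

library(mvtnorm)
set.seed(2024)
B        <- 10000
lp       <- predict(fit_mgcv, newdata = df_pred_beta, type = "lpmatrix")
inx_beta <- grep("te(", colnames(lp), fixed = TRUE)   #columns of the te() term
Amat     <- lp[, inx_beta]
theta    <- coef(fit_mgcv)[inx_beta]
Vtheta   <- vcov(fit_mgcv)[inx_beta, inx_beta]
beta_est <- as.vector(Amat %*% theta)
beta_se  <- sqrt(rowSums((Amat %*% Vtheta) * Amat))   #pointwise standard errors
#Simulate spline coefficients and compute the max absolute deviation statistic
theta_sim <- rmvnorm(B, mean = theta, sigma = Vtheta)
beta_sim  <- theta_sim %*% t(Amat)                    #B x (number of grid points)
dev_mat   <- sweep(abs(sweep(beta_sim, 2, beta_est, "-")), 2, beta_se, "/")
dmax      <- apply(dev_mat, 1, max)                   #max statistic for each simulation
q_cma     <- quantile(dmax, 0.95)                     #CMA critical value at alpha = 0.05
cma_low   <- beta_est - q_cma * beta_se
cma_high  <- beta_est + q_cma * beta_se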
We now show how to invert these confidence intervals and obtain the global CMA ad-
justed p-value pgCMA (upred , spred ). In practice the p-value we estimate is limited by the
number of bootstrap samples we use.
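A possible sketch, based on the simulation objects constructed above:

#Global CMA p-value: proportion of simulated max statistics exceeding the observed one
Zmax_obs <- max(abs(beta_est / beta_se))
p_gCMA   <- mean(dmax >= Zmax_obs)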
In this example, the p-value pgCMA (upred , spred ) is extremely small and we estimate that
it is < 0.001. We cannot say more unless we consider an increased number of bootstrap
samples, but this resolution of the p-value is sufficient for most purposes. The conclusion is that there is a statistically significant association between historical excess mortality and future excess mortality, even after adjusting for the correlation between tests and multiple comparisons.
We now show how to calculate the pointwise correlation and multiplicity adjusted p-
value ppCMA (upred , spred ). P-values obtained in this fashion are also constrained by the
number of bootstrap samples. The R code below calculates this quantity using the results
from the parametric bootstrap above.
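A possible sketch, again based on the simulation objects above:

#Pointwise CMA p-values: for each grid point, the smallest alpha at which the
#CMA interval excludes zero
p_val_lg <- sapply(abs(beta_est / beta_se), function(z) mean(dmax >= z))
p_val_lg <- matrix(p_val_lg, nrow = length(u_pred), ncol = length(s_pred))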
The p_pCMA(u_pred, s_pred) p-values are stored in the matrix p_val_lg. Figure 6.13 dis-
plays the CMA inference results using the same structure as Figure 6.12 using simulations
from the spline parameter distribution. The left panel of Figure 6.13 now presents the
pointwise CMA adjusted p-values ppCMA (upred , spred ). We find that after adjusting for cor-
relations and multiple comparisons, the regions which remain statistically significant are
generally the same, with the exception that the early portion of the history (weeks 0-10) is
no longer statistically significant, and the late history (weeks 18-20) is no longer associated
with excess mortality at the end of the follow up (weeks 50-52). The middle and right panels
in Figure 6.13 display the CMA 95% confidence intervals for β(upred , spred ).
FIGURE 6.13: Estimated pointwise CMA p-values denoted by p_pCMA(u, s) (left panel) along with CMA lower (middle panel) and upper (right panel) bounds for the 95% confidence intervals for β̂(s, t) for FoFR based on simulations from the distribution of spline coefficients.
where {β̂_b(s, t) : b = 1, . . . , B, s ∈ s_pred, t ∈ t_pred} is the collection of bootstrap estimates over the grid {s ∈ s_pred, t ∈ t_pred}. As with the parametric bootstrap, we then obtain d_b = max_{s ∈ s_pred, t ∈ t_pred} |β̂_b(s, t) − β̂(s, t)|/SE{β̂(s, t)}. This procedure requires extracting the variances of β̂(u_pred, s_pred), but not the entire covariance matrix. However, it requires refitting the model B times, which increases the computational complexity.
The code below illustrates how to perform the non-parametric bootstrap with repeated
model estimation done using refund::pffr. Using pffr makes the non-parametric boot-
strap easy to implement as the data are stored in wide format (as opposed to long format for
mgcv::gam estimation). However, to extract the coefficient predictions on the grid we need
to call the function predict.gam. To do so, we need to modify the variable names in our
data frame which we pass to predict.gam to mirror the naming conventions refund::pffr
uses when it constructs the long format data frame that is passed to mgcv::gam for model
estimation. The code below shows how to do this. Note that internally pffr refers to the predictor domain as S and the response domain as T, while we use U and S to denote the predictor and response domains, respectively.
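A simplified sketch of the bootstrap loop is shown below; it stores the estimated surface from coef() at each iteration rather than calling predict.gam() with renamed variables, and the element names inside coef() are assumptions.

set.seed(2024)
B_boot    <- 1000
n_states  <- nrow(W)
beta_boot <- vector("list", B_boot)
for (b in seq_len(B_boot)) {
  inx_b <- sample(seq_len(n_states), size = n_states, replace = TRUE)
  Wb <- W[inx_b, ]
  Xb <- X[inx_b, ]
  fit_b <- pffr(Wb ~ ff(Xb, xind = s), yind = t)
  beta_boot[[b]] <- coef(fit_b)$smterms[[2]]$coef$value
}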
Once the bootstrap simulations are obtained, calculations proceed similar to the ones
shown for the case when we simulate from the distribution of the spline coefficients. For this
reason, we omit this code chunk. Figure 6.14a and 6.14b display the unadjusted and CMA
inference using non-parametric bootstrap of the max absolute statistic, respectively. Results
are markedly different from the ones obtained using simulations from the spline parameter
distribution. Consider first the unadjusted inference shown in Figure 6.14a. We find that
fewer regions are statistically significant in Figure 6.14a compared to Figure 6.12, which is
based on simulations from the spline parameter distribution. Moreover, Figure 6.14b displays
the pointwise CMA inference based on the nonparametric bootstrap of the max statistic and
should be compared to Figure 6.13, which is based on simulations from the spline coefficients
distribution. Results are remarkably different, indicating that, effectively, no region of the
surface is statistically significant. This is due to increased variability of the estimated surface
when using a non-parametric bootstrap. Indeed, the p-value associated with the global test
using the non-parametric bootstrap is very large, which explains why there are some small
regions which are blue in the left panel of Figure 6.14b. This discrepancy between the
parametric and non-parametric bootstrap results can occur due to non identifiability of the
surface, high variability in smoothing parameter estimation, or heavier than expected tails
of the max statistic distribution compared to what would be obtained from a multivariate
normal distribution. We note that when the distribution of β̂ is reasonably close to a
multivariate normal distribution, the non-parametric bootstrap of the max statistic and
the simulation from the spline coefficients distribution yield similar estimates. When the
smoothing parameter is close to the boundary of 0, then these approaches may give very
different results.
where C_β̂ is the correlation matrix corresponding to the covariance matrix A Var(β̂) A^t. As we discussed in Section 2.4.1, we need to find a value q(C_β̂, 1 − α) such that
\[
P\{-q(C_{\hat{\beta}}, 1-\alpha) \times e \le X \le q(C_{\hat{\beta}}, 1-\alpha) \times e\} = 1 - \alpha \;,
\]
where X ∼ N(0, C_β̂) and e = (1, . . . , 1)^t is the |u_pred||s_pred| × 1 dimensional vector of ones. Once q(C_β̂, 1 − α) is available, we can obtain a CMA (1 − α) level confidence interval for β(u, s) as
\[
\hat{\beta}(u, s) \pm q(C_{\hat{\beta}}, 1-\alpha)\,\mathrm{SE}\{\hat{\beta}(u, s)\} \;, \quad \forall\, u \in u_{pred},\; s \in s_{pred} \;.
\]
(a) Estimated pointwise unadjusted p-values (left panel) along with unadjusted lower (middle panel) and upper (right panel) bounds for the 95% confidence intervals for β̂(s, t) for FoFR based on the non-parametric bootstrap.
(b) Estimated pointwise CMA p-values denoted by p_pCMA(u, s) (left panel) along with CMA lower (middle panel) and upper (right panel) bounds for the 95% confidence intervals for β̂(s, t) for FoFR based on the non-parametric bootstrap of the max statistic.
Luckily, the function qmvnorm in the R package mvtnorm [96, 97] is designed to extract such
quantiles. Unluckily, the function does not work for matrices Cβ that are singular and very
high dimensional. Therefore we need to find a theoretical way around the problem.
Indeed, β̂(u_pred, s_pred) = Aβ̂ has a degenerate normal distribution, because its rank is at most K1 K2, where K1 and K2 are the number of basis functions used in each dimension, respectively. Moreover, we have evaluated β̂(s, t) on a grid of |u_pred| × |s_pred| = 10,000 points, which implies that the covariance and correlation matrices of β̂(u_pred, s_pred) are 10,000 × 10,000 dimensional. To better understand this, consider the case where β(s, t) = s + t + st. Then rank(C_β̂) ≤ 4 for any choice of u_pred and s_pred.
We will use a statistical trick based on the eigendecomposition of the covariance matrix. Recall that if β̂(u_pred, s_pred) has a degenerate multivariate normal distribution of rank m ≤ K1 K2, then there exists a random vector of independent standard normal random variables Q ∈ R^m such that
\[
\hat{\beta}^t(s, t) = Q^t D \;,
\]
with D^t D = A Var(β̂) A^t. If we find m and a matrix D with these properties, the problem is solved, at least theoretically. Consider the eigendecomposition A Var(β̂) A^t = U Λ U^t, where Λ is a diagonal matrix of eigenvalues and U is an orthonormal matrix (U U^t = I_{|u_pred||s_pred|}) with the kth column being the eigenvector corresponding to the kth eigenvalue. Note that all eigenvalues λ_k = 0 for k > m and
\[
A\,\mathrm{Var}(\hat{\beta})\,A^t = U_m \Lambda_m U_m^t \;,
\]
where U_m is the |u_pred||s_pred| × m dimensional matrix obtained by taking the first m columns of U and Λ_m is the m × m dimensional diagonal matrix with the first m eigenvalues on the main diagonal. If we define D^t = U_m Λ_m^{1/2}, then D^t D = U_m Λ_m U_m^t = A Var(β̂) A^t, as required; the construction is implemented in the code below.
library(mvtnorm)
#Get Var(hat(beta))
Vbeta <- Amat %*% Vtheta %*% t(Amat)
#Get eigenvectors and eigenvalues via svd
eVbeta <- svd(Vbeta, nu = ncol(Amat), nv = ncol(Amat))
#Keep only positive eigenvalues
inx_evals_pos <- which(eVbeta$d >= 1e-6)
#Get "m" - dimension of the lower dimensional random vector
m <- length(inx_evals_pos)
#Get associated eigenvectors
U_m <- eVbeta$v[, inx_evals_pos]
#Get D transpose
D_t <- U_m %*% diag(sqrt(eVbeta$d[inx_evals_pos]))
#Get D inverse (generalized inverse)
D_inv <- MASS::ginv(t(D_t))
#Get Q
Q <- t(Amat %*% theta) %*% D_inv
#Get max statistic
Zmax <- max(abs(Q))
#Get p-value
p_val <- 1 - pmvnorm(lower = rep(-Zmax, m), upper = rep(Zmax, m),
                     mean = rep(0, m), corr = diag(1, m))
7
Survival Analysis with Functional Predictors
\[
\lambda_i(t|Z_i) = \frac{f_i(t|Z_i)}{S_i(t|Z_i)} \;, \qquad (7.1)
\]
where f_i(t|Z_i) is the conditional probability density function of the event time and S_i(t|Z_i) = ∫_t^∞ f_i(u|Z_i)du is the probability of surviving at least to time t. Modeling the hazard
function is different from standard regression approaches, which focus on imposing struc-
ture on the mean of the survival time distribution.
The Cox proportional hazards model assumes that
\[
\lambda_i(t|Z_i) = \lambda_0(t) \exp(Z_i^t \gamma) \;, \qquad (7.2)
\]
where λ_0(t) is the baseline hazard, γ = (γ_1, . . . , γ_Q)^t, and exp(γ_q), q = 1, . . . , Q, are the hazard ratios. A value of γ_q > 0, or exp(γ_q) > 1, indicates that larger values of Z_iq correspond to shorter survival times. Cox models focus primarily on estimating the γ parameters
using the log partial likelihood
\[
l(\gamma) = \sum_{i:\,\delta_i = 1} \Big[ Z_i^t \gamma - \log\Big\{ \sum_{j:\, Y_j \ge Y_i} \exp(Z_j^t \gamma) \Big\} \Big] \;, \qquad (7.3)
\]
which does not depend on the baseline hazard, λ0 (t). The consistency and asymptotic
normality of the maximum log partial likelihood estimator were established in [297].
There are many ways to estimate the baseline hazard, but the best known approach is due to Breslow [22]. Assume that an estimator, γ̂, of γ is available, say from maximizing the log partial likelihood (7.3). By treating λ_0(t) as a piecewise constant function between uncensored failure times, the Breslow estimator of the baseline cumulative hazard function Λ_0(t) = ∫_0^t λ_0(u)du is
\[
\hat{\Lambda}_0(t) = \sum_{i=1}^{n} \frac{\delta_i\, I(Y_i \le t)}{\sum_{j:\, Y_j \ge Y_i} \exp(Z_j^t \hat{\gamma})} \;. \qquad (7.4)
\]
Recall that
\[
\lambda_i(t|Z_i) = -\frac{\partial}{\partial t} \log\{S_i(t|Z_i)\} \;,
\]
which allows us to obtain the conditional survival function for the Cox model as
\[
S_i(t|Z_i) = \exp\Big\{ -\int_0^t \lambda_i(u|Z_i)\,du \Big\} = \exp\{ -\exp(Z_i^t \gamma)\,\Lambda_0(t) \} \;.
\]
This indicates that the conditional cumulative distribution function can be estimated by
\[
\hat{F}_i(t|Z_i) = 1 - \exp\{ -\exp(Z_i^t \hat{\gamma})\,\hat{\Lambda}_0(t) \} \;. \qquad (7.5)
\]
This is a result that can be used for predicting survival time given a set of covariates.
Indeed, we can use either the mean or the median of the distribution Fi (t|Zi ) to predict
survival time. There is no closed-form solution for either of these predictors, but they can
be obtained easily using numerical approaches. We will also use the survival functions (7.5)
to conduct realistic simulations that mimic complex observed data.
The Cox proportional hazards model can be implemented in R using, for example, the
coxph function in the survival package [293, 295].
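A minimal sketch of such a fit is shown below; the variable names (time, event, age, BMI, gender, CHD) follow the NHANES data frame described in this section and are assumptions about the exact column names.

library(survival)
#Cox proportional hazards model with scalar predictors
fit_cox <- coxph(Surv(time, event) ~ age + BMI + gender + CHD,
                 data = nhanes_df_surv)
summary(fit_cox)   #hazard ratios exp(gamma_q) with confidence intervals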
FIGURE 7.1: The structure of the NHANES data for six study participants. Each par-
ticipant is uniquely identified by their sequence number (SEQN). Each row displays the
survival outcome information (time, event) and a subset of predictors: age, BMI, gender,
and coronary heart disease indicator (CHD) for one study participant.
6.33. This is a Mexican American female who was 59 years old, had a BMI of 28.3 and did
not have CHD when she entered the study.
Figure 7.2 provides a visualization of the survival data structure for the same six study
participants. The left panel displays the sequence number (SEQN) of each study participant.
The middle panel displays the age, sex, BMI, race, and CHD diagnosis information of
the individual at the time of enrollment into the study. This is a different visualization
of the same information shown in Figure 7.1. The right panel in Figure 7.2 displays the
corresponding survival information. A red bar with an “x” sign at the end indicates that
the study participant died before December 31, 2019. The length of the bar corresponds to
the time between when the study participant enrolled in the study and when they died. For
FIGURE 7.2: The left panel displays the sequence number (SEQN) of each study partici-
pant. The middle panel displays some baseline predictors: age, BMI, gender, and coronary
heart disease indicator (CHD). The right panel displays whether the study participant died
(red line) or was still alive (black line) as of December 31, 2019. If the person died, the
survival time from entering the study is indicated. If the person was still alive, the time
between entering the study and December 31, 2019 (censoring time) is indicated.
example, the first study participant was 80 years old at the time of enrollment and died
3.33 years later. A black bar with a “•” sign at the end indicates that the study participant
was alive on December 31, 2019. For example, the fourth study participant was 59 years old
at the time of enrollment and was alive 6.33 years later on December 31, 2019. Traditional
survival analysis is concerned with analyzing such data structures where risk factors are
scalar (e.g., age, BMI). Functional survival data analysis is also concerned with accounting
for high-dimensional predictors, such as the minute-level accelerometry data. The next
section describes the data structure for survival data with a combination of traditional and
functional predictors.
FIGURE 7.3: An example of the survival data structure in R where accelerometry data is
a matrix of observations stored as a single column of the data frame.
FIGURE 7.4: The left panel displays the sequence number (SEQN) of each study par-
ticipant. The middle panel displays the average physical activity data from midnight to
midnight for six study participants (age at time of study enrollment also shown). The right
panel displays whether the study participant died (red line) or was still alive (black line)
as of December 31, 2019. If the person died, the time from the physical activity study is
indicated. If the person was still alive, the time between study enrollment and December 31, 2019 (censoring time) is indicated.
and event are columns in the data frame and can be accessed as nhanes_df_surv$SEQN and nhanes_df_surv$event, respectively. The 8,713 × 1,440 dimensional accelerometry data matrix is stored as an entry in the data frame and can be accessed as nhanes_df_surv$MIMS.
This can be done using the I() function, which inhibits the conversion of objects. Addi-
tional functional predictors, if available, could be stored as additional single columns of the
data frame. This is the standard input format for the R refund package.
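For example, a minimal sketch, where MIMS_mat is a hypothetical 8,713 × 1,440 matrix ordered to match the rows of the data frame:

#Store the accelerometry matrix as a single column of the data frame;
#I() prevents the matrix from being split into 1,440 separate columns
nhanes_df_surv$MIMS <- I(MIMS_mat)
dim(nhanes_df_surv$MIMS)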
Figure 7.4 provides a visual representation of the data described in Figure 7.3. Also, the
information displayed is similar to that shown in Figure 7.2. The difference is that physical activity data were added to the middle panel. The left panel still contains
the SEQN number, which provides the primary correspondence key for data pertaining to
individuals. The right panel contains the survival data and is identical to the right panel in
Figure 7.2.
Given this data structure, it is useful to consider the options that would make sense for
fitting such data. Indeed, simply plugging in the high-dimensional predictors with complex
correlations (e.g., physical activity data) in the right-hand side of the Cox proportional
hazards model (7.2) is not feasible. One solution could be to extract low-dimensional sum-
maries of the data (e.g., mean, standard deviation) and use them as predictors in the Cox
model. Another solution could be to decompose the functional data using principal compo-
nents (possibly smoothed) and plug in the principal component scores in the Cox regression
[139, 155, 252]. As in standard functional regression, the shape and interpretation of the
functional coefficient depends strongly on the number of principal components. Therefore,
in addition to these approaches, we will consider models that directly estimate the func-
tional coefficient using penalized regression. To build the intuition, we further explore the
data and provide a progression of models that build upon the standard Cox proportional
hazards model.
low PA group. In contrast, there were 1,333 study participants who were still alive 4 years
later in the high PA group.
For the study participants between the ages of 20 and 50 the probability of survival
is very high at all times irrespective of the physical activity level group. Moreover, there
are no visual differences between the KM curves. This is reassuring, as it corresponds to
what is expected among young adults. For study participants between the ages of 50 and
80 the survival probabilities are much lower and there are substantial differences between
the three PA subgroups. The survival probability in the high PA group is higher than in
the moderate PA group and much higher than in the low PA group.
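A sketch of this comparison is below; the PA_group variable (tertiles of total MIMS labeled low/moderate/high) is an assumed addition to the data frame.

library(survival)
#Kaplan-Meier curves by physical activity group among participants aged 50-80
km_fit <- survfit(Surv(time, event) ~ PA_group,
                  data = subset(nhanes_df_surv, age >= 50 & age < 80))
plot(km_fit, col = 1:3, lwd = 2, xlab = "Years since enrollment",
     ylab = "Survival probability")
legend("bottomleft", legend = names(km_fit$strata), col = 1:3, lwd = 2)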
TABLE 7.1: Model 1 shows the coefficient and p-values from a Cox proportional hazard
model assessing the association between age, gender, race and risk of death. Model 2 expands
model 1 by adding BMI and CHD status to the model. Model 3 expands model 2 by adding
total MIMS (TMIMS) to the model. The baseline level of race is Mexican American, and
the baseline level of CHD is No (not diagnosed with CHD at the time of interview).
Models                        M1                  M2                  M3
C-index                      0.754               0.760               0.783
                            β        p          β        p          β        p
age                        0.102   <0.001      0.098   <0.001      0.071   <0.001
gender: Female            −0.273   <0.001     −0.214    0.005     −0.060    0.442
race: Non-Hispanic White   0.211    0.196      0.183    0.263      0.072    0.659
race: Non-Hispanic Black   0.180    0.294      0.210    0.222      0.084    0.626
race: Non-Hispanic Asian  −0.390    0.090     −0.412    0.075     −0.469    0.043
race: Other Hispanic      −0.170    0.432     −0.163    0.450     −0.149    0.489
race: Other Race           0.543    0.070      0.519    0.084      0.437    0.146
BMI                                            −0.006    0.384     −0.025   <0.001
CHD: Yes                                        0.568   <0.001      0.441   <0.001
TMIMS                                                              −0.0002  <0.001
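A sketch of these three models is below; TMIMS (the total daily MIMS) is assumed to have been added to the data frame.

library(survival)
#Model M1: demographic predictors only
fit_m1 <- coxph(Surv(time, event) ~ age + gender + race, data = nhanes_df_surv)
#Model M2: adds BMI and CHD status
fit_m2 <- update(fit_m1, . ~ . + BMI + CHD)
#Model M3: adds total MIMS
fit_m3 <- update(fit_m2, . ~ . + TMIMS)
#Concordance (C-index) reported in the table
summary(fit_m3)$concordance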
TMIMS (model M3). Indeed, the hazard is estimated to increase by exp(0.102) = 1.108 (or
10.8%) for a 1 year increase in age in M1. In M3 the increase in hazard is exp(0.071) = 1.073
(or 7.3%) for individuals with the same level of TMIMS (and other covariates). This result
shows that there is a strong association among the variables age, TMIMS, and mortality.
It also suggests physical activity, a modifiable risk factor, may be a target for interventions
designed to reduce mortality.
The point estimator for “gender: Female” is negative in all models, which corresponds
to a lower risk of mortality. The result is not surprising as it is well known that women
live longer than men. Indeed, in the US the life expectancy at birth for females has been
about 5 years longer than for males; see Figure 1 in [7] on Life expectancy at birth by sex
in the US 2000–2020. In our data, this effect is not statistically significant in model M3 (p-value = 0.442). The reduction in the point estimate for the log hazard is substantial when
accounting for CHD. However, the reduction in the point estimator is extraordinary when
accounting for TMIMS.
In these data, none of the race categories had a significantly different mortality hazard
compared to the reference category Mexican American.
Results for BMI are qualitatively different in models M2 and M3. Indeed, in M2 there
is no statistically significant association (at significance level α = 0.05) between BMI and
hazard of mortality. However, in model M3, which adjusts for TMIMS, there is an estimated
protective effect of increased BMI. Such effects are known in the literature as the “obesity
paradox” [215, 220, 225]. Our models were not designed to address this paradox, but they do
suggest that (1) after accounting for age, gender, race, and CHD the obesity paradox is
not statistically significant; and (2) after accounting for an objective summary of physical
activity, the obesity paradox is statistically significant with a large change in the point
estimator. This suggests a strong dependence between BMI, objective physical activity, and
mortality risk.
Both models M2 and M3 indicate that having and reporting a history of CHD (category
Yes) is a strong, statistically significant predictor of mortality. The hazard ratio in model M2
is estimated to be exp(0.568) = 1.76 relative to study participants in the CHD No category.
In M3 the increase in hazard is exp(0.441) = 1.55 compared to individuals of the same age,
gender, race and TMIMS levels. The reduction in the estimated hazard rate between model
M2 and M3 is due to the introduction of the objective physical activity summary TMIMS.
TMIMS is highly significant (p-value< 0.001) after accounting for age, gender, race, BMI,
and CHD. Including this variable substantially affects the estimated effect and interpretation
of all other variables, except race. This indicates that objective measures of physical activity
are important predictors of mortality and that they interact in complex ways with other
traditional mortality risk factors. The effect on hazard ratio is not directly interpretable
from the point estimator because MIMS are not expressed in standard units. However, the
estimator indicates a covariate-adjusted hazard ratio of exp{−0.0002 × (Q3 − Q1 )} = 0.36,
when comparing a study participant at the third quartile with one at the first quartile of
TMIMS. As a comparison, the covariate-adjusted hazard ratio is exp{0.071 × (Q1 − Q3 )} =
0.48 when comparing a study participant at the first quartile with one at the third quartile
of age.
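For example, these covariate-adjusted hazard ratios can be computed directly from the point estimates quoted above:

#Hazard ratio comparing the third to the first quartile of TMIMS in model M3
Q_tmims <- quantile(nhanes_df_surv$TMIMS, c(0.25, 0.75), na.rm = TRUE)
exp(-0.0002 * (Q_tmims[2] - Q_tmims[1]))
#Hazard ratio comparing the first to the third quartile of age in model M3
Q_age <- quantile(nhanes_df_surv$age, c(0.25, 0.75), na.rm = TRUE)
exp(0.071 * (Q_age[1] - Q_age[2]))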
In model M3 we combine the objective physical activity data with other predictors by
taking the summation of activity intensity values across time of day. While the newly created
TMIMS variable is interpretable under the Cox proportional hazard model, one can argue
that we lose substantial information on within-day physical activity variability through such
compression. One solution is to calculate the total MIMS in two-hour windows (0-2 AM,
2-4 AM, ...) and use them as 12 separate predictors in a Cox proportional hazard model.
However, the interpretation of these predictors becomes challenging due to their intrinsic
correlation between adjacent hours. For example, the TMIMS at 2-4 AM is very close to
the TMIMS at 4-6 AM for most study participants. In addition, we still lose the variability
within each two-hour window.
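A sketch of this construction from the minute-level MIMS matrix:

#Sum the 1,440 minute-level MIMS values within twelve two-hour windows
window_id <- rep(1:12, each = 120)
tmims_2h <- t(apply(unclass(nhanes_df_surv$MIMS), 1,
                    function(x) tapply(x, window_id, sum)))
colnames(tmims_2h) <- paste0("TMIMS_", seq(0, 22, by = 2), "h")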
Another solution is to directly model the minute-level physical activity data, as they can
be viewed as functional data observed on a regular grid (minute) of the domain (time of day
from midnight to midnight). However, this leads to both methodological and computational
challenges, since it is unreasonable and impractical to simply include minute-level intensity
values as 1440 predictors in a Cox proportional hazard model. To solve these challenges, we
introduce penalized functional Cox models in the next section, which model the functional
predictors observed at baseline and survival outcome through penalized splines. We also
describe the detailed implementation of these models using the refund and mgcv software.
In this model neither Wi (s) nor β(s) depend on t, the domain of the event time process.
This can be a bit confusing, as Wi (s) is a function of time, though s refers to time within the
baseline visit day, whereas t refers to time from the baseline visit. Therefore, the proportional
hazards assumption of the Cox model is reasonable and could be tested.
Interpreting the association between the functional predictor and the log hazard follows
directly from the interpretation of the functional linear model. More precisely, one inter-
pretation is that the term $\exp\{\int_S \beta(s)\,ds\}$ corresponds to the multiplicative increase in one's hazard of death if the entire covariate function, $W_i(s)$, was increased by one (with $Z_i$ held
constant). More generally, β(s) is a weight function for Wi (s) to obtain its overall contri-
bution towards the hazard of mortality over the functional domain S. It may be helpful to
center Wi (s) such that β(s) can be interpreted as a unit change in the log hazard relative
to some reference population (e.g., the study population average).
Surprisingly, there are few published methods for estimating Model (7.5). In particular,
[94, 155, 238] proposed different versions of the “linear functional Cox model”, which in-
cluded a linear functional term of the form $\int_S W_i(s)\beta(s)\,ds$ in the log-hazard expression to
capture the effect of the functional covariate {Wi (s) : s ∈ S}. Here we focus on estimating
the linear functional Cox model based on penalized regression splines, though we also touch
on the functional principal components basis method proposed by [155].
7.3.1.1 Estimation
The first step of the estimation procedure is to expand the functional coefficient as $\beta(s) = \sum_{k=1}^{K}\beta_k B_k(s)$, where $B_1(s), \ldots, B_K(s)$ is any functional basis. We will use spline bases and penalize the roughness of the functional coefficient $\beta(\cdot)$ via a quadratic penalty on the $\beta_k$, $k = 1, \ldots, K$, parameters, though other approaches have been used in the literature.
With this basis expansion, model (7.5) becomes
$$\log \lambda_i\{t|Z_i, W_i(\cdot)\} = \log\{\lambda_0(t)\} + Z_i^t\gamma + \int_S W_i(s)\sum_{k=1}^{K}\beta_k B_k(s)\,ds\;. \qquad (7.7)$$
A popular alternative is to expand both the functional predictor, Wi (s), and the functional
coefficient, β(s), in the space spanned by the first K functional principal components of the
functional predictor. With this approach, model (7.7) becomes a standard Cox regression
on the first K scores for each participant, estimated via standard Cox regression software
(e.g., the survival package). This approach tends to perform well in terms of prediction,
though it can provide functional coefficient estimates that can vary wildly with the choice
of the number of principal components. We avoid this problem by expanding the functional
coefficient in a rich enough basis and then penalizing the roughness of the coefficient directly.
The philosophy is similar to that of semiparametric smoothing [258], but applied to functional data. This is a subtle point that is likely to become increasingly accepted.
We now provide the details of our approach, which builds on the penalized Cox model
introduced by [94]. Recall that we only observe Wi (s) at a finite number of points s1 , . . . , sp .
Therefore, the following approximation can be used for model (7.7)
$$\begin{aligned}
\log \lambda_i\{t|Z_i, W_i(\cdot)\} &= \log\{\lambda_0(t)\} + Z_i^t\gamma + \int_S W_i(s)\sum_{k=1}^{K}\beta_k B_k(s)\,ds\\
&\approx \log\{\lambda_0(t)\} + Z_i^t\gamma + \sum_{j=1}^{p} q_j W_i(s_j)\sum_{k=1}^{K}\beta_k B_k(s_j)\\
&= \log\{\lambda_0(t)\} + Z_i^t\gamma + \sum_{k=1}^{K}\beta_k\sum_{j=1}^{p} q_j W_i(s_j)B_k(s_j)\;, \qquad (7.8)
\end{aligned}$$
where $q_j$ is the quadrature weight. Note that $\sum_{j=1}^{p} q_j W_i(s_j)B_k(s_j) = C_{ik}$ is a random variable that depends on the study participant, $i$, and on the basis function number, $k$. This becomes a standard Cox regression on the covariates $Z_i$ and $C_i = (C_{i1}, \ldots, C_{iK})^t$.
Because K is typically large, maximizing the Cox partial likelihood results in substantial
overfitting. The penalized spline approach addresses overfitting by adding a penalty on the
curvature of β(s). This penalty is added to the Cox partial log likelihood of (7.3), resulting
in the penalized partial log likelihood
$$l(\gamma, \beta) = \sum_{i:\delta_i = 1}\Big[(Z_i^t\gamma + C_i^t\beta) - \log\Big\{\sum_{j:Y_j \ge Y_i}\exp(Z_j^t\gamma + C_j^t\beta)\Big\}\Big] - \lambda P(\beta)\;. \qquad (7.9)$$
For a given fixed basis there are multiple choices for the penalty term. The mgcv package
offers a wide variety of bases and penalty terms, including quadratic penalties of the form
$\lambda P(\beta) = \lambda\beta^t D\beta$, where $D$ is a known matrix and $\lambda$ is a scalar smoothing parameter.
This form can easily be extended to multiple smoothing parameters. Penalties of this form
are quite flexible and include difference penalties associated with penalized B-splines [71],
second derivative penalties associated with cubic regression splines and thin plate regression
splines [221, 258, 303, 313] and many more [319]. The balance between model fit (partial log
likelihood) and the smoothness of β(·) (penalty term) is controlled by the tuning parameter
λ, which can be fixed or estimated from the data. There are many methods for smoothing
parameter selection in nonparametric regression, though fewer were developed and evaluated
for the linear functional Cox model. For example, [94] proposed to use an AIC-based criterion
for selecting λ. Here, we use the marginal likelihood approach described in [322].
From a user’s perspective, the challenge is how to make the transition from writing com-
plex models to actually setting up the software syntax that optimizes the correct penalized
log likelihood. This is a crucial step that needs special attention in line with our philosophy:
“writing complex models is easy, but choosing the appropriate modeling components and
implementing them in reproducible software is not.”
As with scalar-on-function regression for exponential family outcomes, the refund pack-
age makes model estimation easy. Suppose we have a data frame of the form presented in
Figure 7.3 and we wish to fit the linear functional Cox model with age, BMI, gender and
CHD as scalar predictors, and participant average minute-level MIMS as a functional pre-
dictor. This model can be fit using the refund::pfr() function as follows:
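A minimal sketch of such a call, consistent with the specification described in the next paragraph (the object name fit_lfcm is ours), is:

library(refund)
#Linear functional Cox model: scalar covariates plus a linear functional
#term for the minute-level MIMS profiles stored as a matrix column
fit_lfcm <- pfr(time ~ age + BMI + gender + CHD +
                  lf(MIMS, bs = "cc", k = 30),
                weights = event, data = nhanes_df_surv,
                family = cox.ph())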
The refund package uses mgcv for the model estimation routine and survival outcomes
are specified differently than in most other R survival regression functions. Specifically, mgcv
currently only accepts right censored survival data indicated by the event variable specified
in the weights argument. Otherwise, syntactically the model is written exactly as it would
be for any other response distribution outcome. That is, we specify time to event (time)
as the outcome, which is a linear function of scalar predictors age, BMI, gender, CHD, and
a linear functional term of the functional data MIMS stored as a matrix within the data
frame nhanes df surv. The functional parameter β(·) is modeled using cyclic splines (bs =
"cc") since time of day, the functional domain, is cyclical, using K = 30 basis functions (k
= 30). The last component is to specify the statistical model using the family = cox.ph()
argument.
As with Chapters 4-6, although the utility of the refund package makes model fitting
straightforward, it is instructive to understand what refund is doing “under the hood.”
To see this, we fit the same model using mgcv::gam() directly. To do so, in addition
to the data inputs required for estimation via pfr, we require (1) the matrix associated
with the functional domain $S = 1_n \otimes s^t$, where $s^t = [s_1, \ldots, s_p]$ is the row vector containing the domain of the observed functions (in the case of the NHANES data, $p = 1440$ and $s^t = [1, \ldots, 1440]/60$) and $\otimes$ denotes the Kronecker product; and (2) a matrix containing the quadrature weights associated with each functional predictor multiplied elementwise by the functional predictor, $W_L = W \circ L$, where $W$ is the matrix with each row containing one participant's MIMS profile and $L = 1_n \otimes q^t$ with $q^t = [q_1, \ldots, q_p]$ the row vector containing the quadrature weights. The code below constructs these matrices and adds them
to the data frame which we will pass to mgcv. Note that these required data inputs are
precisely the same as were required for fitting scalar-on-function regression in Chapter 4
with one functional predictor.
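A sketch of this construction is below; the R names svec, lvec, smat, lmat, and wlmat follow the names used in the text and in the prediction code later in this section.

#Number of observations per function and functional domain S = [0, 1]
nS <- ncol(nhanes_df_surv$MIMS)                      #p = 1440
svec <- seq(0, 1, length.out = nS)                   #grid s_1, ..., s_p
lvec <- rep(1 / nS, nS)                              #Riemann quadrature weights q_j
#Store the domain, the weights, and their elementwise product with the
#functional predictor by row as matrix columns of the data frame
nhanes_df_surv$smat  <- I(matrix(svec, nrow(nhanes_df_surv), nS, byrow = TRUE))
nhanes_df_surv$lmat  <- I(matrix(lvec, nrow(nhanes_df_surv), nS, byrow = TRUE))
nhanes_df_surv$wlmat <- I(unclass(nhanes_df_surv$lmat) *
                          unclass(nhanes_df_surv$MIMS))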
In our example, the functions are observed on a regular, equally spaced grid, so con-
structing the matrices of interest is relatively straightforward. The choice of the functional
domain is effectively arbitrary, and here we choose S = [0, 1] to match the default behavior
of refund::pfr(). Therefore, we start with an equally spaced grid on [0, 1] stored in the
svec variable. This vector corresponds to the grid sj , j = 1, . . . , p, where p corresponds to
nS in the R code. Given our choice of S, the quadrature weights associated with Riemann
integration are simply 1/1440 (1440 observations per function) and are stored in the lvec
variable. This vector corresponds to the quadrature weights qj , j = 1, . . . , p. These vectors
are common to all study participants, and they are simply stored by row as matrices in our data frame, corresponding to $S$ and $L$, respectively. The last step is to create a matrix corresponding to the pointwise product of the quadrature weights and the functional predictor, stored as $W_L$. The matrix $W_L$ is the $n \times p$ dimensional matrix with the $(i,j)$th entry equal to $q_j W_i(s_j)$. Only minor changes to the code above would be necessary for
irregular data or other quadrature weights (e.g., trapezoidal, Simpson’s rule). With these
data objects constructed, we can fit the linear functional Cox model using mgcv directly.
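A sketch of the direct mgcv call, mirroring the pfr() specification above:

library(mgcv)
#Linear functional Cox model fit directly with mgcv; the s() term with a
#matrix argument and matrix by variable implements the numerical integral
fit_lfcm_gam <- gam(time ~ age + BMI + gender + CHD +
                      s(smat, by = wlmat, bs = "cc", k = 30),
                    weights = event, data = nhanes_df_surv,
                    family = cox.ph())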
This syntax is similar to that used by refund::pfr() with the exception of the code used
to construct the linear functional term. We now connect the mgcv syntax to the penalized
partial log likelihood for our Cox model. First, note that the s() function is the mgcv smooth
term constructor. By specifying s(smat, bs = "cc", k = 30, ...), mgcv effectively adds the term $\sum_{k=1}^{K=30}\beta_k B_k(s_j)$ to the linear predictor, where the basis, $B_k(\cdot)$, contains cyclic cubic splines (see ?smooth.terms; other bases can be specified using the bs argument). The final component is the by argument within the call to s(), which instructs mgcv to multiply each term specified in the smooth constructor by the corresponding object, in this case wlmat. Recall that wlmat is an $n \times p$ dimensional matrix with the $(i,j)$th entry equal to $q_j W_i(s_j)$ in
the notation of (7.8). Together, this syntax adds the term
$$\sum_{j=1}^{p}\Big\{\sum_{k=1}^{K}\beta_k B_k(s_j)\Big\}\{W_i(s_j)q_j\} = \sum_{k=1}^{K}\beta_k\Big\{\sum_{j=1}^{p} q_j W_i(s_j)B_k(s_j)\Big\}$$
to the linear predictor, which corresponds to the functional component in model (7.8). The
final component is the penalization. This is added automatically to the log likelihood by
FIGURE 7.6: Estimated functional coefficient, β(s), obtained by calling (A) mgcv::gam()
directly; and from (B) refund::pfr(). Point estimates are shown as solid black lines and
the shaded regions correspond to 95% pointwise unadjusted confidence intervals. A red
dashed line corresponds to a null effect of β(s) = 0 for every s.
the s() function, though an unpenalized fit could be specified by supplying the argument
fx = TRUE to the call to s().
We plot the results of our two models and demonstrate their nearly identical estimates
for β(s) in Figure 7.6. Although the plots look very similar, they are not identical because
pfr uses trapezoidal integration by default instead of midpoint Riemann integration. A
version of the plots presented in Figure 7.6 can be created quickly using plot.gam(), a
plot function specifically for the gam class objects. Notice that the plot function for pfr
internally calls plot.gam() as well. In the following sections, we describe how to manually
extract point estimates and their corresponding standard error estimates for the functional
coefficient, as well as predicted survival probabilities. This will allow us to make personalized
plots outside of the plot.gam() function, such as Figure 7.6, and allow the extraction of
important components that can be used for other inferential purposes.
$$\mathrm{Var}\{\widehat{\beta}(s)\} = \mathrm{Var}\{B(s)\widehat{\beta}\} = B(s)\mathrm{Var}(\widehat{\beta})B^t(s)\;,$$
where B(s) = [B1 (s), . . . , BK (s)] is the row vector containing the basis functions used to
model β(s) evaluated at s. These confidence intervals retain the properties of smooth terms
estimated via penalized regression splines in generalized additive models [216, 258]. That
is, the coverage is nominal when averaged across the domain, S, but may be above or below
the nominal level for any given $s$. As such, users should be cautious when interpreting pointwise confidence intervals constructed using this procedure.
The pfr() function has a coef.pfr() method, which allows for easy extraction of both
coefficient estimates and the associated standard errors. Obtaining estimated standard er-
rors from the mgcv::gam() fit requires more work, but can be done using the functionality
of the predict.gam method. The first step with both approaches is to create a vector of
points on the functional domain, spred , where we want to make predictions. From there, a
simple call to coef.pfr() provides the pfr() results. For the mgcv::gam() fit, a deeper
dive is necessary into the functionality of predict.gam(). Specifically, the predict.gam()
function allows users to obtain contributions to the linear predictor associated with each row
of a new data frame supplied to the function. First, recall the form of the linear predictor
presented in (7.13), for a given $s_j$. If we create a data frame with $W_i(s_j) = 1$ and $q_j = 1$, then the contribution for the "new" data is simply $\widehat{\beta}(s_j) = \sum_{k=1}^{K}\widehat{\beta}_k B_k(s_j)$. Thus, we need only create a new data frame according to this procedure with each row corresponding to each element of $s_{pred}$. The code below shows how to obtain the fits for pfr() and mgcv.
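A sketch of the extraction from the pfr() fit is below; fit_lfcm is the hypothetical pfr object from the earlier code, and the analogous predict.gam() route for the mgcv fit is described in the following paragraph.

#Grid of points on the functional domain where beta(s) is evaluated
slen_pred <- 100
sind_pred <- seq(0, 1, length.out = slen_pred)
#coef.pfr() returns the estimate and standard error on an equally spaced grid;
#useVc = FALSE matches the default behavior of predict.gam()
coef_pfr <- coef(fit_lfcm, n = slen_pred, useVc = FALSE)
head(coef_pfr)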
A few points are worth noting. First consider the call to coef.pfr(). The n argument
specifies the number of equally spaced points on the functional domain to evaluate on.
Setting n = slen_pred corresponds to evaluations on sind_pred. It is possible to specify
these points directly, but we prefer this more direct approach. Specifying useVc=FALSE
is done to be consistent with the default behavior of predict.gam() which, by default,
returns variance estimates conditional on the estimated smoothing parameter. For some
models fit by ML/REML, adjustments for uncertainty in the smoothing parameter can
be done. Moving to the mgcv::gam() fit, consider the data frame df_lfcm_gam, which is supplied to the predict.gam() function. This data frame has two columns, with each row corresponding to a point $s \in s_{pred}$ (syntactically denoted as sind_pred). The first column in df_lfcm_gam specifies the value of smat, which is the name of the matrix of the functional domain used in the model, as each element of $s_{pred}$. The second column, wlmat, corresponds to the term $q_j W_i(s_j)$, which we set to 1 per the formula described in the previous paragraph.
We are interested in testing the global null hypothesis
$$H_0: \beta(s) = 0 \quad \forall\, s \in s_{pred}\;,$$
as well as local null hypotheses of the form
$$H_0: \beta(s_0) = 0, \quad s_0 \in R \subset s_{pred}\;,$$
while maintaining a family-wise error rate (FWER) of $\alpha$ for all $s \in s_{pred}$. Below we present four approaches to performing these hypothesis tests. The discussion here is somewhat abbreviated and we refer readers to Section 4.5.2 for further discussion.
where the approximation is a result of the variability inherent in simulation and the asymptotic normality of $\widehat{\beta}$. Inverting the p-value, we obtain $1-\alpha$ correlation and multiplicity adjusted confidence intervals of the form
$$\widehat{\beta}(s) \pm q(C_{\widehat{\beta}}, 1-\alpha)\,\mathrm{SE}\{\widehat{\beta}(s)\}\;.$$
Figure 7.7 presents the results from the procedure above. The left panel of Figure 7.7
plots the pointwise CMA p-values (light gray line) and the pointwise unadjusted p-values
(dark gray line). The right panel of Figure 7.7 plots the 95% CMA confidence bands (light
gray shaded region) and the unadjusted 95% pointwise confidence intervals (dark gray
shaded region). We see that the majority of the times of day where the pointwise inter-
vals suggested statistical significance continue to be statistically significant at α = 0.05.
FIGURE 7.7: Pointwise CMA inference for SoFR based on simulations from the distribution
of spline coefficients. BMI is the outcome and PA functions are the predictors. Left panel:
estimated pointwise unadjusted (dark gray) and CMA (light gray) p-values denoted by
ppCMA (s). Right panel: the 95% pointwise unadjusted (dark gray) and CMA confidence
intervals for β(s).
Figure 7.8 presents the results of the non-parametric bootstrap using the same format
as Figure 7.7, with pointwise CMA and unadjusted p-values in the left panel (light/dark
gray, respectively) and 95% CMA confidence bands and unadjusted pointwise confidence
intervals in the right panel (light/dark gray, respectively). Interestingly, the nonparametric
bootstrap produces narrower global confidence bands than the parametric bootstrap, with
a global 95% multiplier of ≈ 2.76 versus ≈ 2.92 for the parametric bootstrap. It is unclear
why the nonparametric bootstrap produces narrower confidence bands at this time, but
practical interpretations are largely similar. The p-values associated with the global null
hypothesis are estimated to be smaller than one divided by the number of bootstraps in
both cases, which suggests a highly significant association between diurnal physical activity
patterns and mortality.
FIGURE 7.8: Pointwise CMA inference for the linear functional Cox model based on the
nonparametric bootstrap of the max statistic. Left panel: estimated pointwise unadjusted
(dark gray) and CMA (light gray) p-values denoted by ppCMA (s). Right panel: the 95%
pointwise unadjusted (dark gray) and CMA confidence intervals for β(s).
where $C_{\widehat{\beta}}$ is the correlation matrix corresponding to the covariance matrix $\mathrm{Var}\{\widehat{\beta}(s_{pred})\} = B(s_{pred})\mathrm{Var}(\widehat{\beta})B^t(s_{pred})$. As we discussed in Section 2.4.1, we need to find a value $q(C_{\widehat{\beta}}, 1-\alpha)$ such that
$$P\{-q(C_{\widehat{\beta}}, 1-\alpha) \times e \le X \le q(C_{\widehat{\beta}}, 1-\alpha) \times e\} = 1 - \alpha\;,$$
where $X \sim N(0, C_{\widehat{\beta}})$ and $e = (1, \ldots, 1)^t$ is the $|s_{pred}| \times 1$ dimensional vector of ones. Once $q(C_{\widehat{\beta}}, 1-\alpha)$ is available, we can obtain a CMA $1-\alpha$ level confidence interval for $\beta(s)$ as
However, because the rank of $\widehat{\beta}(s_{pred}) = B(s_{pred})\widehat{\beta}$ is at most $K$, it has a degenerate normal distribution, where $K$ is the number of basis functions used to estimate $\beta(s)$. Since we have evaluated $\widehat{\beta}(s)$ on a grid of $|s_{pred}| = 100$ points, the covariance and correlation matrices of $\widehat{\beta}(s_{pred})$ are 100 dimensional. A simple example for understanding this problem is to consider the case where $\widehat{\beta}(s) = c + s$ for $c \in \mathbb{R}$. Then $\mathrm{rank}(C_{\widehat{\beta}}) \le 2$ for any choice of $s_{pred}$.
Recall that if $\widehat{\beta}(s_{pred})$ has a degenerate multivariate normal distribution of rank $m \le K$, then there exists some random vector of independent standard normal random variables $Q \in \mathbb{R}^m$ such that
$$\widehat{\beta}^t(s_{pred}) = Q^t D\;,$$
with $D^t D = B(s_{pred})\mathrm{Var}(\widehat{\beta})B^t(s_{pred})$. If we find $m$ and a matrix $D$ with these properties, the problem is theoretically solved. Consider the eigendecomposition $B(s_{pred})\mathrm{Var}(\widehat{\beta})B^t(s_{pred}) = U\Lambda U^t$, where $\Lambda$ is a diagonal matrix of eigenvalues and $UU^t = I_{|s_{pred}|\times|s_{pred}|}$ is an orthonormal matrix with the $k$th column of $U$ being the eigenvector corresponding to the $k$th eigenvalue. Note that all eigenvalues $\lambda_k = 0$ for $k > m$ and
$$B(s_{pred})\mathrm{Var}(\widehat{\beta})B^t(s_{pred}) = U_m \Lambda_m U_m^t\;,$$
where $U_m$ is the $|s_{pred}| \times m$ dimensional matrix obtained by taking the first $m$ columns of $U$ and $\Lambda_m$ is the $m \times m$ dimensional diagonal matrix with the first $m$ eigenvalues on the main diagonal. If we define $D^t = U_m \Lambda_m^{1/2}$, then $D^t D = B(s_{pred})\mathrm{Var}(\widehat{\beta})B^t(s_{pred})$, as required, and the computation can proceed as sketched below.
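A sketch of this computation is below, mirroring the analogous chapter 6 code and reusing the objects Bmat (the spline basis evaluated at $s_{pred}$), Vbeta (the covariance of the spline coefficients), and beta_hat, which also appear in the code chunk that follows.

#Covariance of hat(beta)(s_pred)
Vbeta_pred <- Bmat %*% Vbeta %*% t(Bmat)
#Eigendecomposition via svd and retention of the positive eigenvalues
eVbeta <- svd(Vbeta_pred)
inx_evals_pos <- which(eVbeta$d >= 1e-6)
m <- length(inx_evals_pos)
U_m <- eVbeta$v[, inx_evals_pos]
D_t <- U_m %*% diag(sqrt(eVbeta$d[inx_evals_pos]))
D_inv <- MASS::ginv(t(D_t))
#Project the estimated coefficient onto the m-dimensional space
Q <- t(beta_hat) %*% D_inv
#Max statistic and exact CMA p-value using an identity correlation of dimension m
Zmax <- max(abs(Q))
p_val <- 1 - mvtnorm::pmvnorm(lower = rep(-Zmax, m), upper = rep(Zmax, m),
                              mean = rep(0, m), corr = diag(1, m))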
The resulting p-value is numerically 0. Unlike in Chapter 4, the results here largely ap-
pear to agree with both the bootstrap approaches and the mgcv summary output, though
given the very small p-values in this context, it is hard to verify them via simulation. This is an example of why we say that the p-values from these
methods sometimes agree, but we urge caution in general.
#Get hat(beta)/SE(hat(beta))
beta_hat_std <- beta_hat / se_beta_hat
#Get Var(hat(beta)) evaluated on the prediction grid
Vbeta_hat <- Bmat %*% Vbeta %*% t(Bmat)
#Jitter to make positive definite
Vbeta_hat_PD <- Matrix::nearPD(Vbeta_hat)$mat
#Get correlation function
Cbeta_hat_PD <- cov2cor(matrix(Vbeta_hat_PD, 100, 100, byrow = FALSE))
#Get max statistic
Zmax <- max(abs(beta_hat_std))
#p-value
p_val <- 1 - pmvnorm(lower = rep(-Zmax, 100), upper = rep(Zmax, 100),
                     mean = rep(0, 100), cor = Cbeta_hat_PD)
The resulting p-value is 2.9×10−6 , which is very close to that obtained from the paramet-
ric bootstrap. In principle, this method should yield the same result as the exact solution,
though we find with some consistency that it does not. In our experience this approach pro-
vides results which align very well with the parametric bootstrap, though the theoretical
justification for jittering of the covariance function is, as of this writing, not justified. To
construct a 95% CMA confidence interval, one may use the mvtnorm::qmvnorm function,
illustrated below. Unsurprisingly, the resulting global multiplier is almost identical to that
obtained from the parametric bootstrap. For that reason we do not plot the results or in-
terpret further.
Z_global <- qmvnorm(0.95, mean = rep(0, 100), cor = Cbeta_hat_PD,
                    tail = "both.tails")$quantile
for t ∈ T . Extracting predicted survival probabilities requires users to present new data in
the same form as was supplied to the estimating function, both for pfr() and mgcv::gam(),
with one row for each survival time of interest. In contrast to obtaining predictions for the
functional coefficient, β(·), the procedures for obtaining predicted survival probabilities are
similar for pfr() and mgcv::gam(). Therefore, we show how to obtain them using the
mgcv::gam() function.
Suppose that we were interested in obtaining predicted survival curves for individuals
with the unique identifiers SEQN of 80124 and 62742 from the FLCM on a fine grid over the
first four years of follow-up. Study participant 62742 is a 70-year-old Other Hispanic male
with a BMI of 28.8. Study participant 80124 is a 70-year-old White male with a BMI of
32.7 and with CHD. These two participants were chosen to match on age, gender, and to
have similar BMI, differing primarily on their activity profile Wi (s). As a result, differences
in estimated survival probabilities will be primarily associated with the differences in their
average activity levels. The code below shows how to obtain predicted survival probabilities
for these two individuals.
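A sketch is below, using the mgcv fit fit_lfcm_gam from above; the names tind_pred, df_plt, and surv_preds mirror those referenced in the description that follows.

#Equally spaced grid of prediction times over the first four years of follow-up
t_min <- 0; t_max <- 4
tind_pred <- seq(t_min, t_max, length.out = 100)
#Baseline data for the two participants of interest
df_subj <- subset(nhanes_df_surv, SEQN %in% c(80124, 62742))
#One row per participant and prediction time; the response variable time is
#set to the prediction time, which is how mgcv's cox.ph family evaluates S(t)
df_plt <- df_subj[rep(1:nrow(df_subj), each = length(tind_pred)), ]
df_plt$time <- rep(tind_pred, nrow(df_subj))
#Predicted survival probabilities and standard errors
pred <- predict(fit_lfcm_gam, newdata = df_plt,
                type = "response", se.fit = TRUE)
surv_preds <- data.frame(SEQN = df_plt$SEQN, time = df_plt$time,
                         surv = pred$fit, se = pred$se.fit)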
Going through the code above step-by-step, we first choose a grid on which to obtain
predicted survival probabilities. Here, we specify an equally spaced grid (tind pred) over 0
(t min) to 4 (t max) years. Next, we specify the participants we wish to make predictions for,
SEQN 80124 and 62742. Next, we need to create a data frame (df_plt) with one row for each combination of an individual's baseline covariate values and a prediction time. Then we obtain survival predictions using the predict function with type = "response" to get predictions on the response (survival probability) scale and se.fit = TRUE to obtain the corresponding standard errors. While standard errors on the covariate-dependent portion of the log hazard can be obtained using the Fisher information matrix from the partial log likelihood,
inference on survival probabilities requires one to incorporate uncertainty in the estimate
for the cumulative baseline hazard function Λ(t). There are many methods for obtaining
standard errors on survival curves from a Cox model, though the method used by mgcv is
described in [151]. Note that this process would involve additional steps in the presence of
either time-varying covariates or time-varying effects.
Using the data frame surv preds created above, we can create plots of estimated sur-
vival probabilities with unadjusted pointwise 95% confidence intervals for each of the study
participants. Figure 7.9 displays the ten-minute binned MIMS diurnal profiles, Wi (s), in the
FIGURE 7.9: MIMS profiles (left panel) and estimated survival curves (right panel) obtained
from the fitted FLCM for two participants, SEQN 62742 (red lines) and SEQN 80124 (blue
lines). MIMS profiles were binned into 10-minute intervals prior to plotting for readability.
Estimated survival probabilities are presented as solid lines with 95% confidence intervals
as dashed lines.
left panel and the corresponding estimated survival curves as a function of time in the right
panel for each of SEQN 62742 (red curves) and 80124 (blue curves). We see that SEQN 80124
was overall notably less active during the day, and more active at night. Recall that the
shape of β(s) presented in Figure 7.6 implies that higher activity, particularly during mid-
day, is associated with lower risk. This effect on estimated survival probabilities is reflected
in the right panel of Figure 7.9, with lower estimated survival probability of SEQN 80124.
Given that these individuals were matched on the other covariates included in our Cox
model, these differences are almost entirely attributable to differences in physical activity
patterns.
Estimating model (7.10) using the framework of penalized splines to estimate non-linear
effects proceeds in a similar fashion as was described for the linear functional Cox model in
Section 7.3.1. Specifically, we apply a set of bases to both the functional coefficient, β(s),
and the non-linear effects of non-functional covariates, fq (Ziq ), and penalize the curvature of
the estimated effects through an additive penalty on the log partial likelihood. Model (7.10)
becomes
$$\log \lambda_i\{t|Z_i, W_i(\cdot)\} = \log\{\lambda_0(t)\} + \sum_{q=1}^{Q_1}\gamma_q Z_{iq} + \sum_{q=Q_1+1}^{Q}\sum_{k=1}^{K_q}\alpha_{kq}B_{k,q}(Z_{iq}) + \int_S W_i(s)\sum_{k=1}^{K_\beta}\beta_k B_k(s)\,ds\;,$$
where Kq is the number of basis functions used to model fq (·) and αkq and Bk,q are the cor-
responding spline coefficients and basis functions, respectively. Approximating the integral
term numerically, combining terms, and using vector notation, the model can be re-written
as
$$\log \lambda_i\{t|Z_i, W_i(\cdot)\} = \log\{\lambda_0(t)\} + Z_{i,1}^t\gamma + \sum_{q=Q_1+1}^{Q} B_{i,q}^t\alpha_q + C_i^t\beta\;,$$
where $Z_{i,1} = (Z_{i1}, \ldots, Z_{iQ_1})^t$ and $\gamma = (\gamma_1, \ldots, \gamma_{Q_1})^t$ correspond to the linear covariate effect $\sum_{q=1}^{Q_1}\gamma_q Z_{iq}$, $B_{i,q} = \{B_{1,q}(Z_{iq}), \ldots, B_{K_q,q}(Z_{iq})\}^t$ and $\alpha_q = (\alpha_{1q}, \ldots, \alpha_{K_q q})^t$ correspond to the $Q_2 = Q - Q_1$ nonparametric effects of scalar covariates $\sum_{q=Q_1+1}^{Q} f_q(Z_{iq})$, while $C_i$ and $\beta$ are defined as in Section 7.3.1.1 and correspond to the functional predictor $\int_S W_i(s)\beta(s)\,ds$.
The corresponding penalized log partial likelihood to be maximized is then
$$l(\beta, \xi) = \sum_{i:\delta_i = 1}\Big[\eta_i - \log\Big\{\sum_{j:Y_j \ge Y_i}\exp(\eta_j)\Big\}\Big] - \sum_{q=Q_1+1}^{Q}\lambda_q P(\alpha_q) - \lambda_\beta P(\beta)\;, \qquad (7.11)$$
where $\eta_i = Z_{i,1}^t\gamma + \sum_{q=Q_1+1}^{Q} B_{i,q}^t\alpha_q + C_i^t\beta$, $P(\alpha_q)$ is a known penalty structure matrix
and λq is an unknown smoothing parameter for the nonparametric function fq (·), while
P (β) is a known penalty structure matrix and λβ is an unknown smoothing parameter for
the functional parameter β(·). All smoothing parameters are estimated using a marginal
likelihood approach.
In our data example we set Q = 4 and allow two predictors (Q1 = 2) to have a linear
effect on the log hazard (gender and CHD) and two predictors (Q2 = 2) to have a smooth
nonparametric effect on the log hazard (age and BMI). The code to fit the LFCM with
additive smooth terms for age and BMI via refund::pfr is presented below. Given the
correlation between age, movement, and weight, we are interested in how these estimated
associations change when MIMS profiles are excluded from the model. Thus, we also esti-
mate an additive Cox model which excludes the physical activity (no functional predictor)
for comparison.
#Fit the additive Cox model with smooth effects for age and BMI
fit_acm <- gam(time ~
                 s(age, bs = "cr", k = 30) +
                 s(BMI, bs = "cr", k = 30) +
                 gender + CHD,
               weights = event, data = nhanes_df_surv,
               family = cox.ph())
#Fit the additive Cox model above with a linear functional effect of MIMS
fit_acm_lfcm <- pfr(time ~
                      s(age, bs = "cr", k = 30) +
                      s(BMI, bs = "cr", k = 30) +
                      gender + CHD +
                      lf(MIMS, bs = "cc", k = 30),
                    weights = event, data = nhanes_df_surv,
                    family = cox.ph())
The resulting coefficient estimates from these models can be plotted quickly using the
plot.gam function. Specifically, one could plot the coefficients using the following code.
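For example, a minimal version of these plots can be obtained directly from the fitted objects:

#Estimated f(age) and f(BMI) from the additive Cox model
plot(fit_acm, pages = 1, scale = 0)
#Estimated f(age), f(BMI), and beta(s) from the model that adds the
#linear functional term for MIMS; plot.pfr() calls plot.gam() internally
plot(fit_acm_lfcm, pages = 1, scale = 0)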
With some additional effort for formatting, these plots become Figure 7.10. Fig-
ure 7.10(A) presents the estimated coefficients f (age), f (BMI) from the additive Cox model
in the left and middle panels, respectively. Figure 7.10(B) presents the same coefficients es-
timated in the additive functional Cox model along with the estimated effect of physical
activity, β(s), in the right column. Estimated age and BMI effects are non-linear in both
models, with the estimated BMI effect more strongly non-linear. In the additive Cox model,
the lowest risk BMI is estimated to be around 30, which corresponds to the border between
“overweight” and “obese” BMI. This effect, referred to as the “obesity paradox,” has been
observed in other studies, with potential confounding mechanisms proposed [163, 215, 225].
A full discussion of this topic is beyond the scope of our data application. Adjusting for
participants’ MIMS profiles (linear functional additive Cox model), the estimated effects
of both age and BMI are attenuated (pulled toward 0). Interestingly, adjusting for indi-
viduals' activity patterns, the lowest risk BMI is shifted to around 40-45, though
confidence intervals for the BMI effect in both models are quite wide. This may be due
to the relatively short follow-up length for mortality data currently available in NHANES
2011-2014. A final observation is that, (non-linearly) adjusting for age and BMI, the esti-
mated association of activity is attenuated slightly as compared to the results presented in
Figure 7.6.
FIGURE 7.10: (A) Estimated associations between age (left panel) and BMI (right panel) and the log hazard of all-cause mortality in an additive Cox model. (B) Estimated associations between age (left panel) and BMI (middle panel) and the log hazard of all-cause mortality in an additive Cox model that also includes a linear functional term for the MIMS profiles; the estimated functional coefficient β(s) is shown in the right panel. Point estimates (solid black lines) and 95% point-wise confidence intervals (gray shaded regions) are presented. Both models adjust for gender and coronary heart disease as linear effects.
outcome, one limitation is that it only allows a linear association between the functional
predictor at each location of the domain and the log hazard. That is, for each s ∈ S, the
association between Wi (s) and log hazard is modeled through a fixed unknown parameter
β(s), which is the same across different values of Wi (s). In this section we show how to
extend the linear functional Cox model to the case when one is interested in allowing the
effect to vary with the magnitude of Wi (s).
The additive functional Cox model [56] replaces the linear functional term $\int_S W_i(s)\beta(s)\,ds$ with the functional term $\int_S F\{s, W_i(s)\}\,ds$, where $F(\cdot,\cdot)$ is an unknown bivariate smooth function to be estimated. The model becomes
$$\log \lambda_i\{t|Z_i, W_i(\cdot)\} = \log\{\lambda_0(t)\} + Z_i^t\gamma + \int_S F\{s, W_i(s)\}\,ds\;. \qquad (7.12)$$
To ensure identifiability, we impose the constraint $E[F\{s, W(s)\}] = 0$ for each $s \in S$; see [56] for a detailed discussion of identifiability. The functional term of the additive
functional Cox model can be estimated using a tensor product of penalized splines, as
introduced in [198]. Specifically, denote by $B_l^s(\cdot)$, $B_k^w(\cdot)$, $l = 1, \ldots, K_s$, $k = 1, \ldots, K_w$, two univariate spline bases. With this notation, the functional coefficient can be expanded as $F(\cdot,\cdot) = \sum_{l=1}^{K_s}\sum_{k=1}^{K_w}\beta_{lk}B_l^s(\cdot)B_k^w(\cdot)$, where $\{\beta_{lk}: l = 1, \ldots, K_s;\ k = 1, \ldots, K_w\}$ are the corresponding spline coefficients.
is a constant and the model reduces to a standard Cox regression model and the estimation
framework follows from the linear functional Cox model introduced in Section 7.3.1. Note
that, again, the sum in equation (7.13) is over all parameters βlk with the weights cijk .
Therefore, the exact same ideas used multiple times in this book can be applied here: use
the by option in mgcv::gam() to implement this regression.
The R syntax to fit an additive functional Cox model using the mgcv package is shown
below. Notice that here we use MIMS_q to denote quantiles of MIMS at each minute. For
the NHANES data set, the model was fit using quantiles of MIMS instead of their abso-
lute values. This step is necessary to ensure that data reasonably fills the bivariate domain
spanned by the s and w directions; for more details on estimability and identifiability of
these models see [56]. To specify the tensor product of two univariate penalized splines,
we use the ti() function from the mgcv package. The first two arguments of the ti() function, MIMS_q and smat, instruct the function to specify a tensor product of two univariate
penalized splines on the domain spanned by Wi (s) and s. The by argument instructs the
package to multiply each specified term by object lmat, which corresponds to the quadra-
ture elements, qj , in the model. The bs and k arguments specify the type and number of
each spline basis, respectively. To ensure identifiability, we need to specify mc = TRUE for
the Wi (s) direction, which imposes the marginal constraint in this direction and matches
well with the identifiability constraint of this model. The rest of the syntax is the same as
that used for fitting a linear functional Cox model.
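A sketch of such a call is below; the scalar covariates, the numbers of basis functions, and the marginal basis types are illustrative assumptions, and MIMS_q, smat, and lmat are assumed to be stored as matrix columns of the (age 50+) analysis data frame, here called nhanes_df_surv_50.

library(mgcv)
#Additive functional Cox model: tensor product smooth of the MIMS quantiles
#and time of day, multiplied by the quadrature weights in lmat
fit_afcm <- gam(time ~ age + BMI + gender + CHD +
                  ti(MIMS_q, smat, by = lmat, bs = c("cr", "cc"),
                     k = c(10, 10), mc = c(TRUE, FALSE)),
                weights = event, data = nhanes_df_surv_50,
                family = cox.ph())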
In this application, we focus on study participants who were greater than or equal to
50 years old and had no missing data for age and BMI. The data set contains 4,207 study
participants. For this specific example where the functional predictor is the minute-level
physical activity intensity value, [56] contains some plots for detailed comparisons on the
estimates before and after quantile transformation. Figure 7.11 displays the estimated sur-
face F {s, Wi (s)} of the additive functional Cox model using NHANES 2011–2014 data. The
figure was created using the vis.gam() function and the code to reproduce Figure 7.11 is
shown below.
FIGURE 7.11: Estimated surface F {s, Wi (s)} from the NHANES 2011-2014 data obtained
by calling mgcv::gam() using the syntax above. The value of F{s, Wi(s)} decreases from green (highest hazard of all-cause mortality) to blue (lowest).
The estimated surface indicates that lower physical activity quantile during the day and
higher physical activity quantile at night are associated with a higher hazard of all-cause
mortality. Specifically, being below the 35th percentile of physical activity intensity in the
population during the daytime (8 AM to 10 PM) is associated with a higher hazard of
mortality. This result is highly interpretable, is consistent with the lifestyle of the majority
of the population, and suggests the benefit of having sleep without major interruptions at
night and being active during the daytime. While the populations and the wearable device
protocols differ between our book and [56], it is interesting to note that the analysis results
from AFCM are quite similar.
We use the function scam [236] to conduct smoothing with monotonicity constraints.
Scam stands for “Shape constrained additive models,” not a choice of name that we endorse,
but an R function that we gladly use. This allows us to obtain a non-decreasing cumulative
baseline hazard. Below, the vector t0 contains the event times and H0_hat contains the Breslow estimator of the cumulative hazard function. The next line uses the scam function to obtain an increasing fit to the cumulative hazard function stored in H0_fit. Here the option bs = "mpi" indicates that the function uses monotone increasing P-splines, while the -1 indicates that the smooth is without an intercept. The remainder of the code chunk indicates how to predict the smooth estimator of the cumulative hazard function at an equally spaced grid of points between [0, 10]. Results are stored in the vector H0_prd.
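A sketch of this step, assuming t0 and H0_hat are already available:

library(scam)
#Monotone increasing P-spline fit to the Breslow cumulative hazard estimator;
#the -1 removes the intercept, as described above
H0_fit <- scam(H0_hat ~ s(t0, bs = "mpi") - 1)
#Smooth, monotone estimate of the cumulative hazard on an equally spaced grid
tgrid.sim <- seq(0, 10, length.out = 1000)
H0_prd <- as.vector(predict(H0_fit, newdata = data.frame(t0 = tgrid.sim)))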
The second step consists of using the functional predictors stored in the matrix X_new and the fitted survival model stored in fit to estimate the study participant-specific linear predictors. The elements in X_new could be the same as the ones used in fitting the model, or could be replaced with new trajectories simulated from a functional data generating mechanism. Here nt is the number of columns of X_new, which is the dimension of the functional predictors (study participants are stored by rows). The vector tind contains an equally spaced grid of points with a dimension equal to the dimension of the functional space. It is used to construct the Riemann sums that approximate the functional regression. The data frame data_sim contains the data structure where predictions are conducted. The act_mat element is the matrix of functional covariates, lmat is a matrix of dimension equal to the dimension of X_new with every entry equal to 1 / nt. The tmat is another matrix of dimension equal to the dimension of X_new, with each row equal to the vector tind. In this case we consider only functional predictors, though additional terms could be added to account for standard covariates. The resulting linear predictors are then stored in the vector eta_i.
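A sketch of this step, assuming the fitted model fit uses the variable names act_mat, lmat, and tmat, as described above:

#Dimensions and domain of the functional predictors (participants by rows)
nt <- ncol(X_new)
tind <- seq(0, 1, length.out = nt)
#Data structure used for prediction
data_sim <- data.frame(
  act_mat = I(X_new),
  lmat    = I(matrix(1 / nt, nrow(X_new), nt)),
  tmat    = I(matrix(tind, nrow(X_new), nt, byrow = TRUE)))
#Study participant-specific linear predictors on the log hazard scale
eta_i <- as.vector(predict(fit, newdata = data_sim, type = "link"))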
The third step uses the survival function $S_i\{t|Z_i, W_i(\cdot)\} = \exp\{-\exp(\eta_i)\Lambda_0(t)\}$, where $\eta_i$ is the linear predictor; see the derivation following equation (7.4). In the code, the matrix Si contains the survival function $S_i\{t|Z_i, W_i(\cdot)\}$ at every time point in the vector tgrid.sim. Each row in the matrix Si corresponds to a simulated study participant and every column corresponds to a particular time from baseline. The vector eta_i contains the simulated linear predictor for all study participants and H0_prd is the vector containing the estimated cumulative hazard function.
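This step reduces to a single outer product of the quantities constructed above:

#Survival function for every simulated participant (rows) at every time
#point in tgrid.sim (columns): S_i(t) = exp{-exp(eta_i) * Lambda_0(t)}
Si <- exp(-exp(eta_i) %o% H0_prd)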
Step 4 consists of simulating survival times for each study participant based on the cal-
culated survival function. This approach uses a simulation trick to obtain samples from a
random variable with a given survival function S(·) with corresponding cdf F (·). Note that
if $U \sim U[0,1]$, then $X = F^{-1}(U)$ is a random variable with survival function $S(\cdot)$. Indeed, for any $x \in \mathbb{R}$, $P\{F^{-1}(U) \le x\} = P\{U \le F(x)\} = F(x)$. The last equation can be rewritten as $P\{1 - F(x) \le 1 - U\} = P\{S(x) \le 1 - U\} = F(x)$. As $1 - U$ follows a uniform distribution on $[0,1]$ (just like $U$), this suggests the following algorithm for simulating random variables with survival function $S(\cdot)$: (1) simulate a random variable from the uniform distribution; and (2) identify the first $x$ such that $S(x) < U$. Below we take advantage of this approach and start by simulating a vector of uniform random variables U of length equal to the number of study participants. Strictly speaking, the algorithm uses $1 - U$, but since $1 - U$ and $U$ have the same distribution the distinction is unnecessary. The simulated survival times are stored in the vector Ti of the
same length as the number of study participants. The for loop is a description of how to
simulate survival times using the Monte Carlo simulation trick described above.
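A sketch of this simulation step:

set.seed(2024)
#One uniform random variable per simulated study participant
U <- runif(nrow(Si))
Ti <- rep(NA, nrow(Si))
for (i in 1:nrow(Si)) {
  #First time at which the survival curve drops below the uniform draw;
  #participants whose curve never drops are assigned the last grid point
  inx <- which(Si[i, ] < U[i])
  Ti[i] <- ifelse(length(inx) > 0, tgrid.sim[min(inx)], max(tgrid.sim))
}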
Multilevel functional data are becoming increasingly common in many studies. Such data
have all the characteristics of the traditional multilevel data, except that the individual
measurement is not a scalar, but a function. For example, the NHANES 2011-2014 study
collected minute-level accelerometer data for up to 7 consecutive days from each participant.
This is a multilevel functional data set because for each participant (level-1) and each
day of the week (level-2) a function is measured (MIMS values summarized at the minute
level). Another example is the study of sleep electroencephalograms (EEGs) described in
[62], where normalized sleep EEG δ-power curves were obtained at two visits for each
participant enrolled in the Sleep Heart Health Study (SHHS) [239]. A far from exhaustive
list of applications includes colon carcinogenesis studies [207, 279], brain tractography and
morphology [98, 109], functional brain imaging through EEG [271], longitudinal mortality
data from period life tables [41], pitch linguistic analysis [8], longitudinal EEG during a
learning experiment [19], animal studies [278], end-stage renal disease hospitalizations in
the US [178], functional plant phenotype [333], objective physical activity [175, 273], and
continuous glucose monitoring [92, 270]. Each application raises unique methodological and
computational challenges that contribute to a rapidly developing area of research. This leads
to a substantial and diverse body of literature that cannot be completely addressed here.
Instead, we focus on a specific group of methods and attempt to make connections with
this vast and rapidly evolving literature. What we present is neither as exhaustive nor as
inclusive as we would like it to be, but it is thematically and philosophically self-contained.
Compared to single-level functional data, multilevel functional data contain additional
structure induced by known sampling mechanisms. Here we define multilevel functional data
analysis (MFDA) as the analysis of functional processes with at least two levels of functional
variability. This includes nested, crossed, and longitudinal functional data structures. Such
data structures are different from traditional longitudinal data (where the individual obser-
vation is a scalar), which are analyzed using single-level functional approaches [283, 334].
They are also different from functional ANOVA [305], which allows for different functional
means of groups, but have only one source of functional variability around these means.
Multilevel functional data raise many new methodological problems that we organize in
the following three categories:
unit (e.g., study participant) and the predictors are a set of scalar variables. This
problem is increasingly common in different applications. For example, suppose
that we are interested in modeling the effects of age, gender, and day of the week
on physical activity intensity at different times of the day in the NHANES popu-
lation. The outcome may be the collection of objective physical activity functions
measured during multiple days (each day is a function, each day is a repetition
of the measurement process). Functional mixed models have been introduced to
address this type of structure for single-level [114] and multilevel functional data
[27, 207]. The general idea is to start with the traditional mixed effects model
structure and expand it to functional measurements. While a substantial litera-
ture exists for the estimation and inference in functional mixed models, we focus
on two methods for which ready-to-use R software is available: Functional Addi-
tive Mixed Models (FAMM) [263] and Fast Univariate Inference (FUI) [57].
3. Multilevel scalar-on-function regression, which involves regressing a scalar re-
sponse on a set of multilevel functional predictors. We extend the PFR approach
used to fit a functional linear model in Chapter 4 by adding random terms and
introduce the longitudinal penalized functional regression (LPFR) [103] method.
FIGURE 8.1: Physical activity data structure in NHANES 2011-2014. Each study partici-
pant is shown in one column and each row corresponds to a day of the week from Sunday
to Saturday. The x-axis in each panel is time in one-minute increments from midnight to
midnight.
$$W_{im}(s) = X_{im}(s) + \epsilon_{im}(s) = \mu(s) + \eta_m(s) + U_i(s) + V_{im}(s) + \epsilon_{im}(s)\;, \qquad (8.1)$$
where $\mu(s)$ is the population mean function, $\eta_m(s)$ is the $m$th visit-specific shift from $\mu(s)$, $U_i(s)$ is the $i$th subject-specific deviation, $V_{im}(s)$ is the $m$th visit-specific residual deviation from $U_i(s)$, and $\epsilon_{im}(s)$ is a white noise process with constant variance $\sigma^2_\epsilon$ across $S$. In
the NHANES study, µ(s) and ηm (s) are treated as fixed functions, which is a reasonable
assumption since NHANES contains over 12,000 study participants with multiple days of
data. The random functions Ui (s) capture the between-subject variation and are modeled
as mean 0 Gaussian Processes with covariance function KU (s, t) = cov{Ui (s), Ui (t)}. The
random functions Vim (s) capture the within-subject variation and are treated as mean 0
Gaussian Processes with covariance function KV (s, t) = cov{Vim (s), Vim (t)}. We further
assume that Ui (s) and Vim (s) are mutually uncorrelated. Model (8.1) is the “hierarchical
functional model” introduced in [208], which is a particular case of the “functional mixed
models” introduced in Section 8.3.
Define $K_X(s,t) = \mathrm{cov}\{X_{im}(s), X_{im}(t)\}$ the total covariance function of the smoothed functional data, $X_{im}(s)$, which satisfies $K_X(s,t) = K_U(s,t) + K_V(s,t)$. Since the between-subject covariance function $K_U(s,t)$ is continuous symmetric non-negative definite, Mercer's theorem [199] ensures the eigendecomposition $K_U(s,t) = \sum_{k \ge 1}\lambda_k^{(1)}\phi_k^{(1)}(s)\phi_k^{(1)}(t)$, where $\lambda_1^{(1)} \ge \lambda_2^{(1)} \ge \cdots \ge 0$ are non-negative eigenvalues with associated orthonormal eigenfunctions $\phi_k^{(1)}(s)$. That is, $\int_S \phi_{k_1}^{(1)}(s)\phi_{k_2}^{(1)}(s)\,ds = 1_{\{k_1 = k_2\}}$, where $1_{\{\cdot\}}$ is the indicator function. It follows from the Kosambi-Karhunen-Loève (KKL) theorem that $U_i(s) = \sum_{k \ge 1}\xi_{ik}\phi_k^{(1)}(s)$, where $\xi_{ik}$ is the score of the $i$th subject on the $k$th principal component with mean 0 and variance $\lambda_k^{(1)}$. Similarly, the within-subject covariance function has eigendecomposition $K_V(s,t) = \sum_{k \ge 1}\lambda_k^{(2)}\psi_k^{(2)}(s)\psi_k^{(2)}(t)$, where $\lambda_1^{(2)} \ge \lambda_2^{(2)} \ge \cdots \ge 0$ are non-negative eigenvalues with associated orthonormal eigenfunctions $\psi_k^{(2)}(s)$, and $V_{im}(s) = \sum_{k \ge 1}\zeta_{imk}\psi_k^{(2)}(s)$, where $\zeta_{imk}$ are scores with mean 0 and variance $\lambda_k^{(2)}$ and are mutually uncorrelated. Model (8.1) now becomes
$$W_{im}(s) = \mu(s) + \eta_m(s) + \sum_{k \ge 1}\xi_{ik}\phi_k^{(1)}(s) + \sum_{k \ge 1}\zeta_{imk}\psi_k^{(2)}(s) + \epsilon_{im}(s)\;, \qquad (8.2)$$
where $\mu(s)$, $\eta_m(s)$, $\phi_k^{(1)}(s)$, $\psi_k^{(2)}(s)$ are fixed functional effects and $\xi_{ik}$, $\zeta_{imk}$ are uncorrelated random variables with mean 0. The main advantage of this decomposition is that the principal component decomposition at both levels substantially reduces the dimensionality of the problem.
1. Estimate the mean µ(s) and the visit-specific functions η_m(s) using univariate smoothers under the working independence assumption. The choice of the smoother is flexible and has small effects on the estimator when the sample size is large, such as in the NHANES example. Popular smoothers include penalized spline smoothing [258] and local polynomial smoothing [76]. Penalized spline smoothing was adopted in the R implementation of MFPCA. Denote W̃_{im}(s) = W_{im}(s) − µ̂(s) − η̂_m(s), where µ̂(s) and η̂_m(s) are the estimators of µ(s) and η_m(s), respectively.
2. Construct method of moments (MoM) estimators of the total covariance function K_X(s,t) and the between-subject covariance function K_U(s,t). If W̃_{im}(s) is the centered data obtained in Step 1, the estimator of K_X(s,t) is Ĝ_X(s,t) = Σ_{i=1}^I Σ_{m=1}^{M_i} W̃_{im}(s) W̃_{im}(t) / Σ_{i=1}^I M_i, and the estimator of K_U(s,t) is Ĝ_U(s,t) = 2 Σ_{i=1}^I Σ_{m_1 < m_2} W̃_{im_1}(s) W̃_{im_2}(t) / Σ_{i=1}^I M_i(M_i − 1). Note that Ĝ_X(s,t) is not an unbiased estimator of K_X(s,t) on the main diagonal (s = t) due to the measurement error ε_{im}(s). Indeed, cov{W_{im}(s), W_{im}(t)} = cov{X_{im}(s), X_{im}(t)} + σ²_ε 1_{s=t}. For K_U(s,t) this is not a problem, since cov{W_{im_1}(s), W_{im_2}(t)} = cov{X_{im_1}(s), X_{im_2}(t)}.
3. Obtain the smooth estimators K̂_X(s,t) and K̂_U(s,t) of K_X(s,t) and K_U(s,t), respectively. This step is achieved by applying a bivariate smoother to the off-diagonal elements of Ĝ_X(s,t) and to the entire matrix Ĝ_U(s,t) obtained in Step 2, respectively. The idea of dropping the diagonal elements of Ĝ_X(s,t) was introduced by [283] to avoid the extra measurement error variance along the diagonal. The choice of bivariate smoother in this step is flexible, including thin plate penalized splines [313] and fast bivariate penalized splines [330]. Finally, the within-subject covariance estimator of K_V(s,t) is obtained as K̂_V(s,t) = K̂_X(s,t) − K̂_U(s,t). To ensure that covariance estimators are positive semi-definite, their negative eigenvalues are truncated to 0; see, for example, [336].
4. Conduct eigenanalysis on K̂_U(s,t) to obtain λ̂_k^{(1)} and φ̂_k^{(1)}(s), and on K̂_V(s,t) to obtain λ̂_k^{(2)} and ψ̂_k^{(2)}(s). In practice, this is achieved by performing eigendecompositions of the estimated covariance matrices. Computational details can be found in Section 2.1. An important decision is how to choose the number of eigenfunctions at each level. A simple approach is to use the percent of explained variance. Specifically, denote by P_1 the target proportion of variability explained at level 1. The number of level-1 components is chosen as N_1 = min{l : ρ_l^{(1)} ≥ P_1}, where ρ_l^{(1)} = Σ_{k=1}^l λ̂_k^{(1)} / Σ_{k≥1} λ̂_k^{(1)} is the proportion of the estimated variance explained by the first l components. The number of components at level 2, N_2, is chosen similarly; a short numerical sketch of this step is given after this list.
5. Estimate the measurement error variance σ²_ε. Given the smooth estimator K̂_X(s,s) obtained in Step 3 and the estimator Ĝ_X(s,s) obtained in Step 2, the error variance is estimated as σ̂²_ε = ∫_S {Ĝ_X(s,s) − K̂_X(s,s)} ds.
6. Predict the principal component scores ξ_ik and ζ_imk using Markov Chain Monte Carlo (MCMC) or best linear unbiased prediction (BLUP). The score prediction is a more challenging problem in MFPCA than in single-level FPCA. For single-level FPCA, the scores can be obtained by direct numerical integration, which is easy to implement. However, this approach cannot be directly extended to multilevel functional data because the functional bases at the two levels, {φ_k^{(1)}(s)} and {ψ_k^{(2)}(s)}, are not mutually orthogonal. MFPCA addresses this problem using mixed effects model inference. Indeed, assume that we have obtained the estimates of µ(s), η_m(s), N_1, N_2, λ_k^{(1)}, λ_k^{(2)}, φ_k^{(1)}(s), ψ_k^{(2)}(s), and σ²_ε following Steps 1-5 above. The MFPCA model becomes

W_{im}(s) = µ(s) + η_m(s) + Σ_{k=1}^{N_1} ξ_{ik} φ_k^{(1)}(s) + Σ_{k=1}^{N_2} ζ_{imk} ψ_k^{(2)}(s) + ε_{im}(s),    (8.3)

where ξ_{ik} ∼ N(0, λ_k^{(1)}) and ζ_{imk} ∼ N(0, λ_k^{(2)}) can be viewed as mutually independent random effects and ε_{im}(s) ∼ N(0, σ²_ε) are mutually independent residuals. If
the number of principal components at level 1 and 2 is relatively small, the mixed
effects model (8.3) remains relatively simple. The assumption of independence of
random effects further reduces the methodological and computational complexity
of the model. The Gaussian distribution assumption of scores (random effects)
and errors is convenient, though other distributions could be assumed. As model
(8.3) is a linear mixed model where ξik and ζimk are random effects, standard
mixed model inferential approaches can be used, including Bayesian MCMC [62]
and BLUP [63].
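As a small illustration of Steps 4 and 5, the sketch below performs the eigenanalysis of a smoothed between-subject covariance matrix and selects the number of level-1 components by the proportion of variance explained. This is our own illustration rather than the book's code; the objects KU_hat (the smoothed covariance K̂_U evaluated on an equally spaced grid of p points) and delta (the grid spacing) are assumed to be available.

# Our own sketch of Steps 4-5, assuming KU_hat is the smoothed between-subject
# covariance matrix on an equally spaced grid with spacing delta
eig  <- eigen(KU_hat, symmetric = TRUE)
lam1 <- pmax(eig$values, 0) * delta     # level-1 eigenvalues; negative values truncated to 0
phi1 <- eig$vectors / sqrt(delta)       # level-1 eigenfunctions, orthonormal in L2

P1 <- 0.9                               # target proportion of variance explained at level 1
N1 <- min(which(cumsum(lam1) / sum(lam1) >= P1))

The same computation applied to the smoothed within-subject covariance yields N_2 and the level-2 eigenfunctions.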
The MFPCA is one of the simplest multilevel functional models. It also provides an ex-
plicit decomposition of variability using latent processes based on the same philosophy used
in traditional mixed effects modeling. Understanding its structure and designing reasonable
inferential approaches can help in more complex situations. An important characteristic of
MFPCA is that each inferential step is practical and does not require highly specialized soft-
ware. For example, we can use any type of univariate smoother to estimate the population
mean function in Step 1 or bivariate smoother to estimate the covariance functions in Step
3. We have not identified major differences in performance between reasonable smoothers.
However, here we focus on non-parametric smoothers obtained from a rich spline basis plus
appropriate quadratic penalties.
8.2.1.3 Implementation in R
We now show how to implement MFPCA for the NHANES data set in R. As introduced in Section 8.1, physical activity data were collected for each participant at every minute of each eligible day. Figure 8.2 displays the data storage format in R. The data frame nhanes_ml_df consists of 79,910 rows and 6 columns. Each row represents the data for one day of the week for one study participant. The total number of days is 79,910 in this analysis. The
SEQN column stores the unique study participant identifier in a vector. Notice that SEQN is
repeated for each row (day) that corresponds to the same study participant. The dayofwear
column stores the day-of-wear information in a vector, where, for example, a value of 2
corresponds to the second day of wearing the device. The dayofweek column stores the
day of the week information in a vector, where “1” represents “Sunday,” “2” represents
“Monday,” . . . , and “7” represents “Saturday.” For each study participant, the starting day
of the week of wearing the device is not necessarily “1” (Sunday). The MIMS column stores
the physical activity data in a matrix with 1,440 columns, each column corresponding to
a minute of the day starting from midnight. This is achieved using the I() function. The
column names are "MIN0001", "MIN0002", . . . , "MIN1440", representing the time of day
from midnight to midnight. For example, for study participant SEQN 62161 on Sunday, the
MIMS value at 12:00-12:01 AM is stored in the first column (column name "MIN0001") of
the first row of MIMS matrix. For each study participant, the age and gender information
are stored in age and gender columns, respectively. We will use this format to store and
model multilevel functional data in R throughout this chapter.
FIGURE 8.2: An example of the multilevel data structure in R, where accelerometry data
is a matrix of observations stored as a single column of the data frame.
The code below shows how to implement MFPCA on this NHANES data set to de-
compose the within-subject and between-subject variability of physical activity using the
mfpca.face() [58] function from the refund package. This function complements and sub-
stantially improves the previous function mfpca.sc() [62], especially from a computational
perspective. The new function allows the use of much higher dimensional functions by incor-
porating fast covariance estimation (FACE, [331]) into the estimation; see [58] for technical
details. For example, fitting MFPCA using mfpca.face() on the NHANES data takes less than a minute on a standard laptop, a remarkable reduction compared to the function mfpca.sc() [62], which did not finish (computation was stopped after 24 hours).
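A minimal sketch of such a call is given below; this is not the book's exact code, and the argument names (Y, id, visit, twoway, pve) follow the mfpca.face() documentation and should be checked against ?mfpca.face for the installed version of refund.

# A minimal sketch of MFPCA on the NHANES multilevel data frame (not the
# book's exact code); argument names follow the mfpca.face() documentation
library(refund)

mfpca_fit <- mfpca.face(
  Y      = as.matrix(nhanes_ml_df$MIMS),   # N x 1440 matrix of minute-level MIMS
  id     = nhanes_ml_df$SEQN,              # study participant identifier (level 1)
  visit  = nhanes_ml_df$dayofweek,         # repeated days within participant (level 2)
  twoway = TRUE,                           # estimate visit-specific means eta_m(s)
  pve    = 0.9                             # proportion of variance explained
)

# The returned object contains the estimated mean functions, the level-1 and
# level-2 eigenfunctions and eigenvalues, and the predicted scores
str(mfpca_fit, max.level = 1)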
FIGURE 8.3: Estimated population mean function µ(s) and day-of-the-week-specific mean
function µ(s) + ηm (s) in the NHANES 2011-2014 dataset using MFPCA. The population
mean function is shown as a black solid line. The weekend-specific (Saturday, Sunday) curves
are shown as dashed lines. The weekday-specific curves are shown as dotted lines.
FIGURE 8.4: The first three estimated level-1 principal components in the NHANES 2011-
2014 dataset using fast MFPCA. The proportion of variance explained out of the level-1
variance by each component is shown in the title of each panel.
FIGURE 8.5: The average physical activity trajectories by tertile of the scores on the first three estimated level-1 principal components. For each panel, the red curve represents the average across individuals whose scores are in the first tertile, the blue curve the average for those in the second tertile, and the orange curve the average for those in the third tertile.
The interpretation is that people with positive scores on this component are more active
during the day and less active during the night. The second level-1 principal component
explains 18.60% of the variability at this level and is negative only between 5 AM and 12
PM. People with positive scores on this component are less active in the morning and more
active at other times of the day. The third level-1 principal component explains 6.26% of the
variability at this level and is negative before 10 AM and after 8 PM and positive between
10 AM and 8 PM. People with positive scores on this component have lower activity during
the night and early morning period, which could correspond to disrupted sleep, substan-
tial changes from regular sleep hours, or night-shift work. The first three level-1 principal
components explain roughly 80% of the variability at this level.
To further illustrate these estimated principal components in our data, Figure 8.5 shows
the average physical activity curves by tertile of the scores of each principal component (the
first PC to the third PC from left to right). For example, in the left panel, we partition
the scores on the first PC into tertiles. The average minute-level physical activity intensity
across individuals whose scores are in the first, second, and third tertile are shown in red,
blue, and orange, respectively. From the left panel, we observe higher physical activity
intensity during the day and lower at night for people whose scores are in the third tertile.
These results are consistent with the left panel in Figure 8.4.
Similarly, the middle panel shows that individuals with scores in the highest tertile for
the second level-1 PC (orange curve) tend to be more active in the morning with a slow
decrease during the day and a very fast decrease in the evening. Study participants who
are in the lowest tertile for the second level-1 PC (red curve) are, on average, less active
in the morning but more active in the second part of the day compared with individuals
with scores in the highest tertile for the second level-1 PC. Moreover, they tend to be more
active during the night, which may indicate substantial change in their circadian patterns
of activity.
The right panel in Figure 8.5 compares the average daily activity profiles by the tertiles
of scores on the third level-1 principal component. The average profile of study participants
FIGURE 8.6: The first three estimated level-2 principal components in the NHANES 2011-
2014 dataset using fast MFPCA. The proportion of variance explained out of the level-2
variance by each component is shown in the title of each panel.
in the highest tertile (orange) tends to be lower during the night, with a more rapid increase
in physical activity intensity after 6 AM, generally higher PA between 10 AM and 6 PM,
and a more pronounced decline after 8 PM.
Variability within level-2 principal components is much more evenly spread out and
does not have the same quick drop in variance explained exhibited by level-1 principal
components. This could be due to the fact that day-to-day differences may, at least in part,
be due to de-synchronized physical activities. For example, a person may brush their teeth
at 7 AM one morning and at 7:30 AM another. As we have discussed in Section 3.4, de-synchronization of functional data can lead to reduced interpretability of principal components and a slow decrease in the variance explained. In some sense, this is reasonable
and indicates that the average subject-specific daily PA trajectories can be classified in a
relatively small number of subgroups. In contrast, that may not be possible for the day-to-
day variations.
Indeed, the first three level-2 principal components explain only 12.27%, 9.76%, and
7.09% of the variability at level 2, respectively. It is still instructive to plot these components
and interpret them. The interpretation of level-2 principal components is different from that
of level-1 principal components. Indeed, first level is focused on the variability patterns of
activity between individuals, while the second level quantifies the variability from day to
day within an individual. Notice, for example, that the first level-2 principal component is
positive between 6 AM and 12 AM. This indicates that on days when a person has higher
scores on this component, the individual is more active during that day and less active
during that night compared to their average.
models where SFPCA is applicable. In this section, we focus on two more complex sampling
scenarios: two-way crossed and three-way nested designs.
W_{im}(s) = µ(s) + η_m(s) + U_i(s) + V_{im}(s) + ε_{im}(s).    (8.4)
In Section 8.2.1, we assumed that ηm (s) is a fixed effect function and was estimated
by smoothing the difference between the visit-specific average function and the population
average function. This choice makes sense when the number of functional observations
per study participant, Mi , is small and measurements are obtained after varying only one
experimental condition (e.g., measuring physical activity data for multiple days within one
individual). When multiple experimental conditions are changed there may be interest in
analyzing the amount and patterns of variability between and within each experimental
condition. For example, in clinical trials individuals can be repeatedly “crossed-over” from
one treatment to another. In task fMRI studies [181], brain activity is monitored for study
participants using repetitions of different tasks and rest. In a phonetic study described in
[8, 273], the fundamental frequency (F0) of syllables from 19 nouns (experimental condition
1) was recorded from 8 native speakers (experimental condition 2) of the Luobozhai Qiang
dialect in China.
To account for such data structures, model (8.4) can be modified to become a two-way
crossed design if we assume that ηm (s) and Ui (s) are random. For notation consistency, we
replace ηm (s) with Zm (s) to denote a random effect function. The model can be rewritten
as
W_{im}(s) = µ(s) + U_i(s) + Z_m(s) + V_{im}(s) + ε_{im}(s),    (8.5)
where Ui (s) and Zm (s) are two uncorrelated processes with interaction Vim (s). Model (8.5)
is the functional equivalent of the two-way crossed-design random effects ANOVA model
[152]. We use the name “crossed design” because the model allows crossing of the levels.
Unsurprisingly, such modeling extensions create additional technical challenges. We now provide the technical details associated with fitting the crossed-design functional model. Since the fixed effect µ(s) can be estimated by smoothing the population mean function, without loss of generality we assume that the data are demeaned and focus on the random effects. For simplicity, we consider the noise-free model W_{im}(s) = U_i(s) + Z_m(s) + V_{im}(s). The solution for accounting for the noise component, ε_{im}(s), is similar to what was done in the MFPCA approach and is described in detail in [273]. Denote by K_U(s,t) = E{U_i(s)U_i(t)}, K_Z(s,t) = E{Z_m(s)Z_m(t)}, and K_V(s,t) = E{V_{im}(s)V_{im}(t)} the covariance operators of the mutually uncorrelated mean-zero random processes U_i(s), Z_m(s), V_{im}(s), respectively. Denote by φ_k^U(s), φ_k^Z(s), φ_k^V(s) the eigenfunctions of the corresponding covariance operators K_U, K_Z, K_V. Using the Kosambi-Karhunen-Loève (KKL) expansion, the model becomes

W_{im}(s) = Σ_{k≥1} ξ_{ik}^U φ_k^U(s) + Σ_{k≥1} ξ_{mk}^Z φ_k^Z(s) + Σ_{k≥1} ξ_{imk}^V φ_k^V(s),    (8.6)

where ξ_{ik}^U, ξ_{mk}^Z, ξ_{imk}^V are mutually independent random variables with mean 0 and variances λ_k^U, λ_k^Z, λ_k^V, respectively.
The main differences between a two-way crossed design SFPCA and MFPCA are (1) the
function ηm (s) (Zm (s) after change of notation) is assumed to be a fixed effect in MFPCA
and a random effect in SFPCA; and (2) MFPCA has two covariance operators, whereas
two-way crossed design SFPCA has three. One major contribution of SFPCA was to show
that all MoM estimators of these covariance operators have a “sandwich” form. To illustrate
this idea, we next introduce some technical, but necessary, matrix notation.
Denote by W = (W_{11}, ..., W_{1M_1}, ..., W_{n1}, ..., W_{nM_n}) the p × N matrix obtained by column binding the N = Σ_{i=1}^n M_i vectors W_{im} = {W_{im}(s_1), ..., W_{im}(s_p)}^t of dimension p × 1. For the noise-free model, we have

K̂_Z = (Ĥ_{UZ} − Ĥ_U)/2 := W G_Z W^t;
K̂_U = (Ĥ_{UZ} − Ĥ_Z)/2 := W G_U W^t;    (8.9)
K̂_V = (Ĥ_Z + Ĥ_U − Ĥ_{UZ})/2 := W G_V W^t,

which all have the “sandwich” form, W G W^t. The matrix G for each process can be easily obtained from (8.8).
Here we provided the formula for the two-way crossed design. For more general multi-way
crossed designs, the SFPCA paper [273] showed that the MoM estimators of covariance op-
erators can always be written in “sandwich” form. After obtaining the covariance estimates,
the eigenanalysis and score calculation of SFPCA follow the same principles as MFPCA.
within colonic crypts of rats that were further nested within diet groups. In a US study
of hospitalization rates [178], dialysis hospitalizations were nested within dialysis facilities,
which were further nested within geographic regions.
To model such data structures we introduce the three-way nested design. The de-meaned
noise-free model is
Wimk (s) = Ui (s) + Zim (s) + Vimk (s) , (8.10)
where i = 1, . . . , n, m = 1, . . . , Mi , k = 1, . . . , Kim , and Ui (s), Zim (s), and Vimk (s) are
three latent uncorrelated processes. Here we focus on the covariance estimation, as all other
model fitting steps are similar to MFPCA. Note that
E[{W_{imk}(s) − W_{luv}(s)}{W_{imk}(t) − W_{luv}(t)}]    (8.11)
  = 2K_V(s,t)                              if i = l, m = u, k ≠ v;
  = 2{K_V(s,t) + K_Z(s,t)}                 if i = l, m ≠ u;    (8.12)
  = 2{K_V(s,t) + K_Z(s,t) + K_U(s,t)}      if i ≠ l.

Let H_V = 2K_V, H_Z = 2(K_V + K_Z), and H_U = 2(K_V + K_Z + K_U).
Denote by W = (W_{111}, ..., W_{11K_{11}}, ..., W_{nM_n1}, ..., W_{nM_nK_{nM_n}}) the p × N matrix obtained by column binding the N = Σ_{i=1}^n Σ_{m=1}^{M_i} K_{im} vectors W_{imk} = {W_{imk}(s_1), ..., W_{imk}(s_p)}^t of dimension p × 1. Let N_{i·} = Σ_{m=1}^{M_i} K_{im}, k_1 = Σ_{i,m} K_{im}², and k_2 = Σ_i N_{i·}². We then define D_1 = diag{K_{11}, ..., K_{nM_n}}, where K_{im} = K_{im} I_{K_{im}}; D_2 = diag{N_1, ..., N_n}, where N_i = N_{i·} I_{N_{i·}}; E_1 = diag{1^t_{K_{11}}, ..., 1^t_{K_{nM_n}}}; and E_2 = diag{1^t_{N_{1·}}, ..., 1^t_{N_{n·}}}. With this notation, the MoM estimators can be written as

Ĥ_V = {1/(k_1 − N)} Σ_{i,m} Σ_{k≠v} (W_{imk} − W_{imv})(W_{imk} − W_{imv})^t
    = {2/(k_1 − N)} W(D_1 − E_1^t E_1) W^t,

Ĥ_Z = {1/(k_2 − k_1)} Σ_i Σ_{m≠u} Σ_{k,v} (W_{imk} − W_{iuv})(W_{imk} − W_{iuv})^t    (8.13)
    = {2/(k_2 − k_1)} W(D_2 − E_2^t E_2 − D_1 + E_1^t E_1) W^t,

Ĥ_U = {1/(N² − k_2)} Σ_{i≠l} Σ_{m,u,k,v} (W_{imk} − W_{luv})(W_{imk} − W_{luv})^t
    = {2/(N² − k_2)} W(N I_N − 1_N 1_N^t − D_2 + E_2^t E_2) W^t.

Hence, we obtain K̂_V = Ĥ_V/2, K̂_Z = (Ĥ_Z − Ĥ_V)/2, and K̂_U = (Ĥ_U − Ĥ_Z)/2. SFPCA
also provided the formula of covariance estimators for general multi-way nested designs; see
Appendix of [273] for details. The R code to fit a three-way nested model using SFPCA is
available in the supplementary material of [273]. This R code is not deployed as software and
may require additional work for specific applications. However, the modeling infrastructure
exists.
FIGURE 8.7: NHANES data with a multilevel functional mixed effects model (FMM) struc-
ture. Each column displays information for one participant, including physical activity data
collected on each day of the week. The day of wear is also displayed in each panel. For example,
“Day 3” indicates the third day when the device was worn and could fall on any day of
the week (e.g., Tuesday for study participant 62161 and Wednesday for study participant
83727). Age and gender of study participants are shown in the last row.
fixed effect that accounts for other variables (e.g., age, gender) and a complex structure
of the residual functional variance (e.g., nesting of functional curves within study partici-
pants). Figure 8.7 provides an example, where the NHANES data has a multilevel functional
structure and additional covariates. The last row displays two of the covariates, age and
gender, for study participants. For example, participant SEQN 62161 was a 22-year-old male
at the time of the study. In addition to the physical activity data displayed in Figure 8.1,
NHANES also collected day-of-wear information for each individual. For example, partic-
ipant SEQN 62163 started to wear the device on Sunday (“Day 1” displayed on Sunday’s
physical activity profile).
Functional mixed models (FMMs) are extensions of mixed effects models to functional
data. They provide a useful framework that allows the explicit separation of different sources
of observed variability: (1) fixed effects that may depend on the functional index (e.g., time);
and (2) functional random effects that have a known structure (e.g., nested or crossed
within the same sampling unit). The functional ANOVA model introduced in Section 8.2 is
a particular case of FMM, where the visit indicator has a fixed time-varying effect and the
functional residuals have a two-level nested structure.
Assume that the functional data are of the type Wim (s) on an interval S, where i =
1, . . . , n is the index of the study participant and m = 1, . . . , Mi is the index of the visit. In
this section we focus on two-level functional data, the simplest multilevel functional data.
Let Xim = (Xim1 , . . . , XimQ )t be the Q fixed effects variables and Zim = (Zim1 , . . . , ZimR )t
be the R random effects variables. The FMM can be written as

W_{im}(s) = X_{im}^t β(s) + Z_{im}^t u_i(s) + ε_{im}(s),    (8.14)

where β(s) = {β_1(s), ..., β_Q(s)}^t are the Q fixed effects functions and u_i(s) = {u_{i1}(s), ..., u_{iR}(s)}^t are the R random effects functions corresponding to the ith study participant.
The structure of the residuals im (s) should be informed by and checked on the data.
For example, if data can be assumed to be independent after accounting for the subject-
specific mean, im (s) can be assumed to be independent. However, in many studies that
contain different levels of functional variability, this assumption is too stringent. Let us
consider again the NHANES data and the de-meaned data W_{im}(s) − W̄_{i·}(s), where s = 1, ..., 1440 and W̄_{i·}(s) = M_i^{-1} Σ_{m=1}^{M_i} W_{im}(s) is the mean over the M_i visits for study participant i. Figure 8.8 displays these de-meaned visit-specific functions for three NHANES study participants i ∈ {62161, 62163, 83727}. Here we abused the notation a bit, as i runs between 1 and the total number of study participants, whereas here we refer to the specific NHANES subject identifiers. Alas, we hope that this is the biggest problem with our notation.
Note that even after subtraction, the functions still exhibit substantial structure and
correlation patterns across the functional domain, especially during nighttime and daytime.
FIGURE 8.8: The visit-specific functions for three NHANES study participants after sub-
tracting the subject-specific means. Each panel displays information for one participant.
Within each panel, each line represents the function on one day of the week.
Therefore, in the NHANES example, one cannot simply assume that after removing the
study participant mean, the residuals are independent.
The outcome in (8.14) is a function, W_{im}(s), and the component X_{im}^t β(s) could be viewed as the “scalar regression” structure. Indeed, the components of X_{im} are scalar predictors (e.g., age and gender). For this reason, model (8.14) can be viewed as an example of function-on-scalar regression [251]. An important difference here is that W_{im}(s) is indexed both by i (e.g., study participant) and m (e.g., visit within study participant). Just as in standard mixed effects models, the component Z_{im}^t u_i(s) captures the within-person variability and its structure follows the known sampling mechanisms. As discussed earlier, a major difference between model (8.14) and standard mixed effects models is that one cannot assume that ε_{im}(s) are independent across s ∈ S. Thus, the FMM model (8.14) is sometimes referred to in the literature as the “multilevel function-on-scalar regression model.”
As discussed in Chapter 5, model (8.14) with independent ε_{im}(s) was known under different names and was first popularized by [242, 245] who introduced it as a functional linear
model with a functional response and scalar covariates; see Chapter 13 in [245]. In Chapter 5
we used the FoSR nomenclature introduced by [251, 253], which refers directly to the type
of outcome and predictor. It is difficult to pinpoint where these models originated, but they
were first introduced as linear mixed effects (LME) models for longitudinal data [161], and
then as functional models in recognition of new emerging datasets with denser and more
complex sampling mechanisms [27, 79, 114, 243, 244]. The model with correlated ε_{im}(s) was likely introduced by [207], though other papers contained multilevel functional structures.
Many methods have been developed to estimate fixed and random effects of FMM, such as
[57, 106, 114, 207, 263]. Here we focus on two approaches with well-documented and easy-
to-use software implementations, namely the Functional Additive Mixed Model (FAMM,
[263]) and Fast Univariate Inference (FUI, [57]).
W_{im}(s) = β_0(s) + X_{im} β_1(s) + u_{i0}(s) + Z_{im} u_{i1}(s) + ε_{im}(s).    (8.15)
between the expansion of the functional fixed and random effects, respectively. This is not
strictly necessary and adds precision, but makes notation particularly messy.
With these choices the model becomes

W_{im}(s) = Σ_{k=1}^{K_0^β} β_{0k} B_{0k}^β(s) + Σ_{k=1}^{K_1^β} β_{1k} X_{im} B_{1k}^β(s) + Σ_{k=1}^{K_0^u} d_{i0k} B_{0k}^u(s) + Σ_{k=1}^{K_1^u} d_{i1k} Z_{im} B_{1k}^u(s) + ε_{im}(s).    (8.16)
To avoid overfitting, a smoothing penalty is imposed on the spline coefficients for each
function. The form of the penalty matrix varies by model assumptions and was discussed
in [263]. The large number of basis functions and penalties make the estimation of (8.16)
challenging. Just as with other approaches presented in this book, the approach to fitting is
based on the equivalence between penalized splines and mixed models ([258]). Specifically,
the spline coefficients β0k , β1k , di0k , di1k in model (8.16) can be viewed as random effects in
a mixed model; see Section 2.3.3. Therefore, model (8.16) can be fit using computational
approaches for mixed effects models.
However, fitting (8.16) is not easy as the design matrix of (8.16) has a structure that
cannot be simplified. Although FAMM is computationally infeasible for large data sets such
as NHANES, in Section 8.3.3 we introduce the pffr() function in refund and provide the
syntax for the NHANES application example. However, in many other applications with
smaller data sets, FAMM is feasible, provides exceptional modeling flexibility, and provides
a principled approach to inference in this context.
1. Fit a pointwise linear mixed model at each location s_j of the functional domain:

W_{im}(s_j) = β_0(s_j) + X_{im} β_1(s_j) + u_{i0}(s_j) + Z_{im} u_{i1}(s_j) + ε_{im}(s_j).

The major difference between this model and formula (8.15) is that here s_j is a specific location. Therefore, it is just a mixed model that can be fit using any mixed model software, such as the lme4::lmer() function [10]. For non-Gaussian data, the pointwise LMM estimate is replaced by the pointwise GLMM estimate. Denote the estimated fixed effects by β̂_l(s_1), ..., β̂_l(s_p), l = 0, 1, and the estimated random effects by û_il(s_1), ..., û_il(s_p), l = 0, 1.
2. Smoothing the estimated fixed effects and random effects along the functional
domain. The choice of smoothers is flexible, including not smoothing and simply
taking the average along the domain. The smoothed estimators are denoted by {β̂_l(s), s ∈ S} and {û_il(s), s ∈ S}, respectively.
3. For Gaussian data, the pointwise and CMA confidence intervals for functional
fixed effects can be obtained analytically. For both Gaussian and non-Gaussian
data, inference could be conducted using the nonparametric bootstrap. Building
prediction confidence intervals for functional random effects was still work in
progress at the time of writing this book.
The key insight of FUI is to model between-subject and within-subject correlations
separately. This marginal approach has several computational advantages. For example,
fitting massively univariate LMMs in step 1 and bootstrapping inference in step 3 can be
easily parallelized. Another advantage is that the method is “read-and-use,” as it can be
implemented by anyone who is familiar with mixed model software. In Section 8.3.3, we
show how FUI can be implemented using the fastFMM::fui() function for the NHANES
example.
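The pffr() code block itself is not reproduced here; below is a hedged sketch of what such a call might look like on a small subset of participants. The subset object nhanes_sub, the use of s(SEQN, bs = "re") for a subject-specific functional random intercept, and the yind argument follow the pffr() documentation but should be treated as assumptions and checked against ?refund::pffr.

# A hedged sketch of a FAMM fit with refund::pffr() on a small subset of the
# NHANES data (our own illustration, not the book's exact code)
library(refund)

nhanes_sub      <- nhanes_ml_df[1:700, ]     # hypothetical subset (roughly 100 participants)
nhanes_sub$SEQN <- factor(nhanes_sub$SEQN)   # grouping factor for the random intercept

fit_famm <- pffr(
  MIMS ~ age + gender + s(SEQN, bs = "re"),  # scalar fixed effects + functional random intercepts
  yind = 1:1440,                             # functional domain: minute of the day
  data = nhanes_sub
)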
This is a simplified implementation of FAMM, which assumes that ε_{im}(s) are uncorrelated across time s, which is a strong assumption. Indeed, in NHANES this essentially assumes that PA observations within a day are independent after removing the participant-
specific mean. Unfortunately, even with the simplified (misspecified) model, FAMM could
not run on the entire NHANES data set. One could likely use sub-sampling approaches and
then combine the resulting fits by taking averages. The performance of such an approach is
currently unknown.
To test the feasibility of this idea, FAMM was used on sub-samples of the NHANES
data. This took around 2 minutes for 100 participants and 15 minutes for 200 participants
using a standard laptop. However, for a sample of 1,000 participants, computation did not finish (the program was stopped after 6 hours due to memory problems). Thus, FAMM cannot be fit using existing software on the full NHANES data set.
More details on the computational limitations of current methods can be found in [57].
However, FUI worked well on the entire data set using a standard laptop, and took less
than 2 minutes to obtain the point estimates. This is due to the fact that FUI fits 1440
univariate linear mixed models. This is computationally convenient and could be easily par-
allelized to further reduce computational time. For the jth minute of the day, the code to fit
a linear mixed model using the lme4::lmer() function and extract fixed effects estimates
is shown below.
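The book's code block is not reproduced here; the following is a hedged sketch of what such a pointwise fit might look like. The column names follow the nhanes_ml_df data frame described in Section 8.2.1.3, and the particular fixed and random effects specification is our assumption.

# A hedged sketch of the pointwise fit in FUI step 1 for minute j (our own
# illustration): regress minute-j MIMS on age, gender, and day of the week,
# with a random intercept per participant
library(lme4)

j    <- 600  # e.g., the 600th minute of the day (10:00 AM)
df_j <- data.frame(
  Y         = nhanes_ml_df$MIMS[, j],
  age       = nhanes_ml_df$age,
  gender    = nhanes_ml_df$gender,
  dayofweek = factor(nhanes_ml_df$dayofweek),
  SEQN      = factor(nhanes_ml_df$SEQN)
)

fit_j  <- lmer(Y ~ age + gender + dayofweek + (1 | SEQN), data = df_j)
beta_j <- fixef(fit_j)  # fixed effects estimates at minute j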
The point estimates obtained from these massively univariate mixed effects models can
be smoothed using a wide variety of smoothing approaches. This is an advantage of the FUI
approach, as it can be implemented by anyone who is familiar with mixed models.
Instead of manually implementing the FUI method step-by-step as shown above, the
fastFMM::fui() function provides an integrated way of fitting FUI for the NHANES data
set. The syntax is shown below.
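A hedged sketch of such a call is given below; argument names other than formula and data (for example, family and var) are assumptions based on the fastFMM documentation and should be checked against ?fui.

# A hedged sketch of an integrated FUI fit with fastFMM::fui() (our own
# illustration, not the book's exact code)
library(fastFMM)

fui_fit <- fui(
  MIMS ~ age + gender + factor(dayofweek) + (1 | SEQN),  # functional outcome stored as a matrix column
  data   = nhanes_ml_df,
  family = "gaussian",
  var    = TRUE   # also compute pointwise and CMA confidence intervals
)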
Figure 8.9 displays the estimated functional fixed effects in NHANES together with
the 95% pointwise unadjusted (darker gray shaded area) and correlation and multiplicity
adjusted (CMA) (lighter shaded gray area) confidence intervals. Each panel denotes the
effect of one continuous variable or one level of a categorical variable compared to baseline.
For this younger population, as age increases there is a significant increase in physical
activity in the morning. Compared to men, women are more active during the day, and
less active at night. This is in direct contrast to the most cited paper in the field [296], but
consistent with a growing body of literature. Compared to Sundays (the reference category),
people have lower levels of activity on weekdays during the predawn hours (12 AM to
FIGURE 8.9: Estimated functional fixed effects in the NHANES case study using Fast Uni-
variate Inference (FUI). Smoothed estimates are denoted using black solid lines. Pointwise
unadjusted and correlation and multiplicity adjusted (CMA) 95% confidence intervals are
shown as the dark and light gray shaded areas, respectively.
4 AM) and much higher levels of activity in the morning (6 AM to 11 AM). These results
are highly interpretable and are consistent with previous findings using a different NHANES
data set (NHANES 2003–2006) [57].
participant was collected up to December 31, 2019. In addition, demographic variables such as age, gender, and BMI were collected at baseline, and their values do not vary with the day of wearing the device.
Given that the physical activity data were collected across multiple days, one simple
solution is to take the average at each minute of a day and model the compressed data
as independent functional predictors using PFR. However, taking the average may reduce
the predictive performance of functional covariates. The Generalized Multilevel Functional
Regression (GMFR, [52]) provides a solution for such data structures.
Assume that for the ith subject the observed data are {Yi , Zi , {Wim (s), s ∈ S, m ∈
{1, . . . , Mi }}}, where Yi is a continuous or discrete scalar outcome, Zi is a vector of co-
variates, and Wim (s) is the functional predictor at the mth visit. For simplicity we only
introduce one functional predictor, though the estimation procedure can be easily general-
ized to multiple functional predictors. The GMFR model is
W_{im}(s) = µ(s) + η_m(s) + X_i(s) + U_{im}(s) + ε_{im}(s),
Y_i ∼ EF(µ_i, φ),    (8.17)
g(µ_i) = Z_i^t γ + ∫_0^1 X_i(s) β(s) ds.

Here EF(µ_i, φ) denotes an “exponential family” distribution with mean µ_i and dispersion parameter φ. A closer look at model (8.17) reveals that it is a combination of
multilevel variability decomposition and scalar-on-function regression. The first equation is
similar to the two-level functional principal component analysis introduced in Section 8.2.1,
where a subject-specific deviation is denoted as Xi (s). The same term Xi (s) also appears
in the last equation, where we treat it as a functional predictor in a scalar-on-function
regression model.
Fitting (8.17) is not straightforward. One solution is to first conduct MFPCA on the
multilevel functional predictor, replace Xi (s) with its estimate, and fit a standard scalar-on-
function regression model. However, this two-stage procedure ignores the variability from
MFPCA, which may induce bias when fitting the functional regression model. An alternative
is to conduct a joint analysis using, for example, Bayesian posterior simulations. Detailed
discussions on pros and cons of both methods can be found in [52].
where “EF(µ_ij, φ)” denotes an exponential family distribution with mean µ_ij and dispersion φ. The parameters β are the fixed effects to be estimated, and Z_{im}^t b_i is a standard random effects component with b_i ∼ N(0, Σ_b), where 0 is a vector of zeros and Σ_b is the covariance matrix of the random effects b_i. The functional effects are quantified by β_r(s) for the rth functional predictor.
In LPFR, the longitudinal correlation is modeled using random effect terms. Similar to
PFR, functional terms are decomposed as a summation of weighted spline basis functions.
Therefore, formula (8.18) reduces to a generalized linear mixed model. The LPFR model
is implemented in the refund::lpfr() function. An example based on a diffusion tensor imaging (DTI) study of multiple sclerosis patients and healthy controls is provided in the refund package.
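A hedged sketch along the lines of that example is shown below; the argument names (Y, subj, funcs) and the DTI column names (pasat, ID, cca, case) follow the refund documentation and data set, but should be checked against ?lpfr.

# A hedged sketch of refund::lpfr() using the DTI data shipped with refund
# (our own illustration, loosely following the example in ?lpfr)
library(refund)
data(DTI)

# Keep multiple sclerosis cases with complete functional profiles and outcomes
DTI_ms <- DTI[DTI$case == 1 & complete.cases(DTI$cca) & !is.na(DTI$pasat), ]

fit_lpfr <- lpfr(
  Y     = DTI_ms$pasat,   # longitudinal scalar outcome (PASAT score)
  subj  = DTI_ms$ID,      # subject identifier for the random intercept
  funcs = DTI_ms$cca      # longitudinally observed functional predictor (CCA tract)
)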
Another data structure was discussed in [177], who considered the case when the out-
comes and functional predictors are observed longitudinally, but not at the same visits
(asynchronously). Such data structures are increasingly prevalent in large observational
studies.
9
Clustering of Functional Data
FIGURE 9.1: Each line represents the weekly excess mortality per one million residents for
each state in the US and two territories (District of Columbia and Puerto Rico). Five states
are emphasized: New Jersey (green), Louisiana (red), Maryland (blue), Texas (salmon), and
California (plum).
Before going into technical details, consider the example of weekly excess mortality in
2020 for all states and two territories (District of Columbia and Puerto Rico) introduced in
Chapter 1. Recall that for every week, the data represent the difference in total mortality
between a specific week in 2020 and the corresponding week in 2019. Excess mortality is
divided by the state population and multiplied by one million. Therefore, the resulting data
are the weekly excess mortality per one million residents. Figure 9.1 displays these data,
where each line corresponds to a state or territory. For presentation purposes five states
are emphasized: New Jersey (green), Louisiana (red), Maryland (blue), Texas (salmon), and
California (plum). The difference from Chapter 1 is that we are looking at the weekly excess
mortality and not the cumulative excess mortality.
A quick inspection of Figure 9.1 indicates that (1) for long periods of time, many states
seem to have similar excess mortality rate patterns (note the darker shades of gray that form
due to trajectory overlaps in the interval [−50, 50] excess deaths per one million residents);
(2) some states, including New Jersey, Louisiana, and Maryland, have much higher peaks in
the April–June, 2020 period with weekly excess mortality above 50 per one million residents;
(3) some states, including Texas and Louisiana, have higher peaks in the July–August, 2020
periods with weekly excess mortality between 50 and 100 per one million residents; and (4)
most states have a higher excess mortality in December 2020, with some states exceeding
100 excess deaths per week per one million residents. California (plum), which for most of
the year tracked pretty closely with the median excess mortality, has a larger jump towards
the end of the year.
Thus, visual inspection of such data suggests how the idea of “clustering” appears nat-
urally in the context of functional data analysis. In some cases one can clearly observe
different characteristics of a subgroup (e.g., states with a large increase in weekly excess
FIGURE 9.2: Each panel contains weekly excess mortality for each state in the US for one
week (x-axis) versus another week (y-axis). Each dot is a state or territory. Weeks shown
are 10 (ending March 7, 2020), 20 (ending May 16, 2020), 30 (ending July 25, 2020), 40
(ending October 3, 2020), and 50 (ending December 12, 2020).
mortality rate between April and June). In other cases, one may wonder whether additional
subgroup structure may exist even when none is obvious on a line plot. Indeed, plots such
as shown in Figure 9.1 can obscure temporal patterns simply due to over-plotting in areas
of high density of observations.
To further explore the data structure, Figure 9.2 provides a different perspective on the
same data shown in Figure 9.1. More precisely, each panel contains weekly excess mortality
for each state in the US for one week (x-axis) versus another week (y-axis). Each dot is a
state or territory. Weeks shown are 10 (ending March 7, 2020), 20 (ending May 16, 2020),
30 (ending July 25, 2020), 40 (ending October 3, 2020), and 50 (ending December 12, 2020).
The panel columns correspond to data for weeks 10 (first column), 20 (second column), 30
(third column), and 40 (fourth column) on the x-axis. The panel rows correspond to data
for weeks 20 (first row), 30 (second row), 40 (third row), and 50 (fourth row) on the y-axis.
When week 10 is displayed on the x-axis (first panel column), almost all observations are
on the left side of the plots. This indicates that for the week ending on March 7, 2020 there were few excess deaths in most states (mean 3.3, median 2.6 excess deaths per week per one million residents). However, one point stands out (notice the lone point to the right of
the main point clouds). This point corresponds to North Dakota and indicates 70.6 excess
deaths per one million residents. When looking more closely at North Dakota, four weeks
before and after week 10, we see the following weekly excess mortality numbers: 10.5, −17.0,
49.7, 10.5, 70.6, 39.2, 6.5, −30.1, −41.8. These are the data for consecutive weeks starting
with the week ending on February 8, 2020 and ending with the week ending on April 4,
2020. Week 10 (ending on March 7, 2020) is an outlier, though the end of February through
mid-March 2020 seems to correspond to an unusually high number of excess deaths in
North Dakota.
The panel in the first row and column corresponds to week 10 versus week 20. One
can visually identify several points close to the top of the point cloud. These are states
or territories that experienced a much sharper increase in excess mortality rates at the
beginning of the pandemic (week ending on April 16, 2020). In fact, for this week, there are
5 states or territories with an excess mortality rate larger than 100 per million residents: New
Jersey (104.5), Connecticut (106.3), Delaware (114.5), Massachusetts (122.1) and District
of Columbia (136.1).
The panel in the third row and column (week 30 on the x-axis and week 40 on the y-axis)
suggests the presence of two or possibly three clusters. Indeed, there seems to be a larger
group of states in the left lower corner. Another group of states have large excess mortality
rates at week 30: South Carolina (84.7), Louisiana (87.0), Texas (88.8), Arizona (107.0),
and Mississippi (119.3). The 5 states with the largest excess mortality rates at week 40 are:
Wyoming (53.2), Missouri (58.4), District of Columbia (58.9), Arkansas (72.3), and North
Dakota (75.8).
These simple exploratory techniques provide many insights, some of which were dis-
cussed and some that can be inferred from the plots. They also suggest that there may be
additional structure in the data, especially in the form of subgroups. That is, some states
may have trajectories with similar characteristics. So far, we have relied on intuition and
visual inspection and have not addressed the problem of distance between trajectories or
groups of trajectories.
9.2.1 K-means
K-means [88, 119, 183, 189] is one of the oldest and most reliable techniques for data
clustering and the function kmeans is part of the base stats library in R. In this section we
will focus first on the actual implementation and then we will describe the basic ideas and
implications for obtaining results.
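The sketch below is our own illustration (the seed, the number of clusters, and nstart are our choices); Wd denotes the states-by-weeks matrix of weekly excess mortality used throughout this chapter.

# A minimal K-means sketch on the matrix Wd (rows = states/territories,
# columns = weeks); seed and nstart are our choices
set.seed(2020)
km_fit <- kmeans(Wd, centers = 3, nstart = 50)

km_fit$cluster                    # cluster label for each state/territory
cluster_means <- km_fit$centers   # 3 x (number of weeks) matrix of cluster mean curves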
FIGURE 9.3: Each line represents the weekly excess mortality per one million residents for
each state in the US and two territories (District of Columbia and Puerto Rico). Each state
is clustered in one of three subgroups using K-means: purple (cluster 1), green (cluster 2),
and orange (cluster 3). Thicker lines of the same color are the cluster means.
(2) an average rate of excess deaths in July–October; and (3) a higher than average rate of
excess deaths in December 2020, with peaks that came close to those experienced by states
in cluster 1 in the April–June 2020 period.
Note that even within clusters, there is substantial heterogeneity between states and territories. However, at least on a first visual inspection, each of the three groups seems to be more homogeneous than their combination.
It is of interest to identify which states belong to each cluster. One way is to enumerate
them, but this is quite cumbersome even with just 52 states and territories. In cases when
the number of observations is very large, alternative strategies are necessary. Here we choose
to display the clusters on the US map.
Figure 9.4 displays the US map where each state and territory is colored according to
their estimated cluster membership. Even though geographic information was not used in
the clustering approach, the map indicates that clusters tend to have strong geographic
co-localization. For example, cluster 1 (purple) includes states in the Northeastern United
States, except Maine, Vermont, and New Hampshire. However, it also includes Michigan,
Illinois, and Louisiana. A closer inspection of Figure 9.1 reveals that Louisiana had a very
high excess mortality rate both in the April–June 2020 period (like the other states in
cluster 1) and in the July–November 2020 period (like the other states in cluster 2). The
reason for this cluster assignment is likely due to the fact that the first peak of the excess
mortality rate in Louisiana is higher and closer to the mean of cluster 1. Cluster 2 (green)
contains almost the entire Southeastern region of the US, Texas, Alaska, the Pacific West,
Utah, Colorado, Arizona, as well as Maine, Vermont and New Hampshire. Cluster 3 (orange)
FIGURE 9.4: US map of three estimated clusters for weekly excess mortality rates in 2020.
The method used was K-means on the data matrix where the observed data (without
smoothing) were used. Clusters are the same as in Figure 9.3 and cluster colors were main-
tained: purple (cluster 1), green (cluster 2), and orange (cluster 3). Thicker lines indicate
cluster centers.
contains a mix of states from the northern Intermountain Region and Midwest, except for
Illinois and Michigan.
is the mean of all vectors w_i in cluster k, that is, of all i with the property that C(i) = k, and |{i : C(i) = k}| is the number of elements in cluster k.
For a given clustering of the data, C(·), the objective function in equation (9.1) is the sum over all clusters of the within-cluster sums of squared distances between functions and the corresponding cluster center (mean). A possible solution of the minimization problem would be to enumerate all partitions and calculate the sum of squares for each partition. However, this quickly runs into computational problems as the number of observations, n, and the number of clusters, K, increase.
The general idea of the K-means algorithm is to: (1) start with a set of group centers;
(2) identify all functions that are closest to these centers; (3) calculate the means of these
groups; and (4) iterate.
There are many variations for each of these steps, which leads to a wide variety of K-
means clustering approaches. For example, the default function kmeans in R uses a different
heuristic for updating cluster membership. In particular, for a given clustering it searches
functions that maximally reduce the within-cluster sum of squares by moving to a different
cluster and updates the clusters accordingly. Another variation is on the metric used. For
example, changing the L2 distance to the L1 distance in equation (9.1) leads to K-medians and
K-medoids [132, 146]. It is worth noting that “medoids” refers to functions that exist in
the original data set, whereas medians and means might not. This could be preferred for
interpretation purposes because the cluster centers are actually functions. An alternative
would be to use K-means or K-medians and use the closest functions to the estimated cluster
centers as examples of “central behavior” within the cluster.
The distance function dist transforms the data into a matrix of mutual distances between its rows. The default distance is "euclidean", though many other distances can be used, including "maximum", "binary", and "minkowski". Here we use the square of the Euclidean distance. The matrix of mutual distances is stored in the
matrix dM. The next step is to apply hierarchical clustering to the distance matrix dM using
the R function hclust. The algorithm starts by aggregating the closest functions (points in
Rp ) into clusters and then proceeds by aggregating clusters based on the mutual distance
between clusters. Given two clusters of functions (points), there are many different ways of
defining a distance between them. Here we use a method that minimizes the within-cluster
variance, indicated as method="ward.D2". We will discuss other methods for calculating
distances between clusters in Section 9.2.2.2.
The next step is to cut the tree (dendrogram) and provide the estimated clustering for a
given number of clusters; in our case, K = 5. This time we have chosen five clusters because, when we considered only three clusters, two of them were very small: one comprised only New Jersey and another comprised North and South Dakota, with every other state in the third cluster. Therefore, to obtain a split more comparable to the one obtained using K-means, we opted for K = 5.
The last line of code indicates how to obtain a plot of the hierarchical clustering shown
in Figure 1.4. To obtain a prettier plot, we have used the package dendextend [90], which
allowed for substantial customization. Please see the accompanying GitHub repository for
details on how to make publication-ready dendrogram plots. However, when conducting
analysis, the plot(hc) option is fast and easy to use.
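Gathering the steps just described, a minimal sketch (our own, not the book's original code block) is:

# Squared Euclidean distances between states, Ward's linkage, and a cut at K = 5
dM   <- dist(Wd, method = "euclidean")^2   # squared Euclidean distance matrix
hc   <- hclust(dM, method = "ward.D2")     # Ward's minimum-variance linkage
clus <- cutree(hc, k = 5)                  # cluster membership for each state
plot(hc)                                   # quick base-R dendrogram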
Another useful way to visualize the data is to combine the heatmap of the data with the
estimated hierarchical clustering of the rows (in our case, states and territories). Figure 9.6
displays the entire data set stored in the matrix Wd, where states are displayed by row
(note the labels on the right side of the plot). Colors correspond to the weekly excess
mortality rates with stronger shades of yellow corresponding to higher rates. The hierarchical
dendrogram from Figure 9.5 is shown on the left side of the plot. The colors of the estimated clusters are preserved and are shown from left to right in Figure 9.5 and from top to bottom
in Figure 9.6. For rendering this plot we have used the functions Heatmap in the package
ComplexHeatmap [111] as well as the package circlize [112]. A problem with rendering
heatmaps is that outliers (such as the very high excess mortality rates in New Jersey at the
beginning of the pandemic), can make heatmaps appear “washed out.” To show details in
the middle of the distribution, as well as the outliers, colors need to be mapped using color
palettes with breaks that account for outliers.
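A hedged sketch of such a call is shown below; the color breaks and palette are our choices for illustration and not the book's.

# A hedged sketch of the heatmap rendering: explicit color breaks via
# circlize::colorRamp2() keep outliers from washing out the display
library(ComplexHeatmap)
library(circlize)

col_fun <- colorRamp2(c(-50, 0, 50, 150), c("navy", "white", "gold", "yellow"))

Heatmap(
  Wd,
  col             = col_fun,
  cluster_rows    = as.dendrogram(hc),   # hierarchical clustering of the states
  cluster_columns = FALSE,               # keep calendar time along the x-axis
  name            = "Excess mortality"
)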
In this type of hierarchical clustering, New Jersey stands out in one cluster and North
and South Dakota in another. This is different from what we have observed using K-means
clustering. We are already familiar with the trajectory of excess mortality in New Jersey,
but North and South Dakota do stand out in Figure 9.6. Indeed, for most of the year the
excess mortality rates in the two states were among the lowest in the country. However, the
excess mortality rates in both states spiked dramatically between the middle of October to
the beginning of December. The Sturgis Motorcycle Rally took place in South Dakota from
FIGURE 9.5: Hierarchical clustering of US states and territories based on weekly excess mortality in 2020. Clusters are colored from left to right: salmon (cluster 1), yellow (cluster 2), green (cluster 3), violet (cluster 4), and darkorange4 (cluster 5). Hierarchical clustering was applied to the observed data matrix (no smoothing) with squared Euclidean distance and Ward's between-cluster distance.
August 7-16, 2020. This is consistent with the characterization of this event as a super-spreader event by the CDC response team [36]. Our results indicate that the local communities (North and South Dakota) might have been hit particularly hard.
Looking more closely at cluster 1 (salmon), it seems to contain states with lower overall
weekly mortality rates during the entire year. It includes North Carolina, Vermont, New
Hampshire, Hawaii, Alaska, Maine, Puerto Rico, Virginia, Utah, Oregon and Washington.
Cluster 2 (yellow) contains two states, North and South Dakota, and was discussed earlier.
Cluster 3 (green) is a very large cluster and contains both states that had a higher excess
mortality rate during the summer (e.g., Florida, South Carolina, Texas) and states that
experienced high rates of excess mortality towards the end of the year. Cluster 4 (violet)
comprises only New Jersey. Cluster 5 (dark orange) contains states that had high excess
mortality rates in the spring with a moderate increase in late summer and early winter.
Note that the Heatmap() function has a cluster_rows option, which re-arranges the rows to represent a particular cluster structure (e.g., estimated by the function hclust). The same thing can be done for clustering columns using the cluster_columns option. For the purpose of this plot we suppressed this option using cluster_columns = FALSE to keep calendar time along the x-axis. However, it may be interesting to cluster the weeks to identify periods of time with similar behavior.
To better illustrate the geography of the clusters, Figure 9.7 displays the same clusters
as Figures 9.5 and 9.6, but mapped onto the US map. The result is not identical to the
FIGURE 9.6: Heatmap of weekly excess mortality rates in US states and two territories.
The dendrogram from Figure 9.5 is appended to the left of the figure with the same cluster
colors as in Figure 9.5.
one obtained from K-means with three clusters, though some similarities are apparent. For
example, cluster 5 in Figure 9.7 has substantial overlap with cluster 1 in Figure 9.4. This
is the cluster with higher excess mortality rates in the spring of 2020. However, cluster 1 in
Figure 9.7 contains a combination of states that were in clusters 1 and 2 in Figure 9.4. These
are the states with low excess mortality rates for the entire year. Cluster 3 in Figure 9.7 is
a very large cluster and combines many of the states in clusters 2 and 3 in Figure 9.4.
FIGURE 9.7: US map of five estimated clusters for weekly excess mortality rates in 2020. The method used was hierarchical clustering with squared Euclidean distance between states and Ward's minimum variance method (method="ward.D2") between groups. The observed data matrix (without smoothing) was used as input. Cluster colors are the same as in Figures 9.5 and 9.6.
functions:

d(i_1, i_2) = Σ_{j=1}^p {W_{i_1}(s_j) − W_{i_2}(s_j)}² = ||W_{i_1} − W_{i_2}||².
However, many different types of distances can be used and many are already coded as part
of the dist option for the hclust function in R. Of course, additional types of distances can
be used. The ward.D2 method for combining clusters uses the following distance between
clusters Ck1 and Ck2
d(k1 , k2 ) = ||Wi − µk1 ,k2 ||2 ,
i∈Ck1 ∪Ck1
where µk1 ,k2 is the mean of the Wi such that i ∈ Ck1 ∪ Ck1 . This distance is the sum of
squares of residuals after combining the two clusters. The idea is to first combine clusters
that have small combined sum of squares (dispersion) and then continue aggregating clus-
ters with the smallest combined sum of squares. Other well-known methods for combining
clusters include the following (see the R sketch after this list):
• Single linkage clustering, where the distance between clusters is the smallest distance between any pair of observations, one in each cluster,
$$d(k_1, k_2) = \min_{i_1 \in C_{k_1},\, i_2 \in C_{k_2}} d(W_{i_1}, W_{i_2}) ;$$
• Unweighted pair group method with arithmetic mean (UPGMA) [277], where the distance between clusters is the average of all pairwise distances,
$$d(k_1, k_2) = \frac{1}{|C_{k_1}|\,|C_{k_2}|} \sum_{i_1 \in C_{k_1}} \sum_{i_2 \in C_{k_2}} d(W_{i_1}, W_{i_2}) .$$
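In R, these choices correspond to the method argument of hclust; the sketch below uses the excess mortality matrix Wd and is an illustration rather than the book's code.
#Sketch: hierarchical clustering of states with different linkage methods
d <- dist(Wd, method = "euclidean")        #pairwise distances between states (rows of Wd)
hc_ward <- hclust(d, method = "ward.D2")   #Ward's minimum variance method
clusters <- cutree(hc_ward, k = 5)         #five clusters, as in Figure 9.7
hc_single <- hclust(d, method = "single")  #single linkage
hc_upgma <- hclust(d, method = "average")  #UPGMA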
FIGURE 9.8: US map of four estimated clusters for weekly excess mortality rates in 2020.
The method used was the Gaussian mixture model (implemented in the Mclust() function
in the mclust package) on the data matrix where the observed data (without smoothing) were
used.
where θ = (π_1, . . . , π_K, µ_1, . . . , µ_K, σ²), π_k ≥ 0 for k = 1, . . . , K, and ∑_{k=1}^{K} π_k = 1. This
distribution is a mixture of normal densities where the weights for each density, πk , represent
the proportion of study participants in each cluster.
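The Gaussian mixture clustering displayed in Figure 9.8 can be obtained with the Mclust() function; the following is a minimal sketch, where G = 4 matches the number of clusters in the figure and the object name gmm_fit is ours.
#Sketch: Gaussian mixture clustering of the rows of Wd
library(mclust)
gmm_fit <- Mclust(Wd, G = 4)               #mixture of multivariate normals, four clusters
cluster_labels <- gmm_fit$classification   #hard cluster assignments for each state
mixing_props <- gmm_fit$parameters$pro     #estimated mixing proportions pi_k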
Denote by B = [0, ε] × · · · × [0, ε] a neighborhood of 0_p in R^p. The proof for this result is
$$P(W_i \in w_i + B) = \sum_{k=1}^{K} P(W_i \in w_i + B, C_i = k) = \sum_{k=1}^{K} P(W_i \in w_i + B \mid C_i = k)\, P(C_i = k) = \sum_{k=1}^{K} \pi_k\, P(W_i \in w_i + B \mid C_i = k) .$$
Dividing both the left and right sides of the equation by ε^p and letting ε → 0 proves the result in equation (9.2). Using similar arguments we can show that the conditional distribution of
the cluster assignment variables Ci given the data Wi = wi and the other parameters is a
categorical distribution with probabilities
$$P(C_i = k \mid W_i = w_i, \theta) = \frac{\pi_k\, \phi(w_i \mid \mu_k, \sigma^2)}{\sum_{l=1}^{K} \pi_l\, \phi(w_i \mid \mu_l, \sigma^2)} , \qquad (9.3)$$
where µ = (µ_1^t, . . . , µ_K^t)^t is the collection of estimated cluster centers. The proof of this result follows from the identity (for brevity, we omit the conditioning on θ)
$$P(C_i = k \mid W_i \in w_i + B) = \frac{P(C_i = k, W_i \in w_i + B)}{P(W_i \in w_i + B)} = \frac{P(W_i \in w_i + B \mid C_i = k)\, P(C_i = k)}{\sum_{l=1}^{K} P(W_i \in w_i + B \mid C_i = l)\, P(C_i = l)} .$$
The result can be obtained by dividing both the left and right sides of the equation by ε^p and letting ε go to 0.
Note that this is a different way of allocating study participants to clusters. In K-means
one typically assigns observations to the closest mean, µk . In mixture distribution clustering,
observations are randomly assigned with the probabilities described in equation (9.3). As
we will see, we can quantify the joint posterior distribution of these cluster indicators.
The standard approach for mixture distribution clustering is to use Expectation Max-
imization [61], where the cluster membership is treated as missing data. However, here
we will take a Bayesian approach, which will allow a close investigation of full conditional
distributions.
We start with setting the prior distribution for π = (π1 , . . . , πK ) as a Dirichlet distribu-
tion denoted by Dir(α, . . . , α), where α > 0. For the cluster centers we can assume that they
are a priori independent with µk ∼ N (0p , σ02 Ip ), where 0p is a p-dimensional vector of zeros,
Ip is the p-dimensional identity matrix and σ0 is large. We will discuss later restrictions on
these priors, but keep this form for presentation purposes. Finally, the prior distribution for 1/σ² is Γ(a, b) with small a and b. Here the parameterization of the Γ(a, b) distribution is such that its mean is a/b and its variance is a/b² (the shape and rate parameters in R).
With these assumptions, the full likelihood of the observed and missing data is
$$\Big\{\prod_{i=1}^{n} [W_i \mid C_i, \mu, \sigma^2]\, [C_i \mid \pi]\Big\} [\pi][\mu][\sigma^2] , \qquad (9.4)$$
where we followed the Bayesian notation in which the bracket sign [·] denotes a distribution and [y|x] denotes the conditional distribution of variable y given variable x.
From this likelihood we can derive the full conditional distributions for each model
parameter. As will be seen, all the full conditionals are closed form, making simulations from
the posterior distribution relatively easy using Gibbs sampling [95]. Indeed, this avoids the
Rosenbluth-Teller algorithm (also known, with some substantial controversy, as the Metropolis-Hastings algorithm) [113, 200, 256].
A. The full conditional of [C_i|others] for i = 1, . . . , n is proportional to
$$[W_i \mid C_i, \mu, \sigma^2]\, [C_i \mid \pi] ,$$
which is a categorical variable with K categories and probabilities for each category given by (9.3). Therefore, once these probabilities are calculated, the cluster assignments C_i are simulated using the sample function in R.
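For one study participant, this step could look as follows; this is a sketch, where mix_prop denotes the current vector (π_1, . . . , π_K) and dens_k the vector of densities φ(w_i|µ_k, σ²), both hypothetical object names.
#Sketch of step A for one study participant
post_prob <- mix_prop * dens_k             #pi_k * phi(w_i | mu_k, sigma^2), k = 1, ..., K
post_prob <- post_prob / sum(post_prob)    #probabilities in (9.3)
C_i <- sample(1:K, size = 1, prob = post_prob)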
B. The full conditional of [π|others] is proportional to
$$\Big\{\prod_{i=1}^{n} [C_i \mid \pi]\Big\}[\pi] = \pi_1^{N_{C,1}+\alpha-1} \cdots \pi_K^{N_{C,K}+\alpha-1} \propto \mathrm{Dir}(N_{C,1}+\alpha, \ldots, N_{C,K}+\alpha) ,$$
where N_{C,k} = |{i : C_i = k}| is the number of members in cluster k and |A| denotes the number of elements in set A.
C. The full conditional of [µ_k|others] is proportional to
$$\Big\{\prod_{\{i: C_i = k\}} [W_i \mid C_i, \mu, \sigma^2]\Big\}[\mu_k] .$$
Denote by S_{C,k} = ∑_{\{i: C_i = k\}} W_i the sum of observations in cluster k. As all distributions in this product are Gaussian, the full conditional [µ_k|others] is also Gaussian and given by
$$[\mu_k \mid \text{others}] = N\Big( \frac{S_{C,k}}{N_{C,k} + \sigma^2/\sigma_0^2}, \; \frac{1}{N_{C,k}/\sigma^2 + 1/\sigma_0^2} I_p \Big) .$$
Note that when σ_0 is large relative to σ², the mean of the full conditional is approximately S_{C,k}/N_{C,k}, the mean of the observations in cluster k. This is not dissimilar from the same
step in the K-means algorithm. An important difference is that in this context the cluster
means are treated as random variables and they have a joint posterior distribution given
the data.
D. The full conditional of [σ²|others] is proportional to
$$\Big\{\prod_{i=1}^{n} [W_i \mid C_i, \mu, \sigma^2]\Big\}[\sigma^2] \propto \Big(\frac{1}{\sigma^2}\Big)^{np/2} \exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n} \|W_i - \mu_{C_i}\|^2\Big\}[\sigma^2] ,$$
where µ_{C_i} denotes the center of the cluster to which observation i is currently assigned. Moreover, we assumed that σ² has an inverse Gamma prior distribution with parameters a and b. Thus,
$$[\sigma^2] \propto \Big(\frac{1}{\sigma^2}\Big)^{a+1} \exp(-b/\sigma^2) .$$
Multiplying these two distributions shows that
$$[\sigma^2 \mid \text{others}] = \text{Inverse Gamma}\Big(a + np/2, \; b + \sum_{i=1}^{n} \|W_i - \mu_{C_i}\|^2/2\Big) .$$
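Putting steps A-D together yields a Gibbs sampler. The sketch below is our own illustration of how these updates could be combined and is not the book's code; the data matrix W, the number of clusters K, and the hyperparameter values are illustrative assumptions.
#Sketch of a Gibbs sampler built from the full conditionals A-D
#Assumes W is an n x p data matrix; K, alpha, sigma0sq, a, b are illustrative choices
set.seed(1)
n <- nrow(W); p <- ncol(W)
K <- 5; alpha <- 1; sigma0sq <- 100; a <- 0.01; b <- 0.01
n_iter <- 2000
#Initial values
C <- sample(1:K, n, replace = TRUE)       #cluster assignments
mu <- W[sample(1:n, K), , drop = FALSE]   #cluster centers (K x p)
sigma2 <- var(as.vector(W))
pi_vec <- rep(1 / K, K)
for (it in 1:n_iter) {
  #A. Update cluster assignments using the probabilities in (9.3)
  for (i in 1:n) {
    log_dens <- -colSums((t(mu) - W[i, ])^2) / (2 * sigma2) + log(pi_vec)
    prob <- exp(log_dens - max(log_dens))
    C[i] <- sample(1:K, size = 1, prob = prob / sum(prob))
  }
  #B. Update mixing proportions from Dir(N_{C,1} + alpha, ..., N_{C,K} + alpha)
  N_k <- tabulate(C, nbins = K)
  gam <- rgamma(K, shape = N_k + alpha, rate = 1)
  pi_vec <- gam / sum(gam)
  #C. Update each cluster center from its Gaussian full conditional
  for (k in 1:K) {
    S_k <- colSums(W[C == k, , drop = FALSE])
    post_var <- 1 / (N_k[k] / sigma2 + 1 / sigma0sq)
    post_mean <- S_k / (N_k[k] + sigma2 / sigma0sq)
    mu[k, ] <- rnorm(p, mean = post_mean, sd = sqrt(post_var))
  }
  #D. Update sigma^2 from its inverse Gamma full conditional
  ss <- sum((W - mu[C, , drop = FALSE])^2)
  sigma2 <- 1 / rgamma(1, shape = a + n * p / 2, rate = b + ss / 2)
}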
In some situations, the model is not identifiable or is very close to being not identifiable. Indeed, consider the case when data are simulated from a univariate normal distribution and we are fitting a mixture of two or more normal distributions. It is possible for some of the clusters to become empty, that is, N_{C,k} = 0. In this situation S_{C,k}/N_{C,k} is undefined and the variance of [µ_k|others] is σ_0², which is very large. This leads to substantial instabilities in the algorithm. Some solutions for stabilizing the Gibbs sampler include (1) imposing restrictions on the prior of µ_k, k = 1, . . . , K; (2) simulating from the Normal prior and taking the closest observation W_i as the mean; and (3) analyzing the simulated chains for high correlations between cluster means.
#Load refund
library(refund)
#Set the time grid (weeks)
t <- 1:dim(Wd)[2]
Recall that the data are stored in the 52 × 52 dimensional matrix Wd. The code starts by
loading the refund package and setting the grid of observations where data are recorded. In
this case, the grid is equally spaced because data are observed every week for 52 weeks. Ap-
plying FPCA is just one line of code (two lines in this text due to space restrictions). The first
argument is the data Y=Wd, which is a matrix with study participants by row and functional
observations by column. The second argument, Y.pred=Wd, indicates that we would like
to obtain the smooth predictors of all input functions. The third argument center=TRUE
indicates that PCA is conducted after centering the data. The argument argvals=t is
self-explanatory, while knots=35 is the maximum number of knots used in the univariate
spline smoother. The argument pve=0.99 indicates that the number of principal components
used corresponds to a 0.99 proportion of variability explained (PVE) after smoothing the
covariance operator. That is, percent variability explained after removing the noise variabil-
ity. The argument var=TRUE returns the variability for each study participant, which can
be used to obtain confidence intervals for the participant-specific function estimate.
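The call described above is consistent with the fpca.face function in the refund package; the following sketch is our reconstruction of what it presumably looked like, and the object name fit_fpca is an assumption.
#Sketch of the FPCA step described above (our reconstruction using refund::fpca.face)
fit_fpca <- fpca.face(Y = Wd, Y.pred = Wd, center = TRUE,
                      argvals = t, knots = 35, pve = 0.99, var = TRUE)
#Useful components of the returned object:
#fit_fpca$efunctions: estimated eigenfunctions (principal components)
#fit_fpca$evalues: estimated eigenvalues
#fit_fpca$scores: study participant-specific principal component scores
#fit_fpca$Yhat: smooth predictions of the input functions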
FIGURE 9.9: Smooth covariance matrix estimator for the 2020 weekly excess mortality
rates in US states and territories. Each row and column corresponds to one of the 52 weeks
of 2020 starting from January.
FIGURE 9.10: Smooth correlation matrix estimator for the 2020 weekly excess mortality
rates in US states and territories. Each row and column corresponds to one of the 52 weeks
of 2020 starting from January.
The covariance surface in Figure 9.9 indicates large positive covariances among observations within the weeks 10-20 (March 7 to May 16) and within the weeks 44-52 (October 31 to December 26). This is consistent with the spring and
winter surges in excess mortality. The first surge affected more US states in the Northeast,
whereas the second surge affected more states in the Intermountain and Midwest regions.
The covariance surface also indicates large negative covariances between observations
during the weeks 10-20 (March 7 to May 16) and the period after; note the darker shades of
blue to the right of the high covariances corresponding to the first surge in excess mortality
rates. The covariance matrix is often driven by the size of the observations at a particular
time. This is why the spring and winter surges are so clearly emphasized in Figure 9.9.
However, this makes it hard to understand and visualize the correlation of the data in periods when excess mortality is away from its peaks.
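Although the code that produced them is not shown here, the covariance and correlation surfaces could be reconstructed from the FPCA fit as follows; this is a sketch under the assumption that fit_fpca is the object from the earlier call, and cov2cor is from base R.
#Sketch: smoothed covariance and correlation surfaces from the FPCA fit
cov_mat <- fit_fpca$efunctions %*%
  diag(fit_fpca$evalues, nrow = length(fit_fpca$evalues)) %*%
  t(fit_fpca$efunctions)
cor_mat <- cov2cor(cov_mat)                #correlation matrix displayed in Figure 9.10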
To illustrate this, Figure 9.10 displays the correlation matrix stored in the cor_mat variable. Note that some characteristics of the covariance function are preserved, though
the visuals are strikingly different. First, the main diagonal clearly indicates three periods of
mutual high correlations (spring, summer, and fall/winter). Only two were clearly identified
in the covariance plot. These three periods are so clearly identifiable because the correlations
between the beginning and end of these periods quickly transition from strongly positive
to close to zero or negative. This is illustrated by the almost rectangular shape of the high
correlations in these three periods. From our previous analysis of the data this corresponds
to the fact that states that have a large increase in excess mortality rates in one period tend
to do better in the other periods. This is especially true for the spring surge, indicating that
states in the Northeast, which had the worst surge in excess mortality rates during spring, tended to do much better in the summer and fall.
FIGURE 9.11: First three functional principal components of the excess mortality data in the US in 2020. The x-axis corresponds to time expressed in weeks starting with the first week in January.
Figure 9.11 displays the first three functional principal components (PCs), which explain
86% of the variability. The first PC (shown in green) has a substantial increase between
March and May, is close to zero, and even becomes negative after September. This indicates
that states that have large positive scores on this component had large excess mortality rates in spring, though their excess mortality stayed at or below average for the remainder
of 2020. New Jersey is the state with the largest score on this component. The second PC
has a smaller bump in excess mortality in spring, a dip towards zero in the summer, and a
larger bump in late fall and early winter. States with large positive scores on this component
had a smaller excess mortality rate in the spring, stayed close to the average during the
summer, and had a higher than average excess mortality rate in late fall and early winter.
South Dakota is the state with the largest score on this component. The third PC decreases steadily from March to June, with a strong dip between the end of June and September, a slow increase through November, and a fast decrease through December. States with large negative scores on this component had a steady, slow increase in excess mortality rates from March to June, a sudden increase in the summer, an average or below-average excess mortality rate in November, and a turn for the worse in the last few weeks of the year. Arizona is the
state with the largest score on this component.
These analyses again emphasize the principal directions of variation in the functional
space. They are quite interpretable and consistent with the other analyses of the data.
Each function can now be represented as the set of scores on the principal components.
The idea is to use these scores for clustering instead of the original data. The R code below indicates how to obtain the principal scores, which are stored in PC_scores. This is a
FIGURE 9.12: Scatter plots of principal component scores on the first three functional
principal components of mortality data in the US in 2020. Principal component one scores
are shown on the x-axis and the principal component scores on two and three are shown on
the y-axis in the top and bottom panel, respectively.
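The scores shown in Figure 9.12 can be extracted from the FPCA fit and used as clustering inputs; the sketch below assumes the fit_fpca object from above, and the choice of five clusters is illustrative.
#Sketch: extract the principal component scores and cluster using them
PC_scores <- fit_fpca$scores
km_scores <- kmeans(PC_scores, centers = 5, nstart = 50)
table(km_scores$cluster)                   #number of states in each estimated cluster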
FIGURE 9.13: An example of synthetic functional data with two known groups shown in
red and blue respectively. The first group consists of constant functions and the second
group consists of sinusoidal functions with different amplitudes.
In this example, there is not much estimated residual variability. Indeed, the noise vari-
ance is estimated to be only ∼ 0.5% of the overall variability. Thus, decomposing the
observed covariance into principal components reduces the dimensionality, but does not
substantially impact the smoothness of the data. So, when would functional smoothing
actually improve results?
where ε_ij ∼ N(0, 1) are mutually independent random errors. Therefore, for each simulation
and value of σ the simulated data is a matrix W of dimension 202 × 101, where the first
101 rows correspond to the functions in cluster 1 and the remaining rows correspond to the
functions in cluster 2. Larger values of σ correspond to more added noise and a value of
zero corresponds to no noise. For each simulated data instance we applied two clustering
approaches, both based on K-means with two clusters (the true number of clusters in the
data). The first uses the raw, simulated data W, while the second uses the scores on
the functional principal components that explain 99% of the variability after smoothing
the covariance function.
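For a single simulated data set and a given σ, the two approaches could be run as follows; this is a sketch, the noisy data matrix W and the grid t_grid are assumed to exist, and the data-generating step is not reproduced here.
#Sketch: two clustering approaches for one simulated data set
km_raw <- kmeans(W, centers = 2, nstart = 50)               #K-means on the raw noisy curves
fit_sim <- fpca.face(Y = W, argvals = t_grid, pve = 0.99)   #FPCA with 99% PVE (refund)
km_fpca <- kmeans(fit_sim$scores, centers = 2, nstart = 50) #K-means on the FPCA scores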
Figure 9.14 displays the misclassification rate for K-means clustering using the raw,
unsmoothed data (green), and the scores obtained from functional PCA (red). The two
lines are the averages over 500 simulations for each value of σ. The x-axis is the standard
deviation of the added noise, σ. As expected, both approaches have zero misclassification
rate when σ = 0 and very small and comparable misclassification rates when σ = 1, 2.
However, for larger values of noise, clustering using FPCA has lower misclassification rates.
For example, when σ = 4 the misclassification rate when using the raw data is 0.39 compared
to 0.26 when using FPCA scores.
This simulation study indicates that when data have substantial noise it may be a
good idea to conduct smoothing first using, for example, FPCA and then use clustering.
From a practical perspective, we suggest applying both approaches, comparing results, and
analyzing discrepancies between results. These discrepancies should be small when the noise
is small, but could be substantial when noise is large.
There is still the small issue of how misclassification rate is actually defined. This sounds
like a minor detail, but it is one that deserves some attention.
FIGURE 9.14: Misclassification rates using K-means with two clusters on the raw data (green) and FPCA scores with 99% of variance explained (red). True data are shown in Figure 9.13 and simulated data were obtained by adding independent Normal white noise with standard deviation σ at 101 equally spaced points around each curve. The x-axis shows σ and the y-axis shows the average misclassification rate over 500 simulations.
Consider our case when we know the true labels, say TL_1, . . . , TL_n, and we use a clustering algorithm that provides
another set of labels, say C_1, . . . , C_n. In our case there are two clusters and the first 101 true labels are “Red” and the last 101 labels are “Blue”. The clustering algorithm provides
labels Ci ∈ {1, 2}. There are two possibilities, one where “Red” corresponds to cluster “1”
and one where it corresponds to cluster “2”. The only way to know which makes more sense
is to investigate if “Red”=“1” or “Red”=“2” leads to fewer misclassified individuals. We
define the misclassification rate as
$$\min[A, B]/n ,$$
where A is the number of misclassified individuals when “Red” is matched to cluster “1” and B is the number of misclassified individuals when “Red” is matched to cluster “2”.
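For two clusters, this quantity can be computed by trying both matchings; the helper function below is our own sketch, using the convention 1 = "Red" and 2 = "Blue" and, as an example, the km_raw object from the earlier sketch.
#Sketch: misclassification rate for two clusters (our helper; 1 = "Red", 2 = "Blue")
misclass_rate <- function(true_labels, est_labels) {
  A <- sum(true_labels != est_labels)       #misclassified when "Red" is matched to "1"
  B <- sum(true_labels != (3 - est_labels)) #misclassified when "Red" is matched to "2"
  min(A, B) / length(true_labels)
}
true_labels <- rep(c(1, 2), each = 101)
misclass_rate(true_labels, km_raw$cluster)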
FIGURE 9.15: Smooth predictions of the CD4 counts data using the fpca.sparse function
in R. Colors indicate the estimated clusters for each study participant. Thicker lines of a
darker shade of the same color indicate the estimated cluster centers.
FIGURE 9.16: Nine estimated clusters for daily physical activity measured in minute-level
MIMS in the NHANES 2011–2014 study. The method used was K-means on the physical
activity matrix. The clusters were ordered based on the average age of study participants within each cluster. The proportion of participants in each cluster is shown in the title of each panel.
We also consider the case of NHANES data. Here we use the same NHANES dataset
as that introduced in Chapter 7, which contains 8,713 study participants with time to all-cause mortality information. To start, we take the study participant-specific average
physical activity at every time point (minute) of the day over eligible days. These average
trajectories are clustered using K-means with nine clusters. Clusters are ordered in terms
of the average age of study participants within the cluster from youngest (Cluster 1) to
oldest (Cluster 9). However, age was not used to perform K-means clustering. Labeling of
clusters is not unique, which is why one needs to be precise when defining what is meant
by “Cluster 4.”
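A sketch of this clustering step is shown below, assuming a matrix avg_act with one row per study participant (8,713 × 1,440, average MIMS at each minute of the day) and a vector age of participant ages; both object names are assumptions.
#Sketch: K-means with nine clusters on average daily activity profiles
set.seed(1)
km_act <- kmeans(avg_act, centers = 9, nstart = 50)
#Re-label clusters from youngest to oldest average age (age is not used by K-means)
avg_age <- tapply(age, km_act$cluster, mean)
new_label <- rank(avg_age)                 #1 = youngest, ..., 9 = oldest
cluster_ordered <- new_label[km_act$cluster]
prop_cluster <- round(100 * table(cluster_ordered) / length(cluster_ordered), 2)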
The cluster centers are displayed in Figure 9.16. Each cluster is accompanied by the
percent of study participants in that cluster. For example, 6.31% of the study participants are in Cluster 1 and 10.83% are in Cluster 9. While most clusters have a clear pattern of higher activity during the day, Cluster 2, which contains 4.46% of the population, seems to correspond to individuals who, on average, have higher activity during the night. These may be primarily night-shift workers. There is a striking difference between the physical activity of individuals in Clusters 1 and 9. This is likely due to the fact that study participants in Cluster 9 are older, may suffer from more severe disease, and/or may be at higher risk of death.
To further investigate the composition of each cluster, Table 9.1 displays some summary
characteristics within each cluster. The first column provides the percent of individuals who
died in each cluster by December 31, 2019. Notice that in Cluster 1 with an average age of
33.90 years, 2.2% of individuals died. This can be compared to 15.5% in Cluster 8 (average
age 59.26 years) and 34.9% in Cluster 9 (average age 60.99 years). While the average age
for Cluster 8 is about 1.7 years lower than in Cluster 9, this does not completely explain the
large difference in mortality. Individuals in Cluster 9 have a smaller Poverty-Income Ratio
(PIR) than those in Cluster 8. PIR is an index calculated by dividing family income by a
poverty threshold specific to the family size. Smaller PIR corresponds to being poorer.
As discussed, the center of Cluster 2 has a very different shape from the other centers, which may indicate that Cluster 2 includes night-shift workers. This is further supported
by the fact that the average age of individuals in this cluster is 35.36 years. This is much
younger than the average age of the NHANES sample (48.76 years old). Study participants in
Clusters 1, 3, and 5 have relatively lower BMI. In Figure 9.16 these three clusters correspond
to the highest average physical activity intensity values during the day. In contrast, study participants in Cluster 9 have the highest proportion of all-cause mortality, the highest average age, and the highest average BMI. Since the average physical activity intensity of this group is much lower across all times of the day, this pattern could be used to identify individuals who have elevated health and mortality risks.
Here we have used the means of physical activity profiles, though this is not neces-
sary. Indeed, a multilevel functional approach can be used to decompose functional data
at different levels (e.g., study participant average versus daily activity profile) and conduct
clustering separately at each level; see, for example, [137, 269]. This has important impli-
cations as two study participants may belong to a particular cluster based on their average
profile, while their day-to-day variation may place them in completely different clusters.
Whether this is of any relevance to their health remains to be investigated. But, then again,
something, somewhere, always does.
[1] A. Aguilera, F. Ocaña, and M. Valderrama. Forecasting with unequally spaced data
by a functional principal component approach. Test, 8(1):233–253, 1999.
[2] A. Ait-Saïdi, F. Ferraty, R. Kassa, and P. Vieu. Cross-validated estimations in the
single-functional index model. Statistics, 42(6):475–494, 2008.
[3] H. Akaike. A new look at the statistical model identification. IEEE Transactions on
Automatic Control, 19(6):716–723, 1974.
[4] M.R. Anderberg. Cluster Analysis for Applications. Academic Press: New York, 1973.
[5] M. Ankerst, M.M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering Points
To Identify the Clustering Structure. ACM SIGMOD international conference on
Management of data. ACM Press, pages 49–60, 1999.
[6] A. Antoniadis and T. Sapatinas. Estimation and inference in functional mixed-effects
models. Computational Statistics & Data Analysis, 51(10):4793–4813, 2007.
[7] E. Arias, B. Tejada-Vera, and F. Ahmad. Provisional life expectancy estimates for
January through June, 2020. Vital Statistics Rapid Release: National Vital Statistics
System, 10:1–4, 2021.
[8] J.A.D. Aston, J.-M. Chiou, and J.P. Evans. Linguistic pitch analysis using functional
principal component mixed effect models. Journal of the Royal Statistical Society:
Series C (Applied Statistics), 59(2):297–317, 2010.
[9] V. Baladandayuthapani, B.K. Mallick, M.Y. Hong, J.R. Lupton, N.D. Turner, and
R.J. Carroll. Bayesian hierarchical spatially correlated functional data analysis with
application to colon carcinogenesis. Biometrics, 64(1):64–73, 2008.
[10] D. Bates, M. Mächler, B. Bolker, and S. Walker. Fitting linear mixed-effects models
using lme4. Journal of Statistical Software, 67(1):1–48, 2015.
[11] A. Bauer, F. Scheipl, H. Küchenhoff, and A.-A. Gabriel. An introduction to semipara-
metric function-on-scalar regression. Statistical Modelling, 18(3-4):346–364, 2018.
[12] A. Baíllo and A. Grané. Local linear regression for functional predictor and scalar
response. Journal of Multivariate Analysis, 100(1):102–111, 2009.
[13] J.M. Beckers and R. Michel. EOF calculations and data filling from incomplete
oceanographic datasets. Journal of Atmospheric and Oceanic Technology, 20, 12 2003.
[14] R. Bender, T. Augustin, and M. Blettner. Generating survival times to simulate cox
proportional hazards models. Statistics in Medicine, 24(11):1713–1723, 2005.
[15] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: A practical and
powerful approach to multiple testing. Journal of the Royal Statistical Society: Series
B (Methodological), 57(1):289–300, 1995.
[16] U. Beyaztas and H.L. Shang. On function-on-function regression: Partial least squares
approach. Environmental and Ecological Statistics, 27:95–114, 2020.
[17] K. Bhaskaran, I. dos Santos-Silva, D.A. Leon, I.J. Douglas, and L. Smeeth. Association
of BMI with overall and cause-specific mortality: A population-based cohort study of
3.6 million adults in the UK. Lancet Diabetes & Endocrinology, 6:944–953, 2018.
[18] P. Billingsley. Convergence of Probability Measures, 2nd Edition. Probability and
Statistics. Wiley, 1999.
[28] T.T. Cai and M. Yuan. Nonparametric covariance function estimation for functional
and longitudinal data. Technical Report, 2010.
[29] E.J. Candes and Y. Plan. Matrix completion with noise. Proceedings of the IEEE,
98(6):925–936, 2010.
[30] E.J. Candes and T. Tao. The power of convex relaxation: Near-optimal matrix com-
pletion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
[31] H. Cardot, C. Crambes, and P. Sarda. Quantile regression when the covariates are
functions. Nonparametric Statistics, 17(7):841–856, 2005.
[33] H. Cardot, C. Goga, and P. Lardin. Uniform convergence and asymptotic confidence
bands for model-assisted estimators of the mean of sampled functional data. Electronic
Journal of Statistics, 7:562–596, 2013.
[34] H. Cardot, C. Goga, and P. Lardin. Variance estimation and asymptotic confidence
bands for the mean estimator of sampled functional data with high entropy unequal
probability sampling designs. Scandinavian Journal of Statistics, 41(2):516–534, 2014.
[39] W. Checkley, L.D. Epstein, R.H. Gilman, L. Cabrera, and R.E. Black. Effects of acute
diarrhea on linear growth in Peruvian children. American Journal of Epidemiology,
157:166–175, 2003.
[40] K. Chen and H.-G. Müller. Conditional quantile analysis when covariates are func-
tions, with application to growth data. Journal of the Royal Statistical Society Series
B: Statistical Methodology, 74(1):67–89, 2012.
[41] K. Chen and H.-G. Müller. Modeling repeated functional observations. Journal of
the American Statistical Association, 107(500):1599–1609, 2012.
[42] S. Chib and E. Greenberg. Understanding the Metropolis-Hastings algorithm. The
American Statistician, 49(4):327–335, 1995.
[43] J.-M. Chiou. Dynamical functional prediction and classification, with application to
traffic flow prediction. The Annals of Applied Statistics, 6:1588–1614, 2016.
[44] J.-M. Chiou, H.-G. Müller, and J.-L. Wang. Functional quasi-likelihood regression
models with smooth random effects. Journal of the Royal Statistical Society. Series
B (Statistical Methodology), 65(2):405–423, 2003.
[45] J.-M. Chiou, H.-G. Müller, and J.-L. Wang. Functional response models. Statistica
Sinica, 14(3):675–693, 2004.
[46] J.-M. Chiou, Y.-F. Yang, and Y.-T. Chen. Multivariate functional linear regression
and prediction. Journal of Multivariate Analysis, 146:301–312, 2016.
[47] D.R. Cox. Regression models and life-tables. Journal of the Royal Statistical Society.
Series B (Methodological), 34(2):187–220, 1972.
[48] C.M. Crainiceanu and A.J. Goldsmith. Bayesian functional data analysis using Win-
BUGS. Journal of Statistical Software, 32(11):1–33, 2010.
[49] C.M. Crainiceanu and D. Ruppert. Likelihood ratio tests in linear mixed models with
one variance component. Journal of the Royal Statistical Society. Series B (Statistical
Methodology), 66(1):165–185, 2004.
[50] C.M. Crainiceanu, D. Ruppert, G. Claeskens, and M.P. Wand. Exact likelihood ratio
tests for penalised splines. Biometrika, 92(1):91–103, 3 2005.
[51] C.M. Crainiceanu, D. Ruppert, and M.P. Wand. Bayesian analysis for penalized spline
regression using WinBUGS. Journal of Statistical Software, 14(14):1–24, 2005.
[52] C.M. Crainiceanu, A.-M. Staicu, and C.Z. Di. Generalized multilevel functional re-
gression. Journal of the American Statistical Association, 104(488):1550–1561, 2009.
[53] C.M. Crainiceanu, A.-M. Staicu, S. Ray, and N.M. Punjabi. Bootstrap-based inference
on the difference in the means of two correlated functional processes. Statistics in
Medicine, 31(26):3223–3240, 2012.
[54] P. Craven and G. Wahba. Smoothing noisy data with spline functions. Numerische
Mathematik, 1:377–403, 1979.
[55] E. Cui. Functional Data Analysis Methods for Large Scale Physical Activity Stud-
ies. PhD thesis, Johns Hopkins University, Baltimore, MD, June 2023. Avail-
able at https://jscholarship.library.jhu.edu/bitstream/handle/1774.2/68330/CUI-
DISSERTATION-2023.pdf?sequence=1&isAllowed=y.
[56] E. Cui, C.M. Crainiceanu, and A. Leroux. Additive functional Cox model. Journal
of Computational and Graphical Statistics, 30(3):780–793, 2021.
[57] E. Cui, A. Leroux, E. Smirnova, and C.M. Crainiceanu. Fast univariate inference for
longitudinal functional models. Journal of Computational and Graphical Statistics,
31(1):219–230, 2022.
[58] E. Cui, R. Li, C.M. Crainiceanu, and L. Xiao. Fast multilevel functional principal
component analysis. Journal of Computational and Graphical Statistics, 32(2):366–
377, 2023.
[59] E. Cui, E.C. Thompson, R.J. Carroll, and D. Ruppert. A semiparametric risk score
for physical activity. Statistics in Medicine, 41(7):1191–1204, 2022.
[60] C. de Boor. A Practical Guide to Splines. Applied Mathematical Sciences. Springer,
2001.
[61] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Method-
ological), 39(1):1–38, 1977.
[62] C.Z. Di, C.M. Crainiceanu, B. Caffo, and N.M. Punjabi. Multilevel functional principal
component analysis. Annals of Applied Statistics, 3(1):458–488, 2009.
[63] C.Z. Di, C.M. Crainiceanu, and W.S. Jank. Multilevel sparse functional principal
component analysis. Stat, 3(1):126–143, 2014.
[64] J. Di, A. Leroux, J. Urbanek, R. Varadhan, A. Spira, and V. Zipunnikov. Patterns
of sedentary and active time accumulation are associated with mortality in US: The
NHANES study. bioRxiv, 08 2017.
[65] K.M. Diaz, V.J. Howard, B. Hutto, N. Colabianchi, J.E. Vena, M.M. Safford, S.N.
Blair, and S.P. Hooker. Patterns of sedentary behavior and mortality in U.S.
middle-aged and older adults: A national cohort study. Annals of Internal Medicine,
167(7):465–475, 2017.
[66] P. Diggle, P. Heagerty, K.Y. Liang, and S. Zeger. Analysis of Longitudinal Data, 2nd
edition. Oxford, England: Oxford University Press, 2002.
[67] J.D. Dixon. Estimating extremal eigenvalues and condition numbers of matrices.
SIAM Journal on Numerical Analysis, 20(4):812–814, 1983.
[68] S. Dray and J. Josse. Principal component analysis with missing values: a comparative
survey of methods. Plant Ecology, 216, 05 2014.
[69] R.M. Dudley. Sample functions of the Gaussian process. The Annals of Probability,
1(1):66–103, 1973.
[70] P.H.C. Eilers, B. Li, and B.D. Marx. Multivariate calibration with single-index signal
regression. Chemometrics and Intelligent Laboratory Systems, 96(2):196–202, 2009.
[71] P.H.C. Eilers and B.D. Marx. Flexible smoothing with B-splines and penalties. Sta-
tistical Science, 11(2):89–121, 1996.
[72] P.H.C. Eilers and B.D. Marx. Generalized linear additive smooth structures. Journal
of Computational and Graphical Statistics, 11(4):758–783, 2002.
[73] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for dis-
covering clusters in large spatial databases with noise. Proceedings of the Second In-
ternational Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI
Press, 11:226–231, 1996.
[74] B. Everitt. Cluster Analysis. London: Heinemann Educational Books, 1974.
[75] W.F. Fadel, J.K. Urbanek, N.W. Glynn, and J. Harezlak. Use of functional linear
models to detect associations between characteristics of walking and continuous re-
sponses using accelerometry data. Sensors (Basel), 20(21):6394, 2020.
[76] J. Fan and I. Gijbels. Local Polynomial Modelling and Its Applications: Monographs
on Statistics and Applied Probability 66, volume 66. CRC Press, 1996.
[77] J. Fan and J.-T. Zhang. Two-step estimation of functional linear models with ap-
plications to longitudinal data. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 62(2):303–322, 2000.
[78] Y. Fan, N. Foutz, G.M. James, and W. Jank. Functional response additive model esti-
mation with online virtual stock markets. The Annals of Applied Statistics, 8(4):2435–
2460, 2014.
[79] J.J. Faraway. Regression analysis for a functional response. Technometrics, 39(3):254–
261, 1997.
[82] F. Ferraty, P. Hall, and P. Vieu. Most-predictive design points for functional data
predictors. Biometrika, 97(4):807–824, 12 2010.
[83] F. Ferraty, A. Laksaci, and P. Vieu. Functional time series prediction via conditional
mode estimation. Comptes Rendus Mathematique, 340(5):389–392, 2005.
[84] F. Ferraty, A. Mas, and P. Vieu. Nonparametric regression on functional data: Infer-
ence and practical aspects. Australian & New Zealand Journal of Statistics, 49(3):267–
286, 2007.
[85] F. Ferraty, J. Park, and P. Vieu. Estimation of a functional single index model. In
Recent Advances in Functional Data Analysis and Related Topics, pages 111–116.
Springer, 2011.
[86] F. Ferraty and P. Vieu. Nonparametric Functional Data Analysis: Theory and Prac-
tice. Springer: New York, NY, USA, 2006.
[87] G. Fitzmaurice, M. Davidian, G. Molenberghs, and G. Verbeke. Longitudinal Data
Analysis. Boca Raton, FL: Chapman & Hall/CRC, 2008.
[88] E.W. Forgy. Cluster analysis of multivariate data: Efficiency vs interpretability of
classifications. Biometrics, 21:768–769, 1965.
[89] D. Fourdrinier, W.E. Strawderman, and M.T. Wells. Estimation of a functional single
index model. In Shrinkage Estimation, pages 127–150. Springer, 2018.
[90] T. Galili. dendextend: An R package for visualizing, adjusting, and comparing trees
of hierarchical clustering. Bioinformatics, 2015.
[91] M. Gaston, T. Leon, and F. Mallor. Functional data analysis for non-homogeneous
Poisson processes. In 2008 Winter Simulation Conference, pages 337–343, 2008.
[92] I. Gaynanova, N. Punjabi, and C.M. Crainiceanu. Modeling continuous glucose mon-
itoring (CGM) data during sleep. Biostatistics, 23(1):223–239, 05 2020.
[93] J.E. Gellar, E. Colantuoni, D.M. Needham, and C.M. Crainiceanu. Variable-domain
functional regression for modeling ICU data. Journal of the American Statistical Association, 109(508):1425–1439, 2014.
[94] J.E. Gellar, E. Colantuoni, D.M. Needham, and C.M. Crainiceanu. Cox regression
models with functional covariates for survival data. Statistical Modelling, 15(3):256–
278, 2015.
[95] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian
restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, PAMI-6(6):721–741, 1984.
[96] A. Genz and F. Bretz. Computation of Multivariate Normal and t Probabilities. Lec-
ture Notes in Statistics. Springer-Verlag, Heidelberg, 2009.
[97] A. Genz, F. Bretz, T. Miwa, X. Mi, F. Leisch, F. Scheipl, and T. Hothorn. mvtnorm:
Multivariate Normal and t Distributions, 2021. R package version 1.1-3.
[100] J. Gertheiss, V. Maier, E.F. Hessel, and A.-M. Staicu. Marginal functional regression
models for analyzing the feeding behavior of pigs. Journal of Agricultural, Biological,
and Environmental Statistics, 20:353–370, 2015.
[101] Y. Goldberg, Y. Ritov, and A. Mandelbaum. Predicting the continuation of a function
with applications to call center data. Journal of Statistical Planning and Inference,
147:53–65, 2014.
[102] A.J. Goldsmith, J. Bobb, C.M. Crainiceanu, B. Caffo, and D. Reich. Penalized func-
tional regression. Journal of Computational and Graphical Statistics, 20(4):830–851,
2011.
[103] A.J. Goldsmith, C.M. Crainiceanu, B. Caffo, and D. Reich. Longitudinal penalized
functional regression for cognitive outcomes on neuronal tract measurements. Journal
of the Royal Statistical Society: Series C (Applied Statistics), 61(3):453–469, 2012.
[104] A.J. Goldsmith, S. Greven, and C.M. Crainiceanu. Corrected confidence bands for
functional data using principal components. Biometrics, 69(1):41–51, 2013.
[105] A.J. Goldsmith, F. Scheipl, L. Huang, J. Wrobel, C. Di, J. Gellar, J. Harezlak, M.W.
McLean, B. Swihart, L. Xiao, C.M. Crainiceanu, and P.T. Reiss. refund: Regression
with Functional Data, 2020. R package version 0.1-23.
[109] S. Greven, C.M. Crainiceanu, B. Caffo, and D.S. Reich. Longitudinal functional prin-
cipal component analysis. Electronic Journal of Statistics, pages 1022–1054, 2010.
[110] S. Greven and F. Scheipl. A general framework for functional regression modelling.
Statistical Modelling, 17:1–35, 2017.
[111] Z. Gu, R. Eils, and M. Schlesner. Complex heatmaps reveal patterns and correlations
in multidimensional genomic data. Bioinformatics, 2016.
[112] Z. Gu, L. Gu, R. Eils, M. Schlesner, and B. Brors. circlize implements and enhances
circular visualization in R. Bioinformatics, 30:2811–2812, 2014.
[113] J.E. Gubernatis. Marshall Rosenbluth and the Metropolis Algorithm. Physics of
Plasmas, 12(5):57303, 2005.
[132] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice-Hall, 1988.
[133] P. Jain, C. Jin, S.M. Kakade, P. Netrapalli, and A. Sidford. Streaming PCA: Matching
matrix Bernstein and near-optimal finite sample guarantees for Oja’s algorithm. In
V. Feldman, A. Rakhlin, and O. Shamir, editors, 29th Annual Conference on Learning
Theory, volume 49 of Proceedings of Machine Learning Research, pages 1147–1164,
Columbia University, New York, New York, USA, 23–26 Jun 2016.
[134] G. James, T. Hastie, and C. Sugar. Principal component models for sparse functional
data. Biometrika, 87(3):587–602, 2000.
[135] G.M James, J. Wang, and J. Zhu. Functional linear regression that’s interpretable.
Annals of Statistics, 37(5A):2083–2108, 2009.
[136] B.J. Jefferis, T.J. Parsons, C. Sartini, S. Ash, L.T. Lennon, O. Papacosta, R.W.
Morris, S.G. Wannamethee, I.-M. Lee, and P.H. Whincup. Objectively measured
physical activity, sedentary behaviour and all-cause mortality in older men: Does
volume of activity matter more than pattern of accumulation? British Journal of
Sports Medicine, 53(16):1013–1020, 2019.
[137] H. Jiang and N. Serban. Clustering random curves under spatial interdependence
with application to service accessibility. Technometrics, 54(2):108–119, 2012.
[138] D. John, Q. Tang, F. Albinali, and S.S. Intille. A monitor-independent movement
summary to harmonize accelerometer data processing. Human Kinetics Journal,
2(4):268–281, 2018.
[139] I.T. Jolliffe. A note on the use of principal components in regression. Journal of the
Royal Statistical Society, Series C, 31(3):300–303, 1982.
[142] M. Karas, J. Muschelli, A. Leroux, J.K. Urbanek, A.A. Wanigatunga, J. Bai, C.M.
Crainiceanu, and J.A. Schrack. Comparison of accelerometry-based measures of phys-
ical activity: Retrospective observational data analysis study. JMIR Mhealth Uhealth,
10(7):e38077, Jul 2022.
[143] K. Karhunen. Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Annals of
the Academy of Science Fennicae. Series A. I. Mathematics-Physics, 37:1–79, 1947.
[144] R.A. Kaslow, D.G. Ostrow, R. Detels, J.P. Phair, B.F. Polk, and C.R. Rinaldo Jr. The
multicenter AIDS cohort study: Rationale, organization, and selected characteristics
of the participants. American Journal of Epidemiology, 126(2):310–318, 1987.
[145] R.E. Kass and V. Ventura. A spike train probability model. Neural Computation,
13:1713–1720, 2001.
[146] L. Kaufman and P.J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster
Analysis. New York: Wiley, 1990.
[147] R.C. Kelly and R.E. Kass. A framework for evaluating pairwise and multiway syn-
chrony among stimulus-driven neurons. Neural Computation, 24:2007–2032, 2012.
[148] J.S. Kim, A.-M. Staicu, A. Maity, R.J. Carroll, and D. Ruppert. Additive function-on-
function regression. Journal of Computational and Graphical Statistics, 27(1):234–244,
2018.
[149] K. Kim, D. Sentürk, and R. Li. Recent history functional linear models for sparse
longitudinal data. Journal of Statistical Planning and Inference, 141(4):1554–1566,
2011.
[150] G.S. Kimeldorf and G. Wahba. A correspondence between bayesian estimation on
stochastic processes and smoothing by splines. The Annals of Mathematical Statistics,
41(2):495–502, 1970.
[151] J.P Klein and M.L. Moeschberger. Survival Analysis: Techniques for Censored and
Truncated Data (2nd ed.). Springer, 2003.
[152] G.G. Koch. Some further remarks concerning “A general approach to the estimation
of variance components”. Technometrics, 10(3):551–558, 1968.
[161] N.M. Laird and J.H. Ware. Random-effects models for longitudinal data. Biometrics,
38(4):963–974, 1982.
[163] H. Lennon, M. Sperrin, E. Badrick, and A.G. Renehan. The obesity paradox in cancer:
A review. Current Oncology Reports, 18(9):56, 2016.
[179] Y. Li, N. Wang, and R.J. Carroll. Generalized functional linear models with semi-
parametric single-index interactions. Journal of the American Statistical Association,
105(490):621–633, 2010.
[180] Z. Li and S.N. Wood. Faster model matrix crossproducts for large generalized linear
models with discretized covariates. Statistics and Computing, 30:19–25, 2020.
[181] M.A. Lindquist. The statistical analysis of fMRI data. Statistical Science, 23(4):439–
464, 2008.
[182] M.A. Lindquist. Functional causal mediation analysis with an application to brain
connectivity. Journal of the American Statistical Association, 107(500):1297–1309, 2012.
[183] S.P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:128–137, 1982 (originally a 1957 Bell Laboratories technical note).
[184] M. Loève. Probability Theory, Vol. II, 4th ed. Graduate Texts in Mathematics.
Springer-Verlag, 1978.
[191] E.J. Malloy, J.S. Morris, S.D. Adar, H. Suh, D.R. Gold, and B.A. Coull. Wavelet-
based functional linear mixed models: An application to measurement error–corrected
distributed lag models. Biostatistics, 11(3):432–452, 2010.
[192] B.D. Marx and P.H.C. Eilers. Generalized linear regression on sampled signals and
curves: A p-spline approach. Technometrics, 41(1):1–13, 1999.
[193] B.D. Marx and P.H.C. Eilers. Multidimensional penalized signal regression. Techno-
metrics, 47(1):13–22, 2005.
[194] M. Matabuena, A. Petersen, J.C. Vidal, and F. Gude. Glucodensities: A new repre-
sentation of glucose profiles using distributional data analysis. Statistical Methods in
Medical Research, 30(6):1445–1464, 2021.
[195] H. Matsui, S. Kawano, and S. Konishi. Regularized functional regression modeling for
functional response and predictors. Journal of Math-for-Industry, 1:17–25, 06 2013.
[196] C.E. McCulloch, S.R. Searle, and J.M. Neuhaus. Generalized, Linear, and Mixed
Models, 2nd edition. New York: Wiley, 2008.
[197] G.J. McLachlan and D. Peel. Finite Mixture Models. New York: Wiley, 2000.
[198] M.W. McLean, G. Hooker, A.-M. Staicu, F. Scheipl, and D. Ruppert. Functional
generalized additive models. Journal of Computational and Graphical Statistics,
23(1):249–269, 2014.
[199] J. Mercer. Functions of positive and negative type and their connection with the
theory of integral equations. Philosophical Transactions of the Royal Society, 209:415–446, 1909.
[200] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. Equa-
tion of state calculations by fast computing machines. Journal of Chemical Physics,
21(6):1087–1092, 1953.
[201] M. Meyer, B. Coull, F. Versace, P. Cinciripini, and J. Morris. Bayesian function-on-
function regression for multilevel functional data. Biometrics, 71:563–574, 03 2015.
[202] B. Mirkin. Mathematical Classification and Clustering. Kluwer Academic Publishers,
1996.
[203] I. Mitliagkas, C. Caramanis, and P. Jain. Memory limited, streaming pca. In C.J.C.
Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Ad-
vances in Neural Information Processing Systems, volume 26. Curran Associates, Inc.,
2013.
[204] J.S. Morris. Statistical methods for proteomic biomarker discovery based on feature
extraction or functional modeling approaches. Statistics and Its Interface, 5 1:117–
135, 2012.
[205] J.S. Morris. Functional regression. Annual Review of Statistics and Its Application,
2(1):321–359, 2015.
[206] J.S. Morris, P.J. Brown, R.C. Herrick, K.A. Baggerly, and K.R. Coombes. Bayesian
analysis of mass spectrometry proteomic data using wavelet-based functional mixed
models. Biometrics, 64(2):479–489, 2008.
[207] J.S. Morris and R.J. Carroll. Wavelet-based functional mixed models. Journal of the
Royal Statistical Society: Series B (Statistical Methodology), 68(2):179–199, 2006.
[208] J.S. Morris, M. Vannucci, P.J. Brown, and R.J. Carroll. Wavelet-based nonparametric
modeling of hierarchical functions in colon carcinogenesis. Journal of the American
Statistical Association, 98(463):573–583, 2003.
[209] F. Mosteller and J.W. Tukey. Data analysis, including statistics. In G. Lindzey and
E. Aronson, editors, Handbook of Social Psychology, Vol. 2. Addison-Wesley, 1968.
[210] H.-G. Müller, Y. Wu, and F. Yao. Continuously additive models for nonlinear func-
tional regression. Biometrika, 100(3):607–622, 2013.
[228] H.D. Patterson and R. Thompson. Recovery of inter-block information when block
sizes are unequal. Biometrika, 58(3):545–554, 1971.
[229] K. Pearson. On lines and planes of closest fit to systems of points in space. Philo-
sophical Magazine, 2(11):559–572, 1901.
[230] J. Pinheiro, D. Bates, S. DebRoy, D. Sarkar, and R Core Team. nlme: Linear and
Nonlinear Mixed Effects Models, 2020. R package version 3.1-149.
[231] J. Pinheiro and D.M. Bates. Mixed-effects models in S and S-PLUS. Statistics and
Computing. Springer New York, NY, USA, 2006.
[232] M. Plummer. JAGS: A program for analysis of Bayesian graphical models using Gibbs
sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical
Computing, page 1–10, 2003.
[233] C. Preda. Regression models for functional data by reproducing kernel Hilbert spaces
methods. Journal of Statistical Planning and Inference, 137(3):829–840, 2007.
[234] C. Preda and J. Schiltz. Functional PLS regression with functional response: The
basis expansion approach. In Proceedings of the 14th Applied Stochastic Models and
Data Analysis Conference. Università di Roma La Sapienza, pages 1126–1133, 2011.
[235] A. Prelić, S. Bleuler, P. Zimmermann, A. Wille, P. Bühlmann, W. Gruissem, L. Hen-
nig, L. Thiele, and E. Zitzler. A systematic comparison and evaluation of biclustering
methods for gene expression data. Bioinformatics, 22(9):1122–1129, 2006.
[236] N. Pya. scam: Shape Constrained Additive Models, 2021. R package version 1.2-12.
[237] X. Qi and R. Luo. Function-on-function regression with thousands of predictive curves.
Journal of Multivariate Analysis, 163:51–66, 2018.
[238] S. Qu, J.-L. Wang, and X. Wang. Optimal estimation for the functional Cox model.
The Annals of Statistics, 44(4):1708–1738, 2016.
[239] S.F. Quan, B.V. Howard, C. Iber, J.P. Kiley, F.J. Nieto, G.T. O’Connor, D.M.
Rapoport, S. Redline, J. Robbins, J.M. Samet, and P.W. Wahl. The Sleep Heart
Health Study: Design, rationale, and methods. Sleep, 20(12):1077–1085, 1997.
[240] R Core Team. R: A Language and Environment for Statistical Computing. R Foun-
dation for Statistical Computing, Vienna, Austria, 2020.
[241] J.O. Ramsay and C.J. Dalzell. Some tools for functional data analysis. Journal of the
Royal Statistical Society: Series B (Methodological), 53(3):539–561, 1991.
[242] J.O. Ramsay, G. Hooker, and S. Graves. Functional data analysis with R and MAT-
LAB. Springer New York, NY, USA, 2009.
[243] J.O. Ramsay, K.G. Munhall, V.L. Gracco, and D.J. Ostry. Functional data analyses
of lip motion. Journal of the Acoustical Society of America, 99(6):3718–3727, 1996.
[244] J.O. Ramsay and B.W. Silverman. Functional Data Analysis. Springer New York,
NY, USA, 1997.
[245] J.O. Ramsay and B.W. Silverman. Functional Data Analysis. Springer New York,
NY, USA, 2005.
[246] J.O. Ramsay, H. Wickham, S. Graves, and G. Hooker. FDA: Functional Data Analysis,
2014.
[247] C.R. Rao. Some statistical methods for comparison of growth curves. Biometrics,
14(1):1–17, 1958.
[248] S.J. Ratcliffe, G.Z. Heller, and L.R. Leader. Functional data analysis with application
to periodically stimulated foetal heart rate data. II: Functional logistic regression.
Statistics in Medicine, 21(8):1115–1127, 2002.
[258] D. Ruppert, M.P. Wand, and R.J. Carroll. Semiparametric Regression. Cambridge
Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2003.
[259] P.F. Saint-Maurice, R.P. Troiano, D. Berrigan, W.E. Kraus, and C.E. Matthews.
Volume of light versus moderate-to-vigorous physical activity: Similar benefits for
all-cause mortality? Journal of the American Heart Association, 7(7), 2018.
[260] F. Scheipl, J. Gertheiss, and S. Greven. Generalized functional additive mixed models.
Electronic Journal of Statistics, 10(1):1455–1492, 2016.
[261] F. Scheipl, A.J. Goldsmith, and J. Wrobel. tidyfun: Tools for Tidy Functional Data,
2022. https://github.com/tidyfun/tidyfun, https://tidyfun.github.io/tidyfun/.
[277] R.R. Sokal and C.D. Michener. A statistical method for evaluating systematic rela-
tionships. University of Kansas Science Bulletin, 38, 1958.
[278] A.-M. Staicu, M.N. Islam, R. Dumitru, and E. van Heugten. Longitudinal Dynamic
Functional Regression. Journal of the Royal Statistical Society Series C: Applied
Statistics, 69(1):25–46, 09 2019.
[279] A.M. Staicu, C.M. Crainiceanu, and R.J. Carroll. Fast methods for spatially correlated
multilevel functional data. Biostatistics, 11(2):177–194, April 2010.
[280] A.M. Staicu, C.M. Crainiceanu, D.S. Reich, and D. Ruppert. Modeling functional
data with spatially heterogeneous shape characteristics. Biometrics, 68(2):331–343,
2018.
[281] Stan Development Team. The Stan Core Library, 2018. Version 2.18.0.
[282] Stan Development Team. RStan: The R interface to Stan, 2022. R package version
2.21.7.
[283] J. Staniswalis and J. Lee. Nonparametric regression analysis of longitudinal data.
Journal of the American Statistical Association, 93(444):1403–1418, 1998.
[284] J.L. Stone and A.H. Norris. Activities and attitudes of participants in the Baltimore
Longitudinal Study. The Journals of Gerontology, 21:575–580, 1966.
[285] J. Sun, D. Tao, S. Papadimitriou, P.S. Yu, and C. Faloutsos. Incremental tensor
analysis: Theory and applications. ACM Transactions on Knowledge Discovery from
Data, 2(3), October 2008.
[286] B.J. Swihart, B. Caffo, B.D. James, M. Strand, B.S. Schwartz, and N.M. Punjabi.
Lasagna plots: A saucy alternative to spaghetti plots. Epidemiology, 21(5):621–625,
2010.
[287] B.J. Swihart, A.J. Goldsmith, and C.M. Crainiceanu. Restricted likelihood ratio tests
for functional effects in the functional linear model. Technometrics, 56(4):483–493,
2014.
[288] B.J. Swihart, N.M. Punjabi, and C.M. Crainiceanu. Modeling sleep fragmentation in
sleep hypnograms: An instance of fast, scalable discrete-state, discrete-time analyses.
Computational Statistics and Data Analysis, 89:1–11, 2015.
[289] M. Taylor, M. Losch, M. Wenzel, and J. Schröter. On the sensitivity of field re-
construction and prediction using empirical orthogonal functions derived from gappy
data. Journal of Climate, 07 2013.
[290] C.D. Tekwe, R.S. Zoh, F.W. Bazer, G. Wu, and R.J. Carroll. Functional multiple
indicators, multiple causes measurement error models. Biometrics, 74(1):127–134,
2018.
[291] C.D. Tekwe, R.S. Zoh, M. Yang, R.J. Carroll, G. Honvoh, D.B. Allison, M. Ben-
den, and L. Xue. Instrumental variable approach to estimating the scalar-on-function
regression model with measurement error with application to energy expenditure as-
sessment in childhood obesity. Statistics in medicine, 38(20):3764–3781, 2019.
[292] O. Theou, J.M. Blodgett, J. Godin, and K. Rockwood. Association between sedentary
time and mortality across levels of frailty. Canadian Medical Association Journal,
189(33):E1056–E1064, 2017.
[293] T.M. Therneau. A Package for Survival Analysis in R, 2020. R package version 3.2-7.
[294] A.N. Tikhonov. Solution of incorrectly formulated problems and the regularization
method. Soviet Mathematics Doklady, 1963.
[295] T.M. Therneau and P.M. Grambsch. Modeling Survival Data: Extending the Cox
Model. Springer, New York, 2000.
[296] R.P. Troiano, D. Berrigan, K.W. Dodd, L.C. Mâsse, T. Tilert, and M. McDowell.
Physical activity in the United States measured by accelerometer. Medicine & Science
in Sports & Exercise, 40(1):181–188, 2008.
[297] A.A. Tsiatis. A large sample study of Cox’s regression model. The Annals of Statistics,
9(1):93–108, 1981.
[298] G. Tutz and J. Gertheiss. Feature extraction in signal regression: A boosting technique
for functional data regression. Journal of Computational and Graphical Statistics,
19(1):154–174, 2010.
[299] S. Ullah and C.F. Finch. Applications of functional data analysis: A systematic review.
BMC Medical Research Methodology, 13(43), 2013.
[300] M. Valderrama, F. Ocaña, A. Aguilera, and F. Ocaña-Peinado. Forecasting pollen
concentration by a two-step functional model. Biometrics, 66:578–585, 08 2010.
[301] G. Verbeke and G. Molenberghs. Linear Mixed Models for Longitudinal Data.
Springer, New York, 2000.
[302] A. Volkmann, A. Stöcker, F. Scheipl, and S. Greven. Multivariate functional additive
mixed models. Statistical Modelling, 0(0):1471082X211056158, 2023.
[303] G. Wahba. Bayesian “Confidence Intervals” for the Cross-Validated Smoothing Spline.
Journal of the Royal Statistical Society: Series B, 45(1):133–150, 1983.
[304] G. Wahba. Spline Models for Observational Data (CBMS-NSF Regional Conference
Series in Applied Mathematics, Series Number 59). Society for Industrial and Applied
Mathematics, 1990.
[305] G. Wahba, Y. Wang, C. Gu, R. Klein, and B. Klein. Smoothing spline ANOVA
for exponential families, with application to the Wisconsin Epidemiological Study of
Diabetic Retinopathy: The 1994 Neyman Memorial Lecture. The Annals of Statistics,
23(6):1865–1895, 1995.
[306] M. Wand and J. Ormerod. On O’Sullivan penalised splines and semiparametric re-
gression. Australian & New Zealand Journal of Statistics, 50:179–198, 06 2008.
[307] J.-L. Wang, J.-M. Chiou, and H.-G. Müller. Functional data analysis. Annual Review
of Statistics and Its Application, 3(1):257–295, 2016.
[308] W. Wang. Linear mixed function-on-function regression models. Biometrics,
70(4):794–801, 2014.
[309] X. Wang, S. Ray, and B.K. Mallick. Bayesian curve classification using wavelets.
Journal of the American Statistical Association, 102(479):962–973, 2007.
[310] J.H. Ward. Hierarchical grouping to optimize an objective function. Journal of the
American Statistical Association, 58(301):236–244, 1963.
[311] H. Wickham and G. Grolemund. R for Data Science: Import, Tidy, Transform, Vi-
sualize, and Model Data. O’Reilly Media, 1 edition, 2017.
[312] R.K.W. Wong and T.C.M. Lee. Matrix completion with noisy entries and outliers.
Journal of Machine Learning Research, 18(147):1–25, 2017.
[313] S.N. Wood. Thin plate regression splines. Journal of the Royal Statistical Society:
Series B (Statistical Methodology), 65(1):95–114, 2003.
[314] S.N. Wood. Stable and efficient multiple smoothing parameter estimation for gener-
alized additive models. Journal of the American Statistical Association, 99(467):673–
686, 2004.
[315] S.N. Wood. Generalized Additive Models: An Introduction with R. Chapman and
Hall/CRC, 2006.
[316] S.N. Wood. On confidence intervals for generalized additive models based on penalized
regression splines. Australian & New Zealand Journal of Statistics, 48(4):445–464,
2006.
[317] S.N. Wood. Fast stable restricted maximum likelihood and marginal likelihood esti-
mation of semiparametric generalized linear models. Journal of the Royal Statistical
Society, Series B, 73(1):3–36, 2011.
[318] S.N. Wood. On p-values for smooth components of an extended generalized additive
model. Biometrika, 100(1):221–228, 10 2012.
[319] S.N. Wood. Generalized Additive Models: An Introduction with R. Second Edition.
Chapman and Hall/CRC, 2017.
[320] S.N. Wood, Y. Goude, and S. Shaw. Generalized additive models for large datasets.
Journal of the Royal Statistical Society, Series C, 64(1):139–155, 2015.
[321] S.N. Wood, Z. Li, G. Shaddick, and N.H. Augustin. Generalized additive models for
gigadata: modelling the UK black smoke network daily data. Journal of the American
Statistical Association, 112(519):1199–1210, 2017.
[322] S.N. Wood, N. Pya, and B. Safken. Smoothing parameter and model selection for gen-
eral smooth models (with discussion). Journal of the American Statistical Association,
111:1548–1575, 2016.
[323] J. Wrobel. register: Registration for exponential family functional data. Journal of
Open Source Software, 3(22):557, 2018.
[324] J. Wrobel. fastGFPCA: Fast generalized principal components analysis, 2023.
https://github.com/julia-wrobel/fastGFPCA.
[326] Y. Wu, J. Fan, and H.G. Müller. Varying-coefficient functional linear regression.
Bernoulli, 16(3):730–758, 2010.
[327] L. Xiao, L. Huang, J.A. Schrack, L. Ferrucci, V. Zipunnikov, and C.M. Crainiceanu.
Quantifying the lifetime circadian rhythm of physical activity: A covariate-dependent
functional approach. Biostatistics, 16(2):352–367, 2015.
[328] L. Xiao, C. Li, W. Checkley, and C.M. Crainiceanu. Fast covariance estimation for
sparse functional data. Statistics and Computing, 28(3):511–522, 2018.
[329] L. Xiao, C. Li, W. Checkley, and C.M. Crainiceanu. face: Fast Covariance Estimation
for Sparse Functional Data, 2021. R package version 0.1-6.
[330] L. Xiao, Y. Li, and D. Ruppert. Fast bivariate P-splines: The sandwich smoother.
Journal of the Royal Statistical Society, Series B (Statistical Methodology), 75(3):577–599,
2013.
[331] L. Xiao, V. Zipunnikov, D. Ruppert, and C.M. Crainiceanu. Fast covariance estima-
tion for high-dimensional functional data. Statistics and Computing, 26(1):409–421,
2016.
[332] M. Xu and P.T. Reiss. Distribution-free pointwise adjusted p-values for functional
hypotheses. In G. Aneiros, I. Horová, M. Hušková, and P. Vieu, editors, Functional
and High-Dimensional Statistics and Related Fields, pages 245–252. Springer, 2020.
[333] Y. Xu, Y. Li, and D. Nettleton. Nested hierarchical functional data modeling and
inference for the analysis of functional plant phenotypes. Journal of the American
Statistical Association, 113(522):593–606, 2018.
[334] F. Yao, H.-G. Müller, and J.-L. Wang. Functional data analysis for sparse longitudinal
data. Journal of the American Statistical Association, 100(470):577–590, 2005.
[335] F. Yao, H.-G. Müller, and J.-L. Wang. Functional linear regression analysis for lon-
gitudinal data. Annals of Statistics, 33(6):2873–2903, 2005.
[336] F. Yao, H.G. Müller, A.J. Clifford, S.R. Dueker, J. Follett, Y. Lin, B.A. Buchholz,
and J.S. Vogel. Shrinkage estimation for functional principal component scores with
application to the population kinetics of plasma folate. Biometrics, 59(3):676–685,
2003.
[337] Y. Yuan, J.H. Gilmore, X. Geng, S. Martin, K. Chen, J.-L. Wang, and H. Zhu. FMEM:
Functional mixed effects modeling for the analysis of longitudinal white matter tract
data. NeuroImage, 84:753–764, 2014.
[338] D. Zhang, X. Lin, and M. Sowers. Two-stage functional mixed models for evaluating
the effect of longitudinal covariate profiles on a scalar outcome. Biometrics, 63(2):351–
362, 2007.
[339] Y. Zhang and Y. Wu. Robust hypothesis testing in functional linear models. Journal
of Statistical Computation and Simulation, 0(0):1–19, 2023.
[340] Y. Zhao, R.T. Ogden, and P.T. Reiss. Wavelet-based lasso in functional linear regres-
sion. Journal of Computational and Graphical Statistics, 21(3):600–617, 2012.
[341] X. Zhou, J. Wrobel, C.M. Crainiceanu, and A. Leroux. Generalized multilevel func-
tional principal component analysis. Under Review, 2023.
[342] H. Zhu, P.J. Brown, and J.S. Morris. Robust, adaptive functional regression in func-
tional mixed model framework. Journal of the American Statistical Association,
106(495):1167–1179, 2011.
[343] H. Zhu, P.J. Brown, and J.S. Morris. Robust classification of functional and quanti-
tative image data using functional mixed models. Biometrics, 68(4):1260–1268, 2012.
[344] H. Zhu, K. Chen, X. Luo, Y. Yuan, and J.-L. Wang. FMEM: Functional mixed effects
models for longitudinal functional responses. Statistica Sinica, 29(4):2007, 2019.
Index
    scatter plots 46
    survival analysis 213ff
Natural cubic splines 38
NCHS (National Center for Health Statistics) see National Center for Health Statistics (NCHS)
NDI (National Death Index) see National Death Index (NDI)
Neighborhood size effects 80
nhanes mean alive 45
NHANES see National Health and Nutrition Examination Survey (NHANES)
nhanes boot fui 172
Nhanes fda with r wide format data structure 21
nlme xii, 86
Noise component consideration 253
Noise contamination 55
Noise in NHANES analysis 75–6
Noise variables assumption 1
Noisy data and FPCA smoothing and clustering 285–7
Non-Gaussian functional data 76ff
    FPCA-like methods xiii
    outcomes 113
Non-parametric bootstrap 204ff
Nonparametric bootstrap 51
Nonparametric modeling development xi
Nonparametric smoothing 36–7
Non-penalized approaches fit 129
Non-spline bases 38
Normal approximation 141ff
Normalized functional data 25
Notation conventions 23–4
nrow(dati) 94

O
Obesity paradox 219
Observed function properties 2
Observed functions 1
OLS see Ordinary least squares (OLS)
OPTICS 268
optim 58–60
Ordering property definition 2
Ordinary least squares (OLS) estimation 153ff
Orthonormal covariates 66
Overfitting
    avoidance 67
    smoothing penalty to avoid 259

P
PA see Physical activity (PA) data in NHANES
PACE 180
pairwise.complete.obs 58–60, 73–4
Parametric bootstrap approximation 137ff
    analytic solution 206ff, 229–31
Parametric form 37
Parametric model saturated vs. parsimonious 41
PAXFLGSM 3
PAXPREDM 3
PCA see Principal component analysis (PCA)
PCs see Principal components (PCs)
Penalized cyclic cubic regression spline 45
Penalized estimation 39
Penalized functional Cox model 220ff
Penalized functional Cox regression xiii
Penalized functional regression (PFR) application 262ff
Penalized function-on-function framework 175
Penalized log likelihood 129
Penalized models treated as mixed effects models xiii
Penalized smoothing 20
Penalized spline estimation 113ff
    of FoFR 178ff
Penalized splines 40–3
    B-splines characteristics xi
Penalized spline smoothing in NHANES data 44–7
Penalized sum of squares 41
Percent variance explained (PVE) 249
Permutation F-tests 53
Peruvian child growth study 15ff
pffr xii
    in CONTENT child growth study 190ff
    features of 187ff
    in refund package 166ff, 181ff
pfr xii, 124, 127ff
    limitation of 138
    penalized models 129
PFR see Penalized functional regression (PFR) application
Phonetic analysis FoSR application 143
Physical activity (PA) data in NHANES 44–50, 74ff
Physical activity analysis 101
Physical activity as intervention target 219