Norou Diawara, Editor

Modern Statistical Methods for Spatial and Multivariate Data
STEAM-H: Science, Technology, Engineering, Agriculture, Mathematics & Health
Series Editor
Bourama Toni
Department of Mathematics
Howard University
Washington, DC, USA
This interdisciplinary series highlights the wealth of recent advances in the pure
and applied sciences made by researchers collaborating between fields where
mathematics is a core focus. As we continue to make fundamental advances in
various scientific disciplines, the most powerful applications will increasingly be
revealed by an interdisciplinary approach. This series serves as a catalyst for these
researchers to develop novel applications of, and approaches to, the mathematical
sciences. As such, we expect this series to become a national and international
reference in STEAM-H education and research.
Interdisciplinary by design, the series focuses largely on scientists and mathematicians developing novel methodologies and research techniques that have benefits beyond a single community. This approach seeks to connect researchers from across the globe, united in the common language of the mathematical sciences. Thus, volumes in this series are suitable for both students and researchers in a variety of interdisciplinary fields, such as: mathematics as it applies to engineering; physical chemistry and material sciences; environmental, health, behavioral and life sciences; nanotechnology and robotics; computational and data sciences; signal/image processing and machine learning; finance, economics, operations research, and game theory.
The series originated from the weekly yearlong STEAM-H Lecture series
at Virginia State University featuring world-class experts in a dynamic forum.
Contributions reflected the most recent advances in scientific knowledge and were
delivered in a standardized, self-contained and pedagogically-oriented manner to a
multidisciplinary audience of faculty and students with the objective of fostering
student interest and participation in the STEAM-H disciplines as well as fostering
interdisciplinary collaborative research. The series strongly advocates multidisciplinary collaboration with the goal to generate new interdisciplinary holistic approaches, instruments and models, including new knowledge, and to transcend scientific boundaries.
Editor
Norou Diawara
Department of Mathematics and Statistics
Old Dominion University
Norfolk, VA, USA
This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Statistical ideas and concepts have increasing impacts at many levels. From H.G.
Wells’ 1903 book Mankind in the Making, a quote paraphrased by Sam Wilks in his
1950 American Statistical Association speech states: “Statistical thinking will one
day be as necessary for efficient citizenship as the ability to read and write.” Current
disciplinary boundaries encourage interaction between scientists and the sharing
of information and educational resources. Researchers from these interdisciplinary
fields will find this book an important resource for the latest statistical methods for
spatial and multivariate data.
Given the increasingly complex data world we live in, this volume takes a unique approach with respect to methodology, simulation, and analysis. The environment provides the perfect setting for exciting opportunities in interdisciplinary research and for practical and robust solutions, contributing to the science community at large. The National Institutes of Health and the Howard Hughes
Medical Institute have strongly recommended that undergraduate biology education
should incorporate mathematics and statistics, physics, chemistry, computer science,
and engineering until “interdisciplinary thinking and work become second nature.”
In that sense, this volume is playing an ever more important role in the physical and
life sciences, engineering and technology, data sciences, and artificial intelligence,
blurring the boundaries between scientific disciplines.
The shared emphasis of these carefully selected and refereed contributed chapters
is on important methods, research directions, and applications of analysis, including within and beyond mathematics and statistics. As such, the volume promotes statistical sciences and their applications to physical, life, and data sciences. Statistical methods for spatial and multivariate data have indeed gained tremendous interest over the last decades and are rapidly expanding. This book features recent advances
in statistics to include the spatio-temporal aspects, classification techniques, the
multivariate outcomes with zero and doubly inflated data, the copula distributions,
the wavelet kernels for support matrix machines, and feasible algorithmic solutions.
Events are sometimes affected by a set of covariates accounted for across spatial locations and times. With the influx of big data, statistical tools are identified, tested, and
improved to fill in the gaps sometimes found in the environmental, financial, and
healthcare fields.
This volume stretches our boundaries of knowledge for this fascinating and
ongoing area of research. It features the following chapters:
The chapter “Functional Form of Markovian Attribute-Level Best-Worst Discrete Choice Modelling” by Amanda Working, Mohammed Alqawba, and Norou Diawara provides models for discrete choice experiments. The challenging parts can be linked to the large number of covariates, issues with reliability, and the fact that consumer behavior is a forward-evolving activity. By extending the idea of a stationary process, the authors present a dynamic model with evaluation under random utility analysis. The simulated and aggregated data examples show the flexibility and wide applications of the proposed techniques.
The chapter “Spatial and Spatio-temporal Analysis of Precipitation Data from South Carolina” by David Hitchcock, Haigang Liu, and S. Zahra Samadi presents both spatial and spatio-temporal models for rainfall in South Carolina during a period that includes one of the most destructive storms in state history. The proposed models make it possible to identify several covariates that affect the rainfall and to interpret their effects.
The chapter “A Sparse Areal Mixed Model for Multivariate Outcomes, with an
Application to Zero-Inflated Census Data” from Donald Musgrove, Derek S. Young,
John Hughes, and Lynn E. Eberly describes the multivariate sparse areal mixed
model (MSAMM) as an alternative to the multivariate conditional autoregressive
(MCAR) models. The MSAMM is capable of providing superior fit relative to
models provided under independent or univariate assumptions.
The next chapter, “Wavelet Kernels for Support Matrix Machines” by Edgard M. Maboudou-Tchao, extends support vector machine techniques to support matrix machines (SMM), a matrix-based method that accepts matrices as input, and proposes new wavelet kernels for SMM. Such techniques are very powerful approximations for nonstationary signals.
In the chapter “Properties of the Number of Iterations of a Feasible Solutions Algorithm,” the authors Sarah A. Janse and Katherine L. Thompson provide statistical guidance on the number of iterations by deriving a lower bound on the probability of obtaining the statistically optimal model in a given number of iterations of the algorithm, along with an assessment of the bound's performance.
Classification techniques are commonly used by scientists and businesses alike
for decision-making. They involve assignment of objects (or information) to pre-
defined groups (or classes) using certain known characteristics such as classifying
emails as real or spam using information in the subject field. In the chapter “A
Primer of Statistical Methods for Classification,” the authors Rajarshi Dey and
Madhuri S. Mulekar describe two soft and four hard classifiers popularly used by
statisticians in practice. To demonstrate their applications, two simulated and three
real-life datasets are used to develop classification criteria. The results of different
classifiers are compared using misclassification rate and an uncertainty measure.
In the chapter entitled “A Doubly-Inflated Poisson Distribution and Regression Model” by Manasi Sheth-Chandra, N. Rao Chaganty, and R. T. Sabo, a doubly inflated Poisson distribution is presented to account for count inflation at some value k in addition to that seen at zero; it is also incorporated into the generalized linear models framework to account for associations with covariates.
The chapter “Multivariate Doubly Inflated Negative Binomial Distribution Using
Gaussian Copula” by Joseph Mathews, Sumen Sen, and Ishapathik Das presents a
model for doubly inflated count data using the negative binomial distribution, under
Gaussian copula methods. The authors also provide visuals of the bivariate doubly
inflated negative binomial model.
Moran's Index is a statistic that measures spatial autocorrelation, quantifying the degree of dispersion (or spread) of components in some location or area.
Recognizing that a single Moran’s statistic may not give a sufficient summary of the
spatial autocorrelation measure, local spatial statistics have been gaining popularity.
Accordingly, the chapter “Quantifying Spatio-temporal Characteristics via Moran’s
Statistics” by Jennifer Matthews, Norou Diawara, and Lance Waller proposes to
partition the area and compute the Moran’s statistic of each subarea.
The book as a whole certainly enhances the overall objective of the series, that
is, to foster the readership interest and enthusiasm in the STEAM-H disciplines
(Science, Technology, Engineering, Agriculture, Mathematics, and Health), to
include statistical and data sciences, and to stimulate graduate and undergraduate
research through effective interdisciplinary collaboration.
The editor of the current volume is affiliated with the Department of Mathematics
and Statistics at Old Dominion University, Norfolk, Virginia. The department has
the unique distinction of being the only one in the Commonwealth of Virginia
Hampton area to offer B.S., M.S., and Ph.D. degrees in Computational and Applied
Mathematics, with an active research program supported by NASA, NSF, EVMS,
JLab, and the Commonwealth of Virginia.
just the best as is done in the traditional DCEs. Besides having the information
of what the best and worst are for the respondents, Marley and Louviere (2005)
stated that BWS experiments provide information about the respondent’s ranking of
products.
BWS experiments are divided into three cases: the object case, the profile case, and the multi-profile case (Louviere et al. 2015). In the object case, a list of objects, scenarios, or attributes is given to respondents, who choose the best and worst alternative. Unlike in traditional DCEs, no information about the object is provided to the respondents. In the profile case, also known as the attribute-level best-worst case, information or attributes about the alternatives are provided. In this type of experiment, profiles composed of attribute-levels for each attribute describing a product are determined. From the profiles, respondents are tasked with choosing the best and the worst attribute-level pair. These experiments seek to determine the extent to which attributes and their associated attribute-levels affect consumer behavior. Furthermore, in attribute-level best-worst DCEs the levels of the attributes are well defined and vary across profiles or products, providing sufficient information to measure their impact. For example, Knox et al. (2012) study women's contraceptives using an unbalanced design with seven attributes (product, effect on acne, effect on weight, frequency of administration, contraceptive effectiveness, effect on bleeding, and cost) with associated numbers of attribute-levels 8, 3, 4, 4, 8, 9, and 6, respectively. The attribute effect on acne had three levels (worsens, improves, or no effect). Finally, the multi-profile case most closely resembles the traditional DCEs in that a set of profiles describing products is provided to the respondents, who choose the best and worst products from the choice set.
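As a concrete illustration of the profile case, the short Python sketch below enumerates the best-worst choice set of a single profile; the profile and its attribute names are hypothetical, not taken from any study cited here.

```python
from itertools import permutations

# A profile assigns one level to each of K attributes.  Its best-worst
# choice set consists of all ordered pairs of distinct attribute-levels,
# the first taken as best and the second as worst.
profile = {"A1": 2, "A2": 1, "A3": 4}   # hypothetical attribute -> level

choice_set = [
    ((best, profile[best]), (worst, profile[worst]))
    for best, worst in permutations(profile, 2)
]

K = len(profile)
assert len(choice_set) == K * (K - 1)   # tau = 6 ordered pairs for K = 3
```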
In this chapter, an extension of the existing work on partial profile models for attribute-level best-worst DCEs, or profile-case BWS, is presented. Attribute-level best-worst data are presented as indicator functions, demonstrating the equivalence of these models to the traditional attribute-level best-worst models. The indicator functions are then generalized, providing an alternative method for accounting for the attributes of attribute-level models. The functional form of the data definition provides an adaptive model able to conform to changes in the profile or set of attributes under discrete choice modeling (DCM). We also allow changes in decisions/utilities over time under a Markov decision process (MDP). The conditional logit model is used in the DCMs.
The chapter is organized as follows: attribute-level best-worst DCMs are introduced in Sect. 2. In Sect. 3, the functional form of attribute-level best-worst DCMs is presented. Section 4 considers Markov decision processes (MDPs) with regard to time-sensitive attribute-level best-worst DCEs. A simulated data example of the functional form of Markovian attribute-level best-worst DCMs over time and its results are described in Sect. 5. We end with a conclusion in Sect. 6.
Discrete choice experiments (DCEs) and their modeling describe consumer behaviors. Given a set of descriptors about a product, one can estimate the probability that an alternative is chosen, provided a statistical model appropriate to the data. However, these models are limited in the information they provide. According to Lancsar et al. (2013), there exist only two ways to gain more information from traditional DCEs: either increase the sample size or increase the number of choice sets evaluated by respondents, at the risk of adding burden on the respondents in the experiments.
Alternatively, Louviere and Woodworth (1991) and Finn and Louviere (1992) presented best-worst scaling experiments, which are modified DCEs designed to elicit more information about choice behavior than the pick-one approach implemented in traditional DCEs, without added burden on the respondents.
Although the experiments were presented in the early 1990s, it was not until Marley and Louviere (2005) that the mathematical probabilities and properties were formally presented. Marley and Louviere (2005), Marley et al. (2008), and Marley and Pihlens (2012) provided the probabilities and properties of best-worst scaling experiments for the profile case and the multi-profile version; however, utility was not introduced. Additionally, Lancsar et al. (2013) provided the probability and utility definition for the multi-profile experiments that include the sequential best-worst
choice from a set of choices. Louviere and Woodworth (1983) stated that orthogonal
main effects and fractional factorial designs provide better parameter estimates than
other designs. In application to best-worst scaling experiments, balanced incomplete
block designs (BIBD) (Louviere et al. 2013; Parvin et al. 2016) and orthogonal main
effects plans (OMEPs) are popular designs (Flynn et al. 2007; Knox et al. 2012;
Street and Knox 2012). These designs and their properties are examined by Street
and Burgess (2007). Louviere et al. (2013) looked at the design of experiments for
best-worst scaling experiments and stated that it is possible to determine individual
parameter estimates for the respondents.
This chapter focuses on the profile case, also known as attribute-level best-worst
DCEs. These experiments seek to determine the extent to which attributes and their
associated attribute-levels impact consumer behavior. Louviere and Timmermans
(1990) introduced hierarchical information integration (HII) for the examination
of the valuation of attributes in DCEs. Under HII, the impact of an attribute
necessitates discerning the various levels of the attribute. An experiment must
be designed in such a way that can measure the different levels varying across
products and determine such an impact. In attribute-level best-worst DCEs, the
levels of attributes are well defined and vary across profiles, or products, providing
sufficient information to measure their impact. Attribute-level best-worst discrete choice experiments provide more information about consumers' choices of products than the usual discrete choice experiments and add more value to the understanding of the data (Marley and Louviere 2005). Those models outperform the standard logit
modeling in terms of goodness of fit as mentioned in Hole (2011) in the context of
attribute attendance.
Understanding the impact attributes and attribute-levels have on utility is desirable. The guiding ideology in DCEs is that consumers behave in a way to maximize
utility. Understanding the impacts attributes and attribute-levels have on consumer
behavior provides information with regard to developing and advertising a product,
service, or policy to consumers. A preponderance of the literature on attribute-level best-worst DCEs consists of empirical studies, often in the areas of health systems research and marketing. Examples include Flynn et al. (2007) on seniors' quality
of life, Coast et al. (2006) and Flynn et al. (2008) on dermatologist consultations,
Marley and Pihlens (2012) on cell phones, Knox et al. (2012, 2013) on choices in
contraceptives for women.
While there exists literature on attribute-level best-worst DCEs, it is rather scarce compared to the work done on traditional DCEs. In this section, we provide the utility definition and the resulting choice probabilities and properties. We use the utility definition and choice probabilities to extend the work done by Grossmann et al. (2009) to fit models on a function of the attributes and attribute-levels, to reflect fluctuations that are inherent in DCEs over time.
Attribute-level best-worst scaling experiments are modified DCEs designed to elicit the impact
the attributes and attribute-levels have on the utility of a product. As mentioned
by Louviere and Timmermans (1990), an experiment must be designed in a way
to evaluate combinations of attribute-levels to obtain information about attribute
impacts on utility. Best-worst attribute-level DCEs provide such an experimental
design to attain these impacts.
In the attribute-level best-worst DCEs, each product is represented by a profile $x = (x_1, x_2, \ldots, x_K)$, where $x_k$ is the attribute-level for the $k$th attribute $A_k$ that makes up the product, with $k = 1, 2, \ldots, K$. The attribute-levels take values from 1 to $l_k$ for $k = 1, 2, \ldots, K$. The number of possible profiles is given by $\prod_{k=1}^{K} l_k$. Full factorial designs are generally not used due to the large number of profiles. Alternatively, OMEP designs are promoted in the literature as efficient and optimal, providing sufficient information to estimate model parameters (Louviere and Woodworth 1983; Street and Burgess 2007; Street and Knox 2012).
In these experiments, respondents are tasked with choosing a pair of attribute-levels that contains the best and the worst attribute-level for a given profile. For every profile, the choice set is then:
$$C_x = \{(x_j, x_{j'}) : j, j' = 1, 2, \ldots, K,\ j \ne j'\},$$
where the first attribute-level in a pair is considered to be the best and the second is the worst. From the choice set $C_x$, the respondent determines which of the $\tau = K(K-1)$ choices given is the best-worst pair.
In our setup, we extend the state of choices as follows. Let there be G profiles, with the associated profiles given as $x_1, x_2, \ldots, x_G$. The corresponding choice sets for the G profiles are given in Fig. 1. To simplify the notation, we may interchange $C_1$ with $C_{x_1}$, $C_2$ with $C_{x_2}$, ..., and $C_G$ with $C_{x_G}$.
The total number of attribute-levels is $L = \sum_{k=1}^{K} l_k$, and $J = \sum_{k=1}^{K} l_k (L - l_k)$ is the total number of unique attribute-level pairs in the experiment (Street and Knox 2012). Within each of the G choice sets, there are $\tau = K(K-1)$ choice pairs. In the experiment, there is a total of $J = \sum_{k=1}^{K} l_k (L - l_k)$ alternatives; however, within a choice set there is a total of $\tau = K(K-1)$ choices in each of the G choice sets evaluated. Each respondent will have made G choices within the experiment.
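A quick sanity check of these counting formulas in Python; the numbers of levels per attribute are chosen to match the sizes implied by the simulated example in Sect. 5 (three attributes with 2, 3, and 4 levels), which is an inference on our part, not a quoted design.

```python
# Counting formulas for an attribute-level best-worst experiment
# (Street and Knox 2012).
l = [2, 3, 4]                       # l_k: levels per attribute, K = 3
K = len(l)

L = sum(l)                          # total attribute-levels: 9
J = sum(lk * (L - lk) for lk in l)  # unique attribute-level pairs: 52
tau = K * (K - 1)                   # choice pairs within one profile: 6
p = K + L                           # model parameters: 12, as in Sect. 5

print(L, J, tau, p)                 # 9 52 6 12
```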
The response variable representing the choices within each of the choice sets is the indicator
$$Y_{isj} = \begin{cases} 1, & \text{if respondent } s \text{ chooses pair } j \text{ in choice set } C_i, \\ 0, & \text{otherwise,} \end{cases} \qquad (2.1)$$
for $i = 1, 2, \ldots, G$, $s = 1, 2, \ldots, n$, and $j = 1, 2, \ldots, \tau$.
For the attribute-level best-worst DCEs, the data, X, are composed of indicators for the best and worst attributes and attribute-levels. Consider the choice pair $(x_{ij}, x_{ij'})$ from the choice set $C_i$, for $i = 1, 2, \ldots, G$, $j \ne j'$, $j, j' = 1, 2, \ldots, K$, and $1 \le x_{ij} \le l_j$. Let X be the $J \times p$ design matrix, where $p = K + \sum_{k=1}^{K} l_k$. The rows of X correspond to the possible choice pairs. Let $X_{A_1}, X_{A_2}, \ldots, X_{A_K}$ be the data corresponding to the attributes $A_k$, $k = 1, 2, \ldots, K$. Then,
$$X_{A_k} = \begin{cases} 1, & \text{if } x_{ij} \in A_k, \\ -1, & \text{if } x_{ij'} \in A_k, \\ 0, & \text{otherwise,} \end{cases}$$
for $k = 1, 2, \ldots, K$.
Let $X_{A_k x_{ik}}$ be the data for the attribute-level $1 \le x_{ik} \le l_k$ within attribute $A_k$, for all $k = 1, 2, \ldots, K$. Referring to the choice pair $(x_{ij}, x_{ij'})$, the corresponding data for the attribute-levels are given by
$$X_{A_k x_{ik}} = \begin{cases} 1, & \text{if } x_{ij} = x_{ik} \in A_k \text{ is the best attribute-level,} \\ -1, & \text{if } x_{ij'} = x_{ik} \in A_k \text{ is the worst attribute-level,} \\ 0, & \text{otherwise.} \end{cases}$$
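To make this coding concrete, here is a minimal sketch that builds one such design row; the attribute sizes, indexing convention, and function name are illustrative assumptions, not part of the original text.

```python
import numpy as np

l = [2, 3, 4]          # hypothetical levels per attribute
K = len(l)
p = K + sum(l)         # 12 columns: K attribute effects + all level effects
offsets = np.concatenate(([0], np.cumsum(l)))  # start of each level block

def design_row(best, worst):
    """best/worst are (attribute, level) pairs with 0-based indices."""
    row = np.zeros(p)
    (kb, vb), (kw, vw) = best, worst
    row[kb] = 1.0                     # +1 for the best attribute
    row[kw] = -1.0                    # -1 for the worst attribute
    row[K + offsets[kb] + vb] = 1.0   # +1 for the best attribute-level
    row[K + offsets[kw] + vw] = -1.0  # -1 for the worst attribute-level
    return row

# best = level 2 of attribute 2, worst = level 2 of attribute 1
print(design_row((1, 1), (0, 1)))
```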
Marley and Louviere (2005) developed the probability theory for best-worst scaling experiments, including attribute-level best-worst DCEs. In attribute-level best-worst DCEs, two components are being modeled: the best choice and the worst choice of attribute-levels from a profile $x_i$, where $i = 1, 2, \ldots, G$. Under random utility theory (Marschak 1960), there are random utilities $U_{ij}$ corresponding to the $\tau$ attribute-levels in the choice set, and an individual chooses the alternative with the highest utility; i.e., the choices are not independent. Consider the choice pair $(x_{ij}, x_{ij'})$, for $i = 1, 2, \ldots, G$, $j, j' = 1, 2, \ldots, K$, and $j \ne j'$. According to Marley and Louviere (2005), the definition of utility consistent with random utility theory satisfies $U_{ij'j} = -U_{ijj'}$ and $U_{ijj'} = U_{ij} - U_{ij'}$ for $i = 1, 2, \ldots, G$, $j, j' = 1, 2, \ldots, K$, and $j \ne j'$, where $U_{ij} = V_{ij} + \epsilon_{ij}$, with $V_{ij}$ a systematic component and $\epsilon_{ij}$ an error term distributed as a type I extreme value distribution (McFadden 1978). Hence, the utility associated with the best-worst choice pair under random utility theory is given by:
$$U_{ijj'} = V_{ijj'} + \epsilon_{ijj'} = (V_{ij} - V_{ij'}) + (\epsilon_{ij} - \epsilon_{ij'}), \qquad (2.2)$$
for $i = 1, 2, \ldots, G$, $j, j' = 1, 2, \ldots, K$, and $j \ne j'$.
The utilities as defined under the random utility model cannot be fitted directly with the conditional logit model due to the definition of the error components (Marley and Louviere 2005). If we assume that the random error terms are independently and identically distributed as a type I extreme value distribution, or the Gumbel distribution, then the choice probability comes directly from the conditional logit. The choice probability is then
$$BW_{x_i}(x_{ij}, x_{ij'}) = P\big(U_{ijj'} > U_{iqq'},\ \forall (x_{iq}, x_{iq'}) \in C_i,\ (q, q') \ne (j, j')\big),$$
where $j \ne j'$, $j, j' = 1, 2, \ldots, K$, and $i = 1, 2, \ldots, G$.
Since the error terms come from the type I extreme value distribution, their difference follows a logistic distribution. It follows from McFadden (1974) that the best-worst attribute-level choice probability is defined by the conditional logit as:
$$BW_{x_i}(x_{ij}, x_{ij'}) = \frac{\exp(V_{ijj'})}{\sum_{(x_{iq}, x_{iq'}) \in C_i} \exp(V_{iqq'})}, \qquad (2.4)$$
where $j \ne j'$, $j, j' = 1, 2, \ldots, K$, and $i = 1, 2, \ldots, G$.
Marley et al. (2008) provide essential properties of the above probabilities. They define the choice probability as:
$$BW_{x_i}(x_{ij}, x_{ij'}) = \frac{b(x_{ij})/b(x_{ij'})}{\sum_{(x_{iq}, x_{iq'}) \in C_{x_i},\, q \ne q'} b(x_{iq})/b(x_{iq'})}, \qquad (2.5)$$
where $x_{ij}$ is chosen as the best attribute-level, $x_{ij'}$ is the worst attribute-level, and $b$ is some positive scale function, or impact of attribute, for $j \ne j'$, $j, j' = 1, 2, \ldots, K$, and $i = 1, 2, \ldots, G$. Under the conditional logit, the scale function is defined as $b(x_{ij}) = \exp(V_{ij})$, and the probability is as given in Eq. (2.4).
The essential properties of a probability hold for Eq. (2.5):
$$BW_{x_i}(x_{ij}, x_{ij'}) \ge 0$$
and
$$\sum_{(x_{ij}, x_{ij'}) \in C_{x_i},\, j \ne j'} BW_{x_i}(x_{ij}, x_{ij'}) = 1,$$
where $j, j' = 1, 2, \ldots, K$, $j \ne j'$, and $\forall i = 1, 2, \ldots, G$.
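A minimal numerical sketch of Eq. (2.4): given hypothetical systematic utilities per attribute-level, the pair probabilities are a softmax of the utility differences. The values below are placeholders.

```python
import numpy as np

V = {"A1": -0.5, "A2": 1.2, "A3": 0.3}   # hypothetical V_ij per attribute-level

pairs = [(b, w) for b in V for w in V if b != w]      # tau = K(K-1) pairs
scores = np.array([V[b] - V[w] for b, w in pairs])    # V_ijj' = V_ij - V_ij'

probs = np.exp(scores) / np.exp(scores).sum()         # Eq. (2.4)
for (b, w), pr in zip(pairs, probs):
    print(f"best={b}, worst={w}: {pr:.3f}")
```

The most probable pair is the one maximizing the utility difference, consistent with the maxdiff interpretation below.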
With such assumptions, the consumer is expected to select choices with higher $BW_{x_i}$ values. We denote $BW_{x_i}(x_{ij}, x_{ij'})$ as $P^i_{jj'}$. Attribute-level best-worst models are called maxdiff models because they maximize the difference in utility. Associated properties of the maxdiff model mentioned in Marley et al. (2008) are:
1. Invertibility: For profile $i$,
$$P^i_{jj'} P^i_{j'j} = P^i_{qq'} P^i_{q'q},$$
where $j \ne j'$, $j, j' = 1, 2, \ldots, K$, and $i = 1, 2, \ldots, G$.
We assume the error terms come from a type I extreme value distribution and use the conditional logit model to estimate the $p \times 1$ parameter vector
$$\beta = (\beta_{A_1}, \ldots, \beta_{A_K}, \beta_{A_1 1}, \ldots, \beta_{A_1 l_1}, \ldots, \beta_{A_K 1}, \ldots, \beta_{A_K l_K})'.$$
The likelihood for estimating the model parameters, based on a random sample of $n$ individuals with responses as in Eq. (2.1), is given as:
$$L(\beta; Y) = \prod_{s=1}^{n} \prod_{i=1}^{G} \prod_{j \ne j'} \left(P^i_{jj'}\right)^{Y_{isj}}. \qquad (2.11)$$
For identifiability, the attribute-level parameters within each attribute are constrained to satisfy
$$\sum_{i=1}^{l_k} \beta_i = 0 \qquad (2.12)$$
or, equivalently,
$$\beta_{l_k} = -\sum_{j=1}^{l_k - 1} \beta_j \qquad (2.13)$$
for all $k = 1, 2, \ldots, K$ (Street and Burgess 2007; Flynn et al. 2007; Grasshoff et al. 2003).
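A hedged sketch of maximizing the likelihood in Eq. (2.11) numerically: the function below evaluates the negative log-likelihood of the conditional logit, assuming the design matrix already encodes the constraint (2.13) (e.g., by dropping the last level of each attribute). The data layout and names are assumptions for illustration.

```python
import numpy as np

def neg_log_likelihood(beta, X, sets, Y):
    """X: (J, p) design rows, one per attribute-level pair;
    sets: list of index arrays, the pairs forming each choice set C_i;
    Y: (n, J) matrix of 0/1 best-worst choices (Eq. 2.1)."""
    eta = X @ beta                          # V_ijj' for every pair
    nll = 0.0
    for idx in sets:                        # conditional logit per choice set
        log_denom = np.log(np.exp(eta[idx]).sum())
        nll -= (Y[:, idx] * (eta[idx] - log_denom)).sum()
    return nll
```

Passing this function to a general-purpose optimizer (e.g., scipy.optimize.minimize) would then yield the maximum likelihood estimate of beta.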
Next, the goal will be to build a functional form of the attributes and the attribute-
levels and estimate the associated model parameters that reflect their utilities.
The attribute-level best-worst DCEs are modified traditional DCEs, and models and theory developed for traditional DCEs have not been completely evaluated in terms of best-worst scaling experiments. It is of interest to us to extend the model built on a function of the data, as presented in Grasshoff et al. (2003, 2004) and Grossmann et al. (2009), to the attribute-level best-worst DCEs. In extending this work to these experiments, we provide an additional way to define the systematic component that offers flexibility not seen in traditional methods.
Considering functions of the attributes as they enter into the utility function is
not a new idea. Van Der Pol et al. (2014) present the systematic components of the
utility defined as linear functions, quadratic functions, or as stepwise functions of
the attributes. Grasshoff et al. (2013) define the functions as regression functions of
the attributes and attribute-levels in the model.
In the attribute-level best-worst DCEs, a set of G profiles, or products, are
examined. The profiles are given as xi = (xi1 , xi2 , . . . , xiK ), where xij is the
attribute-level in profile i = 1, 2, . . . , G that corresponds to the j th attribute, where
j = 1, 2, . . . , K. The choice task for respondents is to choose the best-worst pair
of attribute-levels. In the experiment, respondents make paired comparisons within
the profiles instead of between as in traditional DCEs.
In the attribute-level best-worst DCEs, the utility of the pairs is composed of
the utility corresponding to the best attribute-level and the worst attribute-level. The
regression functions presented in Grasshoff et al. (2003) are applied to the attributes
and attribute-levels within the respective systematic components. Let f be the set
of regression functions for the best attribute-levels in the pairs, and g the set of
regression functions for the worst attribute-levels in the pairs. The p × 1 parameter
vector β still must satisfy the identifiability condition given in Eq. (2.13).
As noted in Marley and Louviere (2005), the inverse random utility model must be used so that the properties are met for the conditional logit model. Taking the systematic component defined in Eq. (2.2), the functional systematic component for the pair $(x_{ij}, x_{ij'})$ is defined as:
$$V_{ijj'} = \big(f(x_{ij}) + g(x_{ij'})\big)'\beta, \qquad (3.1)$$
where $j, j' = 1, 2, \ldots, K$, $j \ne j'$, and $i = 1, 2, \ldots, G$.
The probability that an alternative is chosen depends on the definition of the utility and the distribution of the error terms. Referring back to Eq. (2.5), under the conditional logit the probability is
$$BW_{x_i}(x_{ij}, x_{ij'}) = \frac{\exp(V_{ijj'})}{\sum_{(x_{iq}, x_{iq'}) \in C_i} \exp(V_{iqq'})} = \frac{\exp(V_{ij} - V_{ij'})}{\sum_{(x_{iq}, x_{iq'}) \in C_i} \exp(V_{iq} - V_{iq'})},$$
where $i = 1, 2, \ldots, G$, $j, j' = 1, 2, \ldots, K$, and $j \ne j'$.
The forms of the systematic components of the utilities, as well as their associated probabilities, depend on the definition of the regression functions f and g. We define the regression functions used in the traditional attribute-level best-worst model and extend them to a more general form that provides flexibility in the model. We present that feasible version in the next subsection, followed by a simulated example.
Earlier, we provided the design, probabilities, and properties associated with attribute-level best-worst pairs. The data in these experiments are defined as a series of 1s, 0s, and −1s corresponding to the best and worst attributes and attribute-levels in a given choice pair. There exists a set of functions f and g defined on the attribute-level pair that reproduces the traditional methods.
In the attribute-level best-worst DCEs, a set of G profiles, or products, is examined. The profiles are given as $x_i = (x_{i1}, x_{i2}, \ldots, x_{iK})$, where $x_{ij}$ is the attribute-level in profile $i = 1, 2, \ldots, G$ that corresponds to the $j$th attribute for $j = 1, 2, \ldots, K$. Let us consider the choice pair $(x_{ij}, x_{ij'})$, where $j \ne j'$, $j, j' = 1, 2, \ldots, K$, and $i = 1, 2, \ldots, G$. Let f be the set of regression functions defined on the best attribute-level in a pair and g be the set of regression functions defined on the worst attribute-level in a pair.
In the traditional attribute-level best-worst DCE, the regression functions f and g are defined as indicator functions, which are $p \times 1$ vectors. For the attributes, they are defined as
$$I_{A_k}(x_{ij}) = \begin{cases} 1, & \text{if } x_{ij} \in A_k, \\ 0, & \text{otherwise,} \end{cases} \qquad (3.3)$$
where $j, k = 1, 2, \ldots, K$ and $i = 1, 2, \ldots, G$, with the attribute-level indicators $I_{A_k x_k}$ defined analogously.
The best-worst systematic component for the pair $(x_{ij}, x_{ij'})$ is given as
$$V_{ijj'} = \sum_{k=1}^{K}\left[I_{A_k}(x_{ij})\beta_{A_k} + \sum_{m=1}^{l_k} I_{A_k x_{km}}(x_{ij})\beta_{A_k x_{km}}\right] - \sum_{k=1}^{K}\left[I_{A_k}(x_{ij'})\beta_{A_k} + \sum_{m=1}^{l_k} I_{A_k x_{km}}(x_{ij'})\beta_{A_k x_{km}}\right]$$
$$= I_{A_j}\beta_{A_j} - I_{A_{j'}}\beta_{A_{j'}} + I_{A_j x_{ij}}\beta_{A_j x_{ij}} - I_{A_{j'} x_{ij'}}\beta_{A_{j'} x_{ij'}}, \qquad (3.5)$$
where $j, j' = 1, 2, \ldots, K$, $j \ne j'$, and $i = 1, 2, \ldots, G$.
Rewriting the indicator functions of the $A_k$ and $A_k x_k$, a more general form of the regression functions can be defined. Let $b_{A_k}$ and $b_{A_k x_k}$ be constants corresponding to the best attribute and attribute-levels in a pair, and $w_{A_k}$ and $w_{A_k x_k}$ be constants corresponding to the worst attribute and attribute-levels in a pair, where $x_k = 1, 2, \ldots, l_k$ and $k = 1, 2, \ldots, K$. The regression functions f and g are given as
$$f(x_{ij}) = \sum_{k=1}^{K}\left[b_{A_k} I_{A_k}(x_{ij}) + \sum_{x_k=1}^{l_k} b_{A_k x_k} I_{A_k x_k}(x_{ij})\right] \qquad (3.6)$$
and
$$g(x_{ij'}) = -\sum_{k=1}^{K}\left[w_{A_k} I_{A_k}(x_{ij'}) + \sum_{x_k=1}^{l_k} w_{A_k x_k} I_{A_k x_k}(x_{ij'})\right], \qquad (3.7)$$
where $j, j' = 1, 2, \ldots, K$, $j \ne j'$, and $i = 1, 2, \ldots, G$.
The above functions are simple linear processes which can be used to model the attribute-level best-worst DCEs. Furthermore, the dependence, or functional dependence, can be extended by considering
$$f(x_{ij}) = \sum_{k=1}^{K}\left[f_k(A_k) + \sum_{j=1}^{l_k} f_{k,j}(A_k x_{kj})\right] \qquad (3.8)$$
and
$$g(x_{ij'}) = -\sum_{k=1}^{K}\left[g_k(A_k) + \sum_{j=1}^{l_k} g_{k,j}(A_k x_{kj})\right], \qquad (3.9)$$
where $f_k$, $g_k$, $f_{k,j}$, and $g_{k,j}$ can be linear, nonlinear, or kernel-based functional forms for the best and worst attributes and attribute-levels, respectively.
Regression functions defined in this way provide more flexibility than the traditional attribute-level best-worst DCEs. Consumer preference for products is constantly changing as new information about a product comes to light or as trends come and go. Hence, the data collected on a product may be dynamic. The addition of these constants to the regression functions provides researchers the ability to scale the data to reflect current trends or changes in the products. For example, let us consider that the products being modeled are pharmaceuticals, such as the contraceptives considered in Knox et al. (2012, 2013). If new information about a brand of contraceptives posing a health risk were discovered, then using the regression functions, it is possible to update the model to reflect this change. Assuming the change is to remove the brand, the attribute-level associated with the brand may have $b_{k x_k} = w_{k x_k} = 0$, where $x_k = 1, 2, \ldots, l_k$ and $k = 1, 2, \ldots, K$, to represent its removal from the market. For all the pairs this attribute-level was in, the information the choice pair provides in terms of the other attributes and attribute-levels would remain intact. The model would be estimated again, and the parameter vector $\beta$ would provide the updated impact of the attributes and attribute-levels in the experiment.
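The following sketch illustrates this rescaling idea in Python, using the b = w weight values listed later in Sect. 5; the column layout and the index of the withdrawn level are hypothetical.

```python
import numpy as np

# Weights in the p = 12 layout (3 attribute effects, then 2 + 3 + 4 levels),
# matching the values chosen in the Sect. 5 simulation.
b = np.array([-2, 5, 1, -2, -2, 5, 5, 5, 1, 1, 1, 1], dtype=float)
w = b.copy()                 # the simulated example uses b = w

removed = 5                  # hypothetical column of the withdrawn brand level
b[removed] = 0.0             # b_{k x_k} = 0 and w_{k x_k} = 0 represent removal;
w[removed] = 0.0             # pairs keep their information on other attributes

def f(x_row):
    # Eq. (3.6): weights applied to the +1 (best) entries of a design row
    return b * x_row.clip(min=0)

def g(x_row):
    # Eq. (3.7): weights applied to the -1 (worst) entries (sign carried through)
    return w * x_row.clip(max=0)
```

Refitting the conditional logit with these rescaled regression functions would then yield the updated parameter vector described above.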
infinite horizon. For the purpose of this chapter, our interest is in discrete-time, finite-horizon MDPs, that is, $t = 1, 2, \ldots, T$ with $T < \infty$. Numerical methods such as dynamic programming are used to estimate the expected rewards for this type of MDP.
As the decision process is Markovian, the transition probability to the next state $s_{t+1}$, based solely on the decision made at the current state $s_t$, is $p(s_{t+1} \mid s_t)$, where $t = 1, 2, \ldots, T$ (Puterman 2014). The transition probabilities are the drivers of this sequential decision-making process. The decision process maps the movement from one state to another over time $t$, based on the rewards received and the optimal decision set. The optimal decision rule is known as the policy, $\delta = (d_1^*, d_2^*, \ldots, d_T^*)$, where $d_t^*$ is the decision at time $t = 1, 2, \ldots, T$ that yields the maximum expected reward (Puterman 2014).
While there exists some literature on the application of MDPs in traditional
DCEs, we have not encountered any work in the literature to extend these methods
to best-worst scaling DCEs. In this chapter, we extend the use of MDPs to Case 2
of best-worst scaling models, the attribute-level best-worst DCEs.
In traditional MDPs, the value functions are computed for each of the J alternatives, or products. At each time point, $t = 1, 2, \ldots, T$, the decision $d_t$ is to choose the alternative that provides the maximum expected utility given information about the state $s_t = (x_t, \epsilon_t)$, where $x_t$ is the set of K attributes. In traditional DCEs, the decision is made between alternatives. In attribute-level best-worst DCEs, the experiments model choices within products, not between products.
In attribute-level best-worst DCEs, there are K attributes describing a product, each with $l_k$ levels, where $k = 1, 2, \ldots, K$. The total number of products in these experiments is $\prod_{k=1}^{K} l_k$. The products are represented in the experiment by profiles. The profile corresponding to the $i$th product is given as $x_i = (x_{i1}, x_{i2}, \ldots, x_{iK})$, where $x_{ik}$ is the attribute-level corresponding to the attribute $A_k$, for $k = 1, 2, \ldots, K$ and $i = 1, 2, \ldots, G$. Within each choice set there are $\tau = K(K-1)$ choices. A respondent is asked to evaluate G choice sets in the experiment.
MDPs model the decision process for respondents over multiple time points. For attribute-level best-worst DCEs, the model is built within the choice sets corresponding to each of the G profiles. In traditional DCEs, there are J alternatives evaluated at each time point, producing J value functions at each time point. Attribute-level best-worst DCEs require respondents to evaluate a series of G choice sets, each with $\tau$ choices; thus there are $\tau$ value functions for each choice set in attribute-level best-worst MDPs. Our interest is to further model the sequence of decisions made by introducing the time element into the experiments. For attribute-level best-worst DCEs, we consider discrete-time, finite-horizon MDPs where:
• G choice sets are modeled across time.
• $x^t_{ijj'} = (x_{ij}, x_{ij'})$ are the attributes and attribute-levels corresponding to the choices in set $C_i$, $i = 1, 2, \ldots, G$, $j \ne j'$, $j, j' = 1, 2, \ldots, K$, and $t = 1, 2, \ldots, T$.
• The decision set depends on the choice set; it is denoted $D_i$, and we evaluate $d^t_i \in D_i$, where $i = 1, 2, \ldots, G$ and $1 \le d^t_i \le \tau$.
• The set of possible states in the experiment depends on the choice set; it is denoted $S_i$, where $s^t_i = (x^t_{ij}, x^t_{ij'}) \in S_i$, with $j \ne j'$, $j, j' = 1, 2, \ldots, \tau$, $i = 1, 2, \ldots, G$, and $1 \le s^t_i \le \tau$.
• Transition probabilities depend on a set of parameters $\theta$ that are assumed known, or are estimated from data (Arcidiacono and Ellickson 2011).
• Transition probability matrices, $P^i_{s s'}$, depend on the choice set being evaluated.
In attribute-level best-worst DCEs, the MDPs model the choices in attribute-level
pairs within choice sets over time. Therefore, the transition probabilities and value
functions must be defined within the choice sets. Bellman (1954) utilized dynamic
programming to evaluate the value function, also known as Bellman’s equation, at
each time step. Rust (1994, 2008) presented the use of dynamic programming for
evaluating DCEs as MDPs.
The value function for DCEs defined by Bellman's equation, with the expectation taken over the transition probabilities $P_{ss'}$, is given as:
$$V_t(x_t, \epsilon_t) = \max_{d_t \in D} E\left[\sum_{t'=t}^{T} \gamma^{t'-t} \left\{U(x_{t'}, d_{t'}) + \epsilon_{t'}(d_{t'})\right\} \,\middle|\, x_t, \epsilon_t\right], \qquad (4.1)$$
where $t = 1, 2, \ldots, T$, $U(s_{it}, d_{it})$ represents the utility associated with the state $s_{it}$ and decision $d_{it}$, the discount utility rate is $\gamma \in (0, 1)$, and $i = 1, 2, \ldots, G$. The decision $d_{it} = (x_{ij}, x_{ij'})$ is a choice pair within $C_i$, where $i = 1, 2, \ldots, G$, $j, j' = 1, 2, \ldots, K$, and $j \ne j'$. In the attribute-level best-worst DCEs, there will be $\tau = K(K-1)$ value functions for each of the G choice sets. One of the disadvantages of these experiments is the “curse of dimensionality” (Rust 2008): as the number of attributes, attribute-levels, and profiles in the experiment grows, the estimation becomes increasingly demanding.
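A minimal backward-induction sketch of how expected discounted utilities of the $\tau$ pairs in one choice set can be propagated over a finite horizon; it uses a stationary transition matrix and placeholder utilities, and it simplifies Eq. (4.1) by following the respondents' transition process rather than maximizing over decisions at each epoch.

```python
import numpy as np

tau, T, gamma = 6, 4, 0.9                  # pairs per set, horizon, discount rate
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(tau), size=tau)  # stationary transitions; rows sum to 1
u = rng.normal(size=tau)                   # placeholder per-pair utilities U(s, d)

V = np.zeros((T + 1, tau))                 # terminal condition: V_{T+1} = 0
for t in range(T - 1, -1, -1):             # backward induction over epochs
    V[t] = u + gamma * P @ V[t + 1]        # expected discounted utility per state
print(V[0])                                # expected utilities at the first epoch
```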
4.1 Utility

As in Sect. 2, the utility of the choice pair $(x_{ij}, x_{ij'})$ at time $t$ decomposes into a systematic component and a type I extreme value error term,
$$U^t_{ijj'} = V^t_{ijj'} + \epsilon^t_{ijj'},$$
where $i = 1, 2, \ldots, G$, $j, j' = 1, 2, \ldots, K$, and $j \ne j'$.
Referring back to Sect. 3, the systematic component is defined as a model built on functions of the best and worst attribute-levels in the pair. Using Eq. (3.1),
$$f_t(x_{ij}) = \sum_{k=1}^{K}\left[b^t_{A_k} I_{A_k}(x_{ij}) + \sum_{x_k=1}^{l_k} b^t_{A_k x_k} I_{A_k x_k}(x_{ij})\right], \qquad (4.5)$$
and
$$g_t(x_{ij'}) = -\sum_{k=1}^{K}\left[w^t_{A_k} I_{A_k}(x_{ij'}) + \sum_{x_k=1}^{l_k} w^t_{A_k x_k} I_{A_k x_k}(x_{ij'})\right], \qquad (4.6)$$
where $j, j' = 1, 2, \ldots, K$, $j \ne j'$, and $i = 1, 2, \ldots, G$.
Defining the systematic components according to these weighted functions allows the utility to change over time. We considered in Sect. 3 an example where an attribute-level no longer exists in the future. The weighted functions $f_t$ and $g_t$ allow us to update the parameter estimates, and thus the utilities, using these weights. It is conceivable that in the future an attribute-level scale may need to be adjusted for possible bettering, worsening, or removal-type conditions for that attribute-level.
MDPs allow infinitely many possible futures to be considered in the simulations. The definition of the transition probabilities is the vehicle that drives the process toward these different futures. However, determining transition probabilities for MDPs is a difficult task. One way of estimating the transition probabilities is through maximum likelihood estimates (MLEs). An empirical solution for the transition probabilities may be determined by modeling them with a multinomial distribution (Lee et al. 1968).
In the attribute-level best-worst DCEs, there are $\tau$ choices within a choice set $C_i$, where $i = 1, 2, \ldots, G$. There are $\tau$ states, and/or decisions, possible at each of the time points. The transition probabilities are denoted as $P_{ss'} = P(s_{t+1} = s' \mid s_t = s)$, where $s_t, s_{t+1} \in S$ and $S = \{1, 2, \ldots, \tau\}$. Let $N_i$ be the number of respondents common to times $t$ and $t+1$ in the experiment and $n_{iss'}$ be the number of respondents who chose $s$ at time $t$ and $s'$ at $t+1$, where $t = 1, 2, \ldots, T$ and $i = 1, 2, \ldots, G$. The transition choice probabilities are given by the multinomial distribution as:
$$f(p_{is1}, p_{is2}, \ldots, p_{is\tau}) = \frac{N_i!}{n_{is1}! \, n_{is2}! \cdots n_{is\tau}!} \, p_{is1}^{n_{is1}} p_{is2}^{n_{is2}} \cdots p_{is\tau}^{n_{is\tau}}, \qquad (4.7)$$
where $s = 1, 2, \ldots, \tau$, $i = 1, 2, \ldots, G$, $p_{iss'} \ge 0$, and $\sum_{s'=1}^{\tau} p_{iss'} = 1$.
Due to the constraint $\sum_{s'=1}^{\tau} p_{iss'} = 1$, a Lagrange multiplier $\lambda$ is used, and the Lagrangian function is given as
$$G(p_{iss'}) = LL(p_{iss'}) - \lambda\left(\sum_{s'=1}^{\tau} p_{iss'} - 1\right),$$
where $s = 1, 2, \ldots, \tau$, $i = 1, 2, \ldots, G$, $p_{iss'} \ge 0$, and $\sum_{s'=1}^{\tau} p_{iss'} = 1$. Taking the partial derivatives of the Lagrangian to determine the MLEs gives $\frac{n_{iss'}}{\lambda} = p_{iss'}$ for $s' = 1, 2, \ldots, \tau$. Under the constraint $\sum_{s'=1}^{\tau} p_{iss'} = 1$, the value of $\lambda$ is $\lambda = \sum_{s'=1}^{\tau} n_{iss'} = N_i$. Thus, the MLE is $\hat{p}_{iss'} = \frac{n_{iss'}}{N_i}$ for $s, s' = 1, 2, \ldots, \tau$ and $i = 1, 2, \ldots, G$.
The MLE of $p_{iss'}$ is computationally simple; however, the information needed to compute it may not always be available. To compute an MLE of this nature, we would need respondents to evaluate the same choice sets at two time periods, which is not necessarily an easy task. Furthermore, this assumes the transition matrix is stationary. It is possible to consider a dynamic transition matrix that changes over time, that is, $p^t_{iss'}$ for $t = 1, 2, \ldots, T$. A transition matrix of this nature would require multiple time periods of data for the same respondents evaluating the same choice sets to compute the empirical probabilities. In instances where multiple time periods of data for respondents are not available, one must consider alternative methods for determining the transition probabilities.
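The empirical MLE above is straightforward to compute when paired responses are available; a brief sketch with hypothetical choice data follows.

```python
import numpy as np

tau = 6
choices_t  = np.array([0, 2, 1, 0, 5, 3, 0, 2])   # states chosen at time t
choices_t1 = np.array([0, 2, 2, 1, 5, 3, 0, 0])   # same respondents at t + 1

counts = np.zeros((tau, tau))
np.add.at(counts, (choices_t, choices_t1), 1)     # n_{ss'} transition counts

row_totals = counts.sum(axis=1, keepdims=True)
p_hat = np.divide(counts, row_totals,
                  out=np.zeros_like(counts),
                  where=row_totals > 0)           # MLE: n_{ss'} / N per row
```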
For each state $s$, the transition parameters are collected in the vector
$$\theta^t_s = \left(\theta^t_{sA_1}, \theta^t_{sA_2}, \ldots, \theta^t_{sA_K}, \theta^t_{sA_1 1}, \ldots, \theta^t_{sA_K l_K}\right), \quad \text{with } \theta^t_{sj} = a_{sj}(t)\,\beta_j,$$
where the $a$'s are the time-factor changes and the $\beta$'s are fixed, for $i = 1, 2, \ldots, G$, $1 \le s \le \tau$, and $t = 1, 2, \ldots, T$. The definition of $a_s(t)$ depends on the state $s$ and the time $t = 1, 2, \ldots, T$. We have considered $a_{sj}(t) = a^t_{sj}$, where if $|a_{sj}| < 1$ the impact of the attribute or attribute-level lessens with time, for $j = 1, 2, \ldots, K$. If $a_{sj}(t)\hat{\beta}_j = a^t_{sj}\hat{\beta}_j > 0$, then the attribute or attribute-level has a positive impact evolving at the rate $a^t_{sj}$ over time, for $j = 1, 2, \ldots, K$ and $t = 1, 2, \ldots, T$. A static transition matrix corresponds to $a_{sj}(t)$ constant in $t$. The probability of transitioning to the state associated with the pair $(x_{ij}, x_{ij'})$ is
$$P^t(s_{ijj'} \mid s_i, \theta_{s_i}) = P^t\big(V^t_{ijj'} + \epsilon^t_{ijj'} > V^t_{ikk'} + \epsilon^t_{ikk'},\ \forall k \ne k' \in C_i \mid s_i, \theta_{s_i}\big) = P^t\big(\epsilon^t_{ikk'} < \epsilon^t_{ijj'} + V^t_{ijj'} - V^t_{ikk'},\ \forall k \ne k' \in C_i \mid s_i, \theta_{s_i}\big), \qquad (4.8)$$
where $j \ne j'$, $j, j' = 1, 2, \ldots, \tau$, $i = 1, 2, \ldots, G$, and $t = 1, 2, \ldots, T$. If we assume the random error terms are independently and identically distributed as a type I extreme value distribution, the probability is found using the conditional logit and is given as:
$$P^t(s_{ijj'} \mid s_i, \theta_{s_i}) = P\big(U^t_{ijj'} > U^t_{ikk'},\ \forall k \ne k' \in C_i \mid s_i, \theta_{s_i}\big) = \frac{\exp(V^t_{ijj'})}{\sum_{k, k' \in C_i} \exp(V^t_{ikk'})},$$
where $j \ne j'$, $k \ne k'$, $j, j' = 1, 2, \ldots, \tau$, $i = 1, 2, \ldots, G$, and $t = 1, 2, \ldots, T$.
These probabilities form the rows of the transition matrix at time $t$, where $i = 1, 2, \ldots, G$, $s, s' = 1, 2, \ldots, \tau$, and $\sum_{s'=1}^{\tau} P^t_{iss'} = 1$. The transition matrix may be either stationary or dynamic in nature. In our definition of $\theta^t_{s_i}$, this is determined by the rate $a_{s_i j}(t)$, where $i = 1, 2, \ldots, G$, $1 \le j \le p$, and $t = 1, 2, \ldots, T$. In Sect. 5, we provide simulations under stationary and dynamic transition probabilities and make comparisons.
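A brief sketch of computing one such model-based transition row via the conditional logit; the design matrix, parameter vector, and function name are illustrative assumptions.

```python
import numpy as np

def transition_row(theta, X):
    """theta: (p,) transition parameters for the current state;
    X: (tau, p) design rows of the candidate pairs in the choice set."""
    v = X @ theta                  # V^t for each candidate next state
    e = np.exp(v - v.max())        # numerically stabilized softmax
    return e / e.sum()             # one row of P^t; entries sum to 1

# Stacking one row per current state s yields the tau x tau matrix P^t;
# a dynamic matrix results when theta varies with t.
```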
5 Simulation Example
Table 2  Choice pairs with the highest utility in the experiment

Best attribute  Level  Worst attribute  Level  Utility
2               1      1                1      12.3633
2               2      1                1      8.8012
3               4      1                1      7.6931

Table 3  Choice pairs with the lowest utility in the experiment

Best attribute  Level  Worst attribute  Level  Utility
1               1      2                1      −9.2594
1               1      2                2      −6.5358
1               1      3                4      −5.7929
The best and worst 3 choice pairs, along with their utilities, are presented in Tables 2 and 3, respectively. The opposites of the pairs with the highest utilities have the lowest utilities.
5.1 Scenario 1

We ran the simulation under this scenario with an advantageous proposed structure. The intent is to validate our relative performance over time under stationary transition probabilities.
In this example, respondents are assumed to make, at each decision epoch, decisions similar to those they made at the previous time point. The transition parameters $\theta^t_{s_i}$, where $s^t_i = (x_{ij}, x_{ij'})$, are defined for the attributes as
$$\theta^t_{s_i A_k} = \begin{cases} 1.7\,|\beta_{A_k}|, & \text{if } x_{ij} \in A_k, \\ -1.7\,|\beta_{A_k}|, & \text{if } x_{ij'} \in A_k, \\ \beta_{A_k}, & \text{otherwise,} \end{cases} \qquad (5.1)$$
with the attribute-level parameters defined analogously; without this scaling, every row of the transition matrix would be the same. Recall that $p = K + \sum_{k=1}^{K} l_k = 12$ is the number of parameters. We consider $1.7|\beta_m|$ when a state or choice pair at time $t+1$ has the same best attribute and attribute-level as the state occupied at time $t$, and $-1.7|\beta_m|$ when a state or choice pair at time $t+1$ has the same worst attribute and attribute-level as the state occupied at time $t$. We use $|\beta_m|$ to control the direction of the impact, making sure it is positive for the best attribute and attribute-level of $s_i$, and use $-|\beta_m|$ to make sure it is negative for the worst attribute and attribute-level of $s_i$. We use the factor 1.7 to increase the impact of the best and worst attributes and attribute-levels of $s_i$. Defining $a_{s_i m}(t)$ in this way ensures that states with the same best and worst attributes and attribute-levels as the presently occupied state, $s^t_i = (x_{ij}, x_{ij'})$, have a greater probability of being transitioned to, where $i = 1, 2, \ldots, G$, $j \ne j'$, $j, j' = 1, 2, \ldots, K$, and $t = 1, 2, \ldots, T$. The weights associated with the attributes and attribute-levels are selected as: $b_{A_1} = w_{A_1} = -2$, $b_{A_2} = w_{A_2} = 5$, $b_{A_3} = w_{A_3} = 1$; $b_{A_1 1} = w_{A_1 1} = b_{A_1 2} = w_{A_1 2} = -2$; $b_{A_2 1} = w_{A_2 1} = b_{A_2 2} = w_{A_2 2} = b_{A_2 3} = w_{A_2 3} = 5$; and $b_{A_3 1} = w_{A_3 1} = b_{A_3 2} = w_{A_3 2} = b_{A_3 3} = w_{A_3 3} = b_{A_3 4} = w_{A_3 4} = 1$.
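A small sketch of how these transition parameters can be assembled; the indexing helper is hypothetical, and the dynamic flag anticipates the $1.7^t$ rate used in Scenario 2 (Eq. 5.3).

```python
import numpy as np

def theta_for_state(beta, best_idx, worst_idx, t=1, dynamic=False):
    """beta: (p,) fixed parameters; best_idx/worst_idx: indices of the
    attribute and attribute-level entries of the current state's best
    and worst components (hypothetical layout)."""
    factor = 1.7 ** t if dynamic else 1.7        # Scenario 2 vs. Scenario 1
    theta = np.asarray(beta, dtype=float).copy()
    theta[best_idx] = factor * np.abs(theta[best_idx])      # boost best
    theta[worst_idx] = -factor * np.abs(theta[worst_idx])   # penalize worst
    return theta
```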
Referring back to Sect. 4, the systematic component as a function of the best and worst attribute-levels in the pair is as in Eq. (4.4), where $f_t$ and $g_t$ are as in Eqs. (4.5) and (4.6), with profile choice pairs shown in Fig. 2. The value functions/expected utilities for Profile 1 are displayed in Fig. 3, with the legend displayed in Fig. 2, along with the differences in the value functions over time. Choice pair $(x_{22}, x_{12})$, where $x_{22}$, the 2nd level of attribute 2, is the best and $x_{12}$, the 2nd level of attribute 1, is the worst, is the choice with the highest expected utility. The opposite pair $(x_{12}, x_{22})$ is the worst choice pair. The pair $(x_{34}, x_{22})$ has a sharp drop between times $t = 3$ and $t = 4$ because of the change in the weights applied to the attributes and attribute-levels from Eq. (3.1).
The model applied here views the attribute-level best-worst DCEs as sequential, leading to partial separation of best-worst choices over time. Validity is guided by the transition probabilities: under Scenario 1, the participants follow the same choice preferences. In Table 4, the transition probabilities are generally highest on the diagonal and the same at each time period, as we would expect in this setup. As expected, the trend in the utility is maintained.

Fig. 3  Expected discounted utility and its differences over time for Profile 1
5.2 Scenario 2

In Scenario 2, respondents are allowed to make similar decisions at each time epoch but with a different rate of change, making the transition probabilities dynamic. The transition parameters $\theta^t_{s_i}$, where $s^t_i = (x_{ij}, x_{ij'})$, are defined for the attributes as
$$\theta^t_{s_i A_k} = \begin{cases} 1.7^t\,|\beta_{A_k}|, & \text{if } x_{ij} \in A_k, \\ -1.7^t\,|\beta_{A_k}|, & \text{if } x_{ij'} \in A_k, \\ \beta_{A_k}, & \text{otherwise,} \end{cases} \qquad (5.3)$$
where $j \ne j'$, $j, j', k = 1, 2, \ldots, K$, $1 \le x_k \le l_k$, and $i = 1, 2, \ldots, G$.
We ran the simulation under this scenario with the advantageous proposed hybrid structure shown above, using the functional form described in Scenario 1, with profile choice pairs shown in Fig. 4. The transition matrix at time $t = 1$ is kept the same as in Scenario 1 (Table 4), and the subsequent transition probabilities at times $t = 2, 3$, and 4 are given in Tables 5, 6, and 7, respectively. The transition probabilities are highest on the diagonal, verifying the direction we wanted in the transitions. The value functions/expected utilities for Profile 1 are displayed in Fig. 5, with the legend displayed in Fig. 4, along with the differences in the value functions. Choice pair $(x_{22}, x_{12})$, where $x_{22}$, the 2nd level of attribute 2, is the best and $x_{12}$, the 2nd level of attribute 1, is the worst, still remains the choice with the highest expected utility, as in Scenario 1. The opposite pair $(x_{12}, x_{22})$ is the worst choice pair. We also notice more shifts in expected utility than in the previous scenario for Profile 1. Scaling the data makes the utilities shift to much more extreme values.
Fig. 5  Expected discounted utility and its differences over time for Profile 1
6 Conclusion
utility under identifiability constraints. Profile-specific trends are displayed and pattern behaviors are exhibited. We highlighted compelling situations that allow shrinkage toward referenced choices and showed efficient data examples to make inferences on the best-worst decisions of interest. Our simulated and aggregated data examples show the flexibility and wide applicability of our proposed techniques. Our methodology is easily reproducible. The functional dependency and time-evolving structure may accommodate additional arrangements and setups.
A potential area of concern in the application of MDPs to attribute-level best-worst DCEs is the curse of dimensionality, as mentioned in Rust (2008). Since the numbers of attributes, attribute-levels, and profiles grow quickly in the experiment, the estimation process becomes exponentially more difficult. DCEs with larger numbers of attributes and attribute-levels have more choice sets and pairs to model across time. For discrete processes, as considered in the attribute-level best-worst DCEs, the amount of information that needs to be stored becomes overwhelming. The ability to guide the system becomes difficult due to the increased number of states and choice sets considered. These issues should be considered when using MDPs.
Extensions of this work may include interactions of choice pairs under different
correlation structures. The first order Markov dependency structure presented here
may be extended to higher order decision processes under stationary and dynamic
transition probabilities. Extensions to the continuous time scale case are being
explored.
References
Arcidiacono, P., Ellickson, P.B.: Practical methods for estimation of dynamic discrete choice
models. Annu. Rev. Econ. 3(1), 363–394 (2011)
Bellman, R.: The theory of dynamic programming. Bull. Am. Math. Soc. 60(6), 503–515 (1954)
Chadès, I., Chapron, G., Cros, M.-J., Garcia, F., Sabbadin, R.: Mdptoolbox: a multi-platform
toolbox to solve stochastic dynamic programming problems. Ecography 37(9), 916–920 (2014)
Coast, J., Salisbury, C., De Berker, D., Noble, A., Horrocks, S., Peters, T., Flynn, T.: Preferences
for aspects of a dermatology consultation. Br. J. Dermatol. 155(2), 387–392 (2006)
Finn, A., Louviere, J.J.: Determining the appropriate response to evidence of public concern: the
case of food safety. J. Public Policy Mark. 11(2), 12–25 (1992)
Flynn, T.N., Louviere, J.J., Peters, T.J., Coast, J.: Best–worst scaling: what it can do for health care
research and how to do it. J. Health Econ. 26(1), 171–189 (2007)
Flynn, T.N., Louviere, J.J., Marley, A.A., Coast, J., Peters, T.J.: Rescaling quality of life values
from discrete choice experiments for use as QALYs: a cautionary tale. Popul. Health Metr.
6(1), 6 (2008)
Grasshoff, U., Grossmann, H., Holling, H., Schwabe, R.: Optimal paired comparison designs for
first-order interactions. Statistics 37(5), 373–386 (2003)
Grasshoff, U., Grossmann, H., Holling, H., Schwabe, R.: Optimal designs for main effects in linear
paired comparison models. J. Stat. Plan. Inference 126(1), 361–376 (2004)
Grasshoff, U., Grossmann, H., Holling, H., Schwabe, R.: Optimal design for discrete choice
experiments. J. Stat. Plan. Inference 143(1), 167–175 (2013)
Grossmann, H., Grasshoff, U., Schwabe, R.: Approximate and exact optimal designs for paired
comparisons of partial profiles when there are two groups of factors. J. Stat. Plan. Inference
139(3), 1171–1179 (2009)
Hole, A.R.: A discrete choice model with endogenous attribute attendance. Econ. Lett. 110(3),
203–205 (2011)
Knox, S.A., Viney, R.C., Street, D.J., Haas, M.R., Fiebig, D.G., Weisberg, E., Bateson, D.: What’s
good and bad about contraceptive products? Pharmacoeconomics 30(12), 1187–1202 (2012)
Knox, S.A., Viney, R.C., Gu, Y., Hole, A.R., Fiebig, D.G., Street, D.J., Haas, M.R., Weisberg, E.,
Bateson, D.: The effect of adverse information and positive promotion on women’s preferences
for prescribed contraceptive products. Soc. Sci. Med. 83, 70–80 (2013)
Lancsar, E., Louviere, J., Donaldson, C., Currie, G., Burgess, L.: Best worst discrete choice
experiments in health: methods and an application. Soc. Sci. Med. 76, 74–82 (2013)
Lee, T.C., Judge, G., Zellner, A.: Maximum likelihood and Bayesian estimation of transition
probabilities. J. Am. Stat. Assoc. 63(324), 1162–1179 (1968)
Louviere, J., Timmermans, H.: Stated preference and choice models applied to recreation research:
a review. Leis. Sci. 12(1), 9–32 (1990)
Louviere, J.J., Woodworth, G.: Design and analysis of simulated consumer choice or allocation
experiments: an approach based on aggregate data. J. Mark. Res. 20(4), 350–367 (1983)
Louviere, J.J., Woodworth, G.G.: Best-worst scaling: a model for the largest difference judgments.
Working Paper. University of Alberta (1991)
Louviere, J.J., Hensher, D.A., Swait, J.D.: Stated Choice Methods: Analysis and Applications.
Cambridge University Press, Cambridge (2000)
Louviere, J., Lings, I., Islam, T., Gudergan, S., Flynn, T.: An introduction to the application of
(case 1) best–worst scaling in marketing research. Int. J. Res. Mark. 30(3), 292–303 (2013)
Louviere, J.J., Flynn, T.N., Marley, A.A.J.: Best-Worst Scaling: Theory, Methods and Applications.
Cambridge University Press, Cambridge (2015)
Luce, R.D.: On the possible psychophysical laws. Psychol. Rev. 66(2), 81 (1959)
Marley, A.A., Louviere, J.J.: Some probabilistic models of best, worst, and best–worst choices. J.
Math. Psychol. 49(6), 464–480 (2005)
Marley, A., Pihlens, D.: Models of best–worst choice and ranking among multiattribute options
(profiles). J. Math. Psychol. 56(1), 24–34 (2012)
Marley, A., Flynn, T.N., Louviere, J.: Probabilistic models of set-dependent and attribute-level
best–worst choice. J. Math. Psychol. 52(5), 281–296 (2008)
Marschak, J.: Binary choice constraints on random utility indicators. Technical Report 74 (1960)
McFadden, D.: Conditional logit analysis of qualitative choice behavior. In: Zarembka, P. (ed.)
Frontiers in Econometrics, pp. 105–142. Academic Press, New York (1974)
McFadden, D.: Modeling the choice of residential location. Transp. Res. Rec. 673, 72–77 (1978)
Parvin, S., Wang, P., Uddin, J.: Using best-worst scaling method to examine consumers’ value
preferences: a multidimensional perspective. Cogent Bus. Manag. 3(1), 1199110 (2016)
Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley,
New York (2014)
Rust, J.: Structural estimation of Markov decision processes. Handb. Econ. 4, 3081–3143 (1994)
Rust, J.: Dynamic programming. In: Durlauf, S.N., Blume, L.E. (eds.) The New Palgrave
Dictionary of Economics. Palgrave Macmillan, Ltd, London (2008)
Stenberg, F., Manca, R., Silvestrov, D.: An algorithmic approach to discrete time non-
homogeneous backward semi-Markov reward processes with an application to disability
insurance. Methodol. Comput. Appl. Probab. 9(4), 497–519 (2007)
Street, D.J., Burgess, L.: The Construction of Optimal Stated Choice Experiments: Theory and
Methods, vol. 647. Wiley, New York (2007)
Street, D.J., Knox, S.A.: Designing for attribute-level best–worst choice experiments. J. Stat. Theory Pract. 6(2), 363–375 (2012)
Thurstone, L.L.: A law of comparative judgment. Psychol. Rev. 34(4), 273 (1927)
Van Der Pol, M., Currie, G., Kromm, S., Ryan, M.: Specification of the utility function in discrete
choice experiments. Value Health 17(2), 297–301 (2014)
Spatial and Spatio-Temporal Analysis of
Precipitation Data from South Carolina
1 Introduction
Tabios and Salas (1985), Georgakakos and Kavvas (1987), Isaaks and Srivastava (1989),
Kumar and Foufoula-Georgiou (1994), Deidda (2000), Ferraris et al. (2003), Ciach
and Krajewski (2006), Berne et al. (2009), Ly et al. (2011), and Dumitrescu
et al. (2016) further advanced the application of geostatistical methods in rainfall
prediction. The theoretical basis of the geostatistical approach was strengthened
using Bayesian inference via the Markov Chain Monte Carlo (MCMC) algorithm
introduced by Metropolis et al. (1953). MCMC was subsequently adapted by
Hastings (1970) for statistical problems and further applied by Diggle et al. (1998) in
geostatistical studies. Recent developments in MCMC computing now allow fully
Bayesian analyses of sophisticated multilevel models for complex geographically
referenced data. This approach also offers full inference for non-Gaussian spatial
data, multivariate spatial data, spatio-temporal data, and solutions to problems such
as geographic and temporal misalignment of spatial data layers (Banerjee et al.
2014).
The data we are studying are monthly rainfall data measured across the state
of South Carolina from the start of 2011 to the end of 2015. The precipitation
record in 2015 is of particular interest because a storm in October 2015 in North
America triggered a high precipitation event, which caused historic flash flooding
across North and South Carolina. Rainfall across parts of South Carolina reached
500-year-event levels (NBC News, October 4, 2015). Accumulations reached
24.23 in. near Boone Hall (Mount Pleasant, Charleston County) by 11:00 a.m.
Eastern Time on October 4, 2015. Charleston International Airport saw a record
24-h rainfall of 11.5 in. (290 mm) on October 3 (Santorelli, October 4, 2015).
Some areas experienced more than 20 in. of rainfall over the 5-day period. Many
locations recorded rainfall rates of 2 in. per hour (National Oceanic and Atmospheric
Administration (NOAA), U.S. Department of Commerce, 2015).
The extraordinary rainfall event was generated by the movement of very moist
air over a stalled frontal boundary near the coast. The clockwise circulation around
a stalled upper level low over southern Georgia directed a narrow plume of tropical
moisture northward and then westward across the Carolinas over the course of
4 days. A low pressure system off the US southeast coast, as well as tropical
moisture related to Hurricane Joaquin (a category 4 hurricane) was the underlying
meteorological cause of the record rainfall over South Carolina during October 1–5,
2015 (NOAA, U.S. Department of Commerce 2015).
Flooding from this event resulted in 19 fatalities, according to the South Car-
olina Emergency Management Department, and South Carolina state officials said
damage losses were 1.492 billion dollars (NOAA, U.S. Department of Commerce
2015). The heavy rainfall and floods, combined with aging and inadequate drainage
infrastructure, resulted in the failure of many dams and flooding of many roads,
bridges, and conveyance facilities, thereby causing extremely dangerous and life-
threatening situations.
The chapter is arranged as follows: in Sect. 2, we give an overview of our
precipitation data, in conjunction with some other variables, e.g., sea surface
temperature, which might help explain the behavior of the precipitation. In Sect. 3,
we introduce the kriging method to analyze the precipitation using a pure spatial model.
2 Data Description
2.1 Overview
The original data used in this research are the daily precipitation records in South
Carolina from the National Oceanic and Atmospheric Administration (NOAA) between
2011 and 2015. The original data files include daily precipitation, maximum
temperature, and minimum temperature, along with the latitude, longitude, and
elevation of each observation location.
In addition, to investigate the effect of El Niño-Southern Oscillation (ENSO)
activity on precipitation, we have calculated an index based on the monthly sea
surface temperature (SST). The derivation of our index is given in Sect. 2.3.
If we denote the missing value at $s^*$ by $Y(s^*)$, then $\sum_{i=1}^{n} w(s_i)\,Y(s_i)$ can be used as the imputed value, where
$$w(s_i) = K\!\left(\frac{\|s^* - s_i\|}{h}\right) \Big/ \sum_{i=1}^{n} K\!\left(\frac{\|s^* - s_i\|}{h}\right). \tag{1}$$
Note that ||si − s∗ || refers to the haversine distance rather than the Euclidean
distance. We impute missing data based on neighboring observations because doing
so takes the spatial correlation into consideration.
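As an illustration, a minimal R sketch of this kernel-weighted imputation is given below. It assumes a hypothetical n × 2 matrix sites of (longitude, latitude) pairs with observed values Y, uses the standard normal density as the kernel, and computes haversine distances with distHaversine() from the geosphere package.

library(geosphere)

# Impute Y at a new (lon, lat) point s_star via (1):
# a kernel-weighted average of the observed values.
impute_at <- function(s_star, sites, Y, h) {
  d <- distHaversine(matrix(s_star, nrow = 1), sites)  # haversine distances
  w <- dnorm(d / h)                                    # Gaussian kernel weights
  sum(w * Y) / sum(w)                                  # normalized weighted average
}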
For any inland location $s_i$ in a given month, we build an index based on the SST
values of the nearest n adjacent ocean observation points $\{z_j\}$, where $j = 1, \dots, n$.
Denote this SST-based index as $W(s_i)$ for the ith inland location. It follows that
$$W(s_i) = \sum_{j=1}^{n} \frac{w_j}{\sum_{l=1}^{n} w_l}\,\mathrm{SST}(z_j), \tag{2}$$
where the weight $w_j$ is determined by a kernel function $K(\cdot)$, symmetric around 0, applied to the distance $\|s_i - z_j\|$ for $j = 1, \dots, n$. We use the standard normal density as the kernel function. The kernel includes a bandwidth h, thus making $w_j = \frac{1}{h}K\!\left(\frac{\|s_i - z_j\|}{h}\right)$. The bandwidth parameter h is set to 0.25 times the range of all of the distances.
Additionally, we simplify the calculation by considering only locations within a
certain threshold. Figure 3 gives a demonstration to calculate the SST-related index
for Columbia, South Carolina. We first determine the sea temperature records to be
included based on a 300-mile threshold. For the included measurements, we find
their weights by calculating their distance to Columbia, and derive the SST-related
index based on (2). Note that the closer a location is to the coast, the more sea surface
temperature records are used to derive an SST-related index for that location.
Fig. 3 A demonstration of the calculation of the SST-related variable. The red points are the
observations that are included in the calculation
In this section, we use a spatial model for the rainfall data without considering
the temporal aspect. Since geostatistical data feature a strong correlation between
adjacent locations, we start by modeling the covariance structure with a variogram,
and then we propose two methods of predicting the rainfall at new locations.
We assume that our spatial process has a mean, μ(s) = E(Y (s)), and that the
variance of Y (s) exists for all s ∈ D. The process Y (s) is said to be Gaussian
if, for any n ≥ 1 and any set of sites {s1, . . . , sn}, Y = (Y(s1), . . . , Y(sn))^T has
a multivariate normal distribution. Moreover, the process is intrinsically stationary if,
for any given n ≥ 1, any set of n sites {s1 , . . . , sn } and any h ∈ Rr , we have
E[Y (s + h) − Y (s)] = 0, and E[Y (s + h) − Y (s)]2 = Var(Y (s + h) − Y (s)) = 2γ (h)
(Banerjee et al. 2014).
In other words, E[Y (s + h) − Y (s)]2 only depends on h, and not the particular
choice of s. The function 2γ (h) is then called the variogram, and γ (h) is called
the semivariogram. Another important concept is that of an isotropic variogram. If
the semivariogram function γ (h) depends upon the separation vector only through
its length ||h|| (distance between observations), then the variogram is isotropic.
Otherwise, it is anisotropic. Isotropic variograms are popular because of simplicity,
interpretability, and, in particular, because a number of relatively simple parametric
forms are available as candidates for the semivariogram, e.g., linear, exponential,
Gaussian, or Matérn (or K-Bessel).
A variogram model is chosen by plotting the empirical semivariogram, a simple
nonparametric estimate of the semivariogram, and then comparing it to the various
theoretical parametric forms (Matheron 1963). For demonstration purposes, we
choose the precipitation values of October 13, 2015, shortly after the flood struck
South Carolina. Assuming intrinsic stationarity and isotropy, the Matérn model is
used due to its better fit to the empirical semivariogram. The correlation function
of this model allows control of spatial association and smoothness. See Fig. 4 for a
plot of this fit.
Fig. 4 The empirical and parametric (Matérn) variogram for the precipitation values on October 13, 2015
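A fit like the one in Fig. 4 can be sketched with the gstat package. The snippet below is illustrative only, assuming a hypothetical SpatialPointsDataFrame prec whose column rain holds the October 13, 2015 precipitation values.

library(sp)
library(gstat)

ev <- variogram(rain ~ 1, prec)          # empirical semivariogram
fv <- fit.variogram(ev, vgm(model = "Mat", kappa = 1))  # Matérn fit; kappa tunable
plot(ev, model = fv)                     # compare empirical and fitted forms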
We use inverse distance weighting (IDW) (Bivand et al. 2008) to compute a spatially
continuous rainfall estimate as a weighted average for a given location $s_0$,
$$\hat{Z}(s_0) = \frac{\sum_i w(s_i)\,Z(s_i)}{\sum_i w(s_i)}, \qquad \text{where } w(s_i) = \|s_i - s_0\|^{-p}.$$
In other words, the weight of an observed location is inversely proportional to the p-th power of its distance to the interpolation location. If location $s_0$ happens to have an observation, then the observation itself is used, which avoids the case of infinite weights. Predictions are influenced most strongly by nearby observations, particularly where the observations are densely clustered. The best p found by cross-validation for the analysis of our data set is approximately 2.5.
Although this method does not incorporate the covariates, it still possesses some
desirable features. For instance, we can make a prediction for the rainfall amount at
every single location with a latitude and longitude.
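With gstat, the IDW surface can be computed in one call. A sketch, reusing the hypothetical prec object above together with a grid of prediction locations grid_pts, is:

library(gstat)

# Inverse distance weighted interpolation with the cross-validated power p = 2.5
idw_surface <- idw(rain ~ 1, prec, newdata = grid_pts, idp = 2.5)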
Since our precipitation data are geostatistical, we may employ a linear Gaussian
process model (Cressie 1993). We start by defining the spatial process at location
$s \in \mathbb{R}^d$ as $Z(s) = x(s)^T\beta + w(s) + \epsilon(s)$, where $x(s)$ collects the covariates at s, $w(s)$ is a zero-mean Gaussian process with exponential covariance $\mathrm{Cov}(w(s_i), w(s_j)) = \kappa\,\exp(-\|s_i - s_j\|/\phi)$, and $\epsilon(s)$ is non-spatial pure error with variance $\psi$.
Note that $\|s_i - s_j\|$ is the Euclidean distance between locations i and j. Another type of distance, the geodesic distance, takes the curvature of the earth's surface into consideration.
We use Euclidean distance since most of our distances are between South Carolina
counties and the effects of curvature are thus negligible.
The exponential model enjoys a simple interpretation. The "nugget" in a
variogram graph is represented by $\psi$ in this model, and this nugget is also the
variance of the non-spatial error. Moreover, $\kappa$ and $\phi$ dictate the scale and range of
the spatial dependence, respectively. Note also that the exponential model assumes
that the covariance, and hence the dependence, between two locations decreases as
the distance between them increases, which is sensible for the study of rainfall behavior.
Letting $Z = (Z(s_1), \dots, Z(s_n))^T$, we estimate the multivariate normal distribution for Z after parameter estimation. To find the unknown covariance parameters $\theta = (\kappa, \phi, \psi)$ and the regression coefficients $\beta$, we use Bayesian methods implemented in the spTimer package in R (Bakar and Sahu 2015), which requires users to provide sensible prior information based on sample variogram graphs. Note that this model fitting process can break down if we start with initial values far from the true values.
Predictions of the process, $Z^* = (Z(s^*_1), \dots, Z(s^*_m))^T$, where $s^*_i$ is the ith new
location, can be obtained via the posterior predictive distribution
$$\pi(Z^* \mid Z) = \int \pi(Z^* \mid Z, \theta, \beta)\,\pi(\theta, \beta \mid Z)\,d\theta\,d\beta,$$
Hence, one can obtain simulated observations that follow a given covariance
structure by iterating between step 1 and step 2. Bivand et al. (2008) suggest the
method of sequential simulation: (1) compute the conditional distribution with our
given data, (2) draw a value from this conditional distribution, (3) add this value into
the data set, and (4) repeat steps (1)–(3).
Because Z grows as more data are generated, the algorithm becomes increasingly
expensive. Many strategies have been proposed for reducing the considerable
computational burden posed by the matrix operations, including dimension reduction
of the spatial process (Hughes and Haran 2013) as well as setting a maximum
number of neighbors (Bivand et al. 2008). In our study, we capped the neighbor
set at the nearest 40 observations.
We illustrate prediction by modeling rainfall in South Carolina on October 13,
2015 with a kriging model that assumes an exponential spatial covariance structure.
Using the Monte Carlo approach described above, we predict by simulating from the
posterior predictive distribution. This can be done repeatedly to give a sense of the
variability associated with the spatial predictions. Figure 5 demonstrates ten such simulations.
Fig. 5 Ten simulated precipitation heat maps based on kriging. Darker colors indicate heavier precipitation. The consistent appearance across simulations suggests robust performance of the kriging model
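Surfaces like those in Fig. 5 can be generated by conditional Gaussian simulation in gstat. A sketch, again with the hypothetical prec, grid_pts, and fitted variogram fv from above, is:

library(gstat)

# Ten conditional simulations; nmax = 40 caps the neighbor set,
# as described above, to limit the cost of the matrix operations.
sims <- krige(rain ~ 1, prec, newdata = grid_pts, model = fv,
              nsim = 10, nmax = 40)
spplot(sims)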
We now analyze the geostatistical rainfall data across time. Due to the nature of our
rainfall data, the seasonality is of particular interest when we model the temporal
trend. We propose two methods to remove the seasonal trend in this section.
To remove the seasonal trend, one approach is to fit a first-order harmonic regression
model with terms sin(x) and cos(x), where x = 2πt if the period is 1. In our case, it
is justifiable to set the period to 12 since the rainfall is measured monthly, and thus
x = (π/6)t is used. Hence, one can regress the precipitation y on the explanatory
variables sin((π/6)t) and cos((π/6)t). The omnibus F-test for the usefulness of the
trigonometric terms in this multiple regression model gives a p-value close to 0,
which confirms the presence of seasonality.
One can also use a second-order harmonic model to capture more complex
behavior, in which two more terms, sin[(4π/ω)t] and cos[(4π/ω)t], are included,
where ω is the period. However, for our rainfall data it is unnecessary to include
these two extra terms, since we observe no great improvement in model fit from
introducing them (see Fig. 6); a sketch of both fits appears after the figure caption below.
Fig. 6 The fitted model based on the first- and second-order harmonic models. The dotted line
corresponds to the second-order model, and the solid red line corresponds to the first-order model
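Both harmonic fits are ordinary linear regressions. A minimal sketch, assuming a hypothetical monthly precipitation series y indexed by t = 1, 2, . . ., is:

t <- seq_along(y)
fit1 <- lm(y ~ sin(pi * t / 6) + cos(pi * t / 6))                # first-order
fit2 <- update(fit1, . ~ . + sin(pi * t / 3) + cos(pi * t / 3))  # second-order
anova(fit1, fit2)  # tests whether the extra harmonics improve the fit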
In this section, we discuss how to model spatio-temporal data with two different
methods, the Gaussian process (GP) model and autoregressive (AR) model. The
latter model is an extension of the Gaussian process model obtained by introducing
an autoregressive term.
The independent Gaussian process (GP) model (Cressie and Wikle 2015; Gelfand
et al. 2010) is specified hierarchically in two stages,
$$Z_t = \mu_t + \epsilon_t \tag{4}$$
$$\mu_t = X_t\beta + \eta_t, \tag{5}$$
in which $Z_t = (Z(s_1, t), \dots, Z(s_n, t))^T$ defines the response variable for
all n locations at time t; the sites $s_1, \dots, s_n$ are indexed by latitude and
longitude. In the first layer, $Z_t$ is defined by a simple mean model plus a pure white
noise term $\epsilon_t$. We therefore assume that $\epsilon_t \sim N(0, \sigma_\epsilon^2 I_n)$,
where $\sigma_\epsilon^2$ is the pure error variance and $I_n$ is the identity matrix.
The second level models $\mu_t$ as the sum of fixed covariates and random effects at
time t. The fixed term, $X_t\beta$, comes from the covariates, and $\eta_t = (\eta(s_1, t), \dots, \eta(s_n, t))^T$ is a vector of spatio-temporal random effects. Like $\epsilon_t$, $\eta_t$ follows a
multivariate normal distribution with mean vector 0; however, $\eta_t$ has a more
complicated covariance matrix than does $\epsilon_t$.
We use the exponential function to specify the correlation matrix of the random
effects, where the correlation strength depends solely on the distance between $s_i$ and $s_j$:
$$\Sigma_\eta = \sigma_\eta^2\,H(\phi) + \tau^2 I_n,$$
where $H(\phi)_{ij} = \exp(-\|s_i - s_j\|/\phi)$ and $\|s_i - s_j\|$ denotes the spatial distance
between locations i and j. This function determines each element of the matrix $S_\eta$,
where $\Sigma_\eta = \sigma_\eta^2 S_\eta$. This parameterization allows $\sigma_\eta^2$ to capture the
invariant spatial variance, while $S_\eta$ captures the spatial correlation.
The posterior distribution involves three layers, i.e., the prior distribution for the
parameters, the mean model, and the random effects model. We will set aside the
prior for later discussion and use $\pi(\theta) = \pi(\beta, \nu, \phi, \sigma_\eta^2, \sigma_\epsilon^2)$ to refer to the prior in
general. Thus the posterior is given by
$$g(\theta, \mu \mid Z) = \pi(\theta) \times \prod_{t=1}^{N} f_n(Z_t \mid \mu_t, \sigma_\epsilon^2)\,g_n(\mu_t \mid \beta, \nu, \phi, \sigma_\eta^2). \tag{7}$$
Thus the posterior distribution is given by plugging (8) and (9) into (7). The
logarithm of the joint posterior distribution of the parameters for this Gaussian
process model is given by
$$\log \pi(\sigma_\epsilon^2, \sigma_\eta^2, \mu, \beta, \nu, \phi \mid Z) \propto -\frac{N}{2}\log\sigma_\epsilon^2 - \frac{1}{2\sigma_\epsilon^2}\sum_{t=1}^{N}(Z_t - \mu_t)^T(Z_t - \mu_t)$$
$$\qquad -\ \frac{N}{2}\log\left|\sigma_\eta^2 S_\eta\right| - \frac{1}{2\sigma_\eta^2}\sum_{t=1}^{N}(\mu_t - X_t\beta)^T S_\eta^{-1}(\mu_t - X_t\beta) + \log\pi(\theta).$$
We specify the prior $\pi(\theta)$ to reflect the assumption that $\beta$, $\nu$, $\phi$, $\sigma_\eta^2$, and $\sigma_\epsilon^2$
are mutually independent, so the joint prior is the product of the marginal prior
densities, which are given as follows. All the parameters describing the mean,
e.g., $\beta$ and $\rho$ (see Sect. 5.2), are given independent normal prior distributions, with
the prior on $\rho$ truncated to have support on (−1, 1). We assume $\phi$ and $\nu$ both
follow uniform distributions, while the prior for the precision (inverse of variance)
parameter is a gamma distribution. We choose the hyperparameters to make these
prior distributions very diffuse.
In this section, we introduce the autoregressive model (Sahu and Bakar 2012). The
hierarchical AR(1) model is given as follows:
$$Z_t = \mu_t + \epsilon_t$$
$$\mu_t = \rho\,\mu_{t-1} + X_t\beta + \eta_t,$$
and the corresponding logarithm of the joint posterior distribution is
$$\log \pi(\sigma_\epsilon^2, \sigma_\eta^2, \mu, \beta, \nu, \phi \mid Z) \propto -\frac{N}{2}\log\sigma_\epsilon^2 - \frac{1}{2\sigma_\epsilon^2}\sum_{t=1}^{N}(Z_t - \mu_t)^T(Z_t - \mu_t)$$
$$\qquad -\ \frac{N}{2}\log\left|\sigma_\eta^2 S_\eta\right| - \frac{1}{2\sigma_\eta^2}\sum_{t=1}^{N}(\mu_t - \rho\mu_{t-1} - X_t\beta)^T S_\eta^{-1}(\mu_t - \rho\mu_{t-1} - X_t\beta)$$
$$\qquad -\ \frac{1}{2}\log\left|\sigma_0^2 S_0\right| - \frac{1}{2\sigma_0^2}(\mu_0 - \beta_0)^T S_0^{-1}(\mu_0 - \beta_0) + \log\pi(\theta)$$
Note that $\beta_0$ is the mean vector for the initial random effect term only; it is
distinct from $\beta$, which refers to the regression coefficients corresponding to the covariates
X. In other words, the terms in the last line (except $\log\pi(\theta)$) derive from the initial
random effect term.
In this section, we fit the AR(1) model with monthly precipitation data from the
beginning of year 2011 to the end of year 2015. A natural log transformation was
initially applied to the precipitation to improve the model fit and ensure positive
predicted rainfall values once we back-transform by exponentiating the predicted
log-rainfall values. We include temperature range, sea surface temperature, and
elevation as monthly covariates.
We initially found that ordinary temperature measurements such as the monthly
average temperature were not apparently related to precipitation after accounting for
the season and thus we did not include these in the model. However, measurements
of variability in temperature over each month, e.g., the range of daily maxima and
the range of daily minima over a month, were believed to have an effect on precip-
itation and thus we include these to determine whether their effects are significant.
We also include a flood-year indicator as a dummy variable, where data
from 2015 are labeled 1 and 0 otherwise, to account for the unusual October
precipitation amounts in that year. Interaction terms involving the dummy variable
were also tested; none were statistically significant, and they were thus removed
from the final model. The acceptance rate from the Metropolis step for all parameters was
42.97%, and a brief summary of the model fitting details is given as follows:
-----------------------------------------------------
Model: AR
Call: LOG ~ RANGE_OVERALL + RANGE_LOW + RANGE_HIGH
+ SST + ELEVATION + SST * RANGE_LOW + Year2015
Iterations: 5000
nBurn: 1000
Acceptance rate: 29.76
-----------------------------------------------------
Parameters
Mean Median SD Low2.5p Up97.5p
(Intercept) 0.3635 0.3689 0.1363 0.0894 0.6265
RANGE_OVERALL -0.0006 -0.0006 0.0017 -0.0039 0.0027
RANGE_LOW 0.0017 0.0017 0.0030 -0.0040 0.0078
RANGE_HIGH 0.0006 0.0007 0.0011 -0.0016 0.0028
SST -0.0057 -0.0058 0.0045 -0.0142 0.0033
ELEVATION 0.0001 0.0001 0.0001 0.0000 0.0002
Year2015 0.0808 0.0810 0.0180 0.0450 0.1154
RANGE_LOW:SST -0.0001 -0.0001 0.0001 -0.0003 0.0001
rho 0.0756 0.0757 0.0151 0.0466 0.1054
sig2eps 0.0054 0.0054 0.0002 0.0051 0.0057
sig2eta 0.0764 0.0739 0.0121 0.0617 0.1073
phi 0.0501 0.0502 0.0090 0.0322 0.0659
-----------------------------------------------------
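A fit of this general form can be specified with spTimer. The sketch below is illustrative only, assuming a hypothetical data frame dat containing the variables named in the call plus longitude/latitude columns, with priors and tuning left at package defaults rather than the exact settings used here.

library(spTimer)

fit <- spT.Gibbs(formula = LOG ~ RANGE_OVERALL + RANGE_LOW + RANGE_HIGH +
                   SST + ELEVATION + SST:RANGE_LOW + Year2015,
                 data = dat, model = "AR", coords = ~ LON + LAT,
                 nItr = 5000, nBurn = 1000)
summary(fit)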
Fig. 8 The 95% credible intervals for β1 (the SST-related variable) and β2 (elevation) over the 12 months of 2015
Fig. 9 The residual plot and QQ plot from the state-space model
7 Discussion
We have presented both spatial and spatio-temporal models for rainfall in South Car-
olina during a period including one of the most destructive storms in state history.
Our models have allowed us to determine several covariates that affect the rainfall
and to interpret their effects. In particular, the flood year of 2015 was an important
indicator of rainfall, and elevation also had a significant positive effect on precipitation.
There was a significant positive correlation in rainfall measurements over
time. Finally, our novel SST index provided some evidence that cooler nearby sea
temperatures corresponded to higher rainfall at inland sites, although this SST effect
was not significant at the 0.05 level based on a 95% credible interval for its effect.
A spatial prediction at a new location and a temporal prediction at a future time
point can be obtained based on the posterior predictive distribution for $Z(s_0, t')$,
where $s_0$ denotes a new location and $t'$ is a future time point. Further details
regarding these predictions are provided in Cressie and Wikle (2015) for the GP
models, and Sahu and Bakar (2012) for the AR models.
A limitation of the study, and a direction for future research, is that the model
does not account for the apparent heavy-tailed nature of the errors. Methods
involving generalized extreme value distribution (Rodríguez et al. 2016) could
possibly be adapted to this model to help handle this heavy-tailed error structure,
but such research is still relatively new in the spatio-temporal modeling literature.
References
Bakar, K.S., Sahu, S.K.: spTimer: spatio-temporal Bayesian modeling using R. J. Stat. Softw. 63(15), 1–32 (2015)
Banerjee, S., Carlin, B.P., Gelfand, A.E.: Hierarchical Modeling and Analysis for Spatial Data.
CRC Press, Boca Raton (2014)
Benzécri, J.P.: L’Analyse des Données. Dunod, Paris (1973)
Berne, A., Delrieu, G., Boudevillain, B.: Variability of the spatial structure of intense Mediter-
ranean precipitation. Adv. Water Resour. 32(7), 1031–1042 (2009)
Bivand, R.S., Pebesma, E.J., Gomez-Rubio, V., Pebesma, E.J.: Applied Spatial Data Analysis with
R. Springer, New York (2008)
Ciach, G.J., Krajewski, W.F.: Analysis and modeling of spatial correlation structure in small-scale
rainfall in central Oklahoma. Adv. Water Resour. 29(10), 1450–1463 (2006)
Cressie, N.: Statistics for Spatial Data. Wiley, New York (1993)
Cressie, N., Wikle, C.K.: Statistics for Spatio-Temporal Data. Wiley, New York (2015)
Deidda, R.: Rainfall downscaling in a space-time multifractal framework. Water Resour. Res.
36(7), 1779–1794 (2000)
Delfiner, P., Delhomme, J.P.: Optimum Interpolation by Kriging. Ecole Nationale Supérieure des
Mines, Paris (1975)
Delhomme, J.P.: Kriging in the hydrosciences. Adv. Water Resour. 1(5), 251–266 (1978)
Diggle, P.J., Tawn, J.A., Moyeed, R.A.: Model-based geostatistics. J. R. Stat. Soc.: Ser. C: Appl.
Stat. 47(3), 299–350 (1998)
Dima, M., Lohmann, G.: Evidence for two distinct modes of large-scale ocean circulation changes
over the last century. J. Clim. 23(1), 5–16 (2010)
Dumitrescu, A., Birsan, M.V., Manea, A.: Spatio-temporal interpolation of sub-daily (6 h)
precipitation over Romania for the period 1975–2010. Int. J. Climatol. 36(3), 1331–1343 (2016)
Ferraris, L., Gabellani, S., Rebora, N., Provenzale, A.: A comparison of stochastic models for
spatial rainfall downscaling. Water Resour. Res. 39(12), 1368 (2003). https://doi.org/10.1029/
2003WR002504
Finley, A.O., Banerjee, S., Carlin, B.P.: spBayes: an R package for univariate and multivariate
hierarchical point-referenced spatial models. J. Stat. Softw. 19(4), 1 (2007)
Gelfand, A.E., Diggle, P., Guttorp, P., Fuentes, M.: Handbook of Spatial Statistics. CRC Press,
Boca Raton (2010)
Georgakakos, K.P., Kavvas, M.L.: Precipitation analysis, modeling, and prediction in hydrology.
Rev. Geophys. 25(2), 163–178 (1987)
Häkkinen, S.: Decadal air-sea interaction in the North Atlantic based on observations and modeling
results. J. Clim. 13(6), 1195–1219 (2000)
Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications.
Biometrika 57(1), 97–109 (1970)
Hughes, J., Haran, M.: Dimension reduction and alleviation of confounding for spatial generalized
linear mixed models. J. R. Stat. Soc. Ser. B Stat Methodol. 75(1), 139–159 (2013)
Isaaks, H.E., Srivastava, R.M.: Applied Geostatistics. Oxford University Press, New York (1989)
Kumar, P., Foufoula-Georgiou, E.: Characterizing multiscale variability of zero intermittency in
spatial rainfall. J. Appl. Meteorol. 33(12), 1516–1525 (1994)
Ly, S., Charles, C., Degre, A.: Geostatistical interpolation of daily rainfall at catchment scale: the
use of several variogram models in the Ourthe and Ambleve catchments, Belgium. Hydrol.
Earth Syst. Sci. 15(7), 2259–2274 (2011)
Matheron, G.: Principles of geostatistics. Econ. Geol. 58(8), 1246–1266 (1963)
Mehta, V., Suarez, M., Manganello, J.V., Delworth, T.D.: Oceanic influence on the North Atlantic
oscillation and associated northern hemisphere climate variations: 1959–1993. Geophys. Res.
Lett. 27(1), 121–124 (2000)
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state
calculations by fast computing machines. J. Chem. Phys. 21(6), 1087–1092 (1953)
National Oceanic and Atmosphere Administration, U.S. Department of Commerce: Service
assessment: the historic South Carolina floods of October 1–5, 2015. www.weather.gov/
media/publications/assessments/SCFlooding_072216_Signed_Final.pdf (2015). Accessed 4
Dec 2017
Rodríguez, S., Huerta, G., Reyes, H.: A study of trends for Mexico city ozone extremes: 2001–
2014. Atmósfera 29(2), 107–120 (2016)
Sahu, S.K., Bakar, K.S.: Hierarchical Bayesian autoregressive models for large space–time data
with applications to ozone concentration modeling. Appl. Stoch. Model. Bus. Ind. 28(5), 395–
415 (2012)
Samadi, S., Tufford, D., Carbone, G.: Estimating hydrologic model uncertainty in the presence of
complex residual error structures. Stoch. Environ. Res. Risk Assess. 32(5), 1259–1281 (2018)
Sharon, D.: Spatial analysis of rainfall data from dense networks. Hydrol. Sci. J. 17(3), 291–300
(1972)
Stroud, J.R., Müller, P., Sansó, B.: Dynamic models for spatio-temporal data. J. R. Stat. Soc. Ser.
B Stat. Methodol. 63(4), 673–689 (2001)
Tabios III, Q.G., Salas, J.D.: A comparative analysis of techniques for spatial interpolation of
precipitation. Water Resour. Bull. 21(3), 365–380 (1985)
Thiessen, A.H.: Precipitation averages for large areas. Mon. Weather Rev. 39(7), 1082–1084 (1911)
Troutman, B.M.: Runoff prediction errors and bias in parameter estimation induced by spatial
variability of precipitation. Water Resour. Res. 19(3), 791–810 (1983)
Wang, C., Enfield, D.B., Lee, S.K., Landsea, C.W.: Influences of the Atlantic warm pool on western
hemisphere summer rainfall and Atlantic hurricanes. J. Clim. 19(12), 3011–3028 (2006)
A Sparse Areal Mixed Model
for Multivariate Outcomes,
with an Application to Zero-Inflated
Census Data
1 Introduction
The Committee on National Statistics assembled the Panel to Review the 2010
Census to suggest general priorities for research in preparation for the 2020
Census. In their first interim report (Cook et al. 2011) the Panel laid out three
recommendations, the first of which highlighted “four priority topic areas, in order
to achieve a lower cost and high-quality 2020 Census.” A theme across these priority
areas was the effective use of Census Bureau databases (e.g., geographic databases
and databases built with administrative records) to achieve operational objectives.
In addition to implementing recommendations from the Panel to Review the 2010
Census, the Census Bureau is placing increasing emphasis on accurate model-based
predictions as a way to more generally conduct efficient and cost-effective surveys
(U.S. Census Bureau 2015).
One of the Census Bureau’s most prominent databases is the Master Address File
(MAF), which is a continually updated inventory of all known living quarters in the
USA and its island territories. The MAF is used as a sampling frame for various
Census Bureau surveys, including the decennial Census. The MAF comprises
D. Musgrove
Medtronic, Minneapolis, MN, USA
D. S. Young ()
Department of Statistics, University of Kentucky, Lexington, KY, USA
e-mail: derek.young@uky.edu
J. Hughes
Department of Biostatistics and Informatics, University of Colorado, Denver, CO, USA
L. E. Eberly
Division of Biostatistics, University of Minnesota, Minneapolis, MN, USA
In this section we develop our multivariate sparse areal mixed model (MSAMM).
Our approach is similar to the approach of Bradley et al. (2015) in that we, too,
employ the orthogonal, multiresolutional spatial basis described by Hughes and
Haran (2013) (see also Griffith (2003) and Tiefelsdorf and Griffith (2007)). This
basis, known as the Moran (1950) basis, is appealing from a modeling point of view
and also permits efficient computing.
where g is a link function; y = (y1, . . . , yn)′ are the outcomes, the ith of which
is associated with the ith areal unit; X is an n × p design matrix; and β is a p-vector
of regression coefficients. To avoid the spatial confounding and computational burden
of the traditional CAR specification (described below), Hughes and Haran (2013) intro-
duced their sparse areal mixed model (SAMM). In signal processing, statistics, and
related fields, it is not uncommon to use the term “sparse” to refer to representation
of a signal in terms of a small number of generating elements drawn from an
appropriately chosen domain (Donoho and Elad 2003). We use the term “sparse”
in precisely this sense, since our model accomplishes spatial smoothing by using
q ≪ n Moran basis vectors (as opposed to traditional CAR models, which have
approximately n spatial random effects). The SAMM can be developed as follows.
Reich et al. (2006) showed that the traditional CAR models are spatially
confounded in the sense that the random effects can “pollute” the regression
manifold C(X), which can lead to a biased and variance-inflated posterior for β.
To see this, first let P be the orthogonal projection onto C(X), so that In − P is the
orthogonal projection onto C(X)⊥ . Now eigendecompose P and In − P to obtain
orthogonal bases Kn×p and Ln×(n−p) for C(X) and C(X)⊥ , respectively. Then (1)
can be rewritten as
g(μ) = Xβ + Kγ + Lδ,
where γ (p × 1) and δ ((n − p) × 1) are random coefficients. This form shows that K is the
source of the confounding, for K and X have the same column space.
Since the columns of K are merely synthetic predictors (i.e., they have no
scientific meaning), Reich et al. (2006) recommend removing them from the model.
The resulting model (henceforth the RHZ model) has
g(μ) = Xβ + Lδ,
Fig. 1 Three Moran basis vectors, exhibiting spatial patterns of increasingly finer scale
Boots and Tiefelsdorf (2000) showed that (1) the (standardized) spectrum of a
Moran operator comprises the possible values for the corresponding IX (A), and (2)
the eigenvectors comprise all possible mutually distinct patterns of clustering resid-
ual to C(X) and accounting for G. The positive (negative) eigenvalues correspond
to varying degrees of positive (negative) spatial dependence, and the eigenvectors
associated with a given eigenvalue (ωi , say) are the patterns of spatial clustering that
data exhibit when the dependence among them is of degree ωi .
In other words, the eigenvectors of the Moran operator form a multiresolutional
spatial basis for C(X)⊥ that exhausts all possible patterns that can arise on G.
Since we do not expect to observe repulsion in the phenomena to which these
models are usually applied, we can use the spectrum of the operator to discard all
repulsive patterns, retaining only attractive patterns for our analysis. By retaining
only eigenvectors that exhibit positive spatial dependence, we can usually reduce
the model dimension by at least half a priori. Hughes and Haran (2013) showed that
a much greater reduction is often possible in practice, with 50–100 eigenvectors
being sufficient in many cases. Three example Moran vectors are shown in Fig. 1.
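A direct (dense) construction of this basis is a few lines of R. The sketch below assumes a binary n × n adjacency matrix A and an n × p design matrix X, and is intended only to make the recipe concrete; for large n one would use a sparse, truncated eigensolver instead.

# First q Moran basis vectors: eigenvectors of the Moran operator
# (I - P) A (I - P) associated with the largest eigenvalues.
moran_basis <- function(X, A, q) {
  n <- nrow(X)
  P <- X %*% solve(crossprod(X), t(X))             # projection onto C(X)
  Omega <- (diag(n) - P) %*% A %*% (diag(n) - P)   # Moran operator
  eig <- eigen(Omega, symmetric = TRUE)            # eigenvalues in decreasing order
  eig$vectors[, seq_len(q), drop = FALSE]
}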
Let M (n × q) contain the first q ≪ n eigenvectors of the Moran operator. Then the
SAMM has first stage
g(μ) = Xβ + Mδs,
where δs ("s" for "sparse") is a q-vector of random coefficients that are assumed to
be jointly Gaussian:
$$\delta_s \mid \tau_s \sim \mathcal{N}\!\left(0,\ \left(\tau_s\, M'QM\right)^{-1}\right). \tag{2}$$
The precision matrix chosen for (2), however, is not arbitrary (see Reich et al. (2006) and/or Hughes and Haran (2013) for derivations) but is, in fact, very well suited to the task at hand.
Specifically, two characteristics of (2) discourage overfitting even when q is too
large for the dataset being analyzed. First, the prior variances are commensurate with
the spatial scales of the predictors in M. This shrinks toward zero the coefficients
corresponding to predictors that exhibit small-scale spatial variation. Additionally,
the correlation structure of (2) effectively reduces the degrees of freedom in the
smoothing component of the model.
A number of multivariate CAR (MCAR) models have been developed (Carlin and
Banerjee 2003; Gelfand and Vounatsou 2003; Jin et al. 2005; Martinez-Beneito
2013). These models have the same drawbacks as their univariate counterparts, but
of course entail even more burdensome computation. Thus it is desirable to develop
a SAMM for multivariate outcomes. We begin by reviewing the MCAR model that
is the multivariate analog of the traditional univariate CAR model described above.
Suppose we observe multiple outcomes at each areal unit and that each outcome
has its own regression component and collection of spatial effects. Specifically,
for j ∈ {1, . . . , J} we have outcomes yj = (y1j, . . . , ynj)′, design matrix Xj,
regression coefficients βj, and spatial effects φj = (φ1j, . . . , φnj)′. Then the
transformed conditional mean vectors are given by
gj (μj ) = Xj β j + φ j .
Recently, Bradley et al. (2015) introduced the Moran’s I (MI) prior, which is a
multivariate spatiotemporal model based on the SAMM. We introduce a multivariate
model that uses a similar prior but is strictly for spatial data. We call our model the
multivariate SAMM (MSAMM). The MSAMM serves as the foundation for the
zero-inflated count model that we focus on below in Sect. 4.
Construction of the MSAMM is of course analogous to construction of the
SAMM. For $j \in \{1, \dots, J\}$, let $P_j = X_j(X_j'X_j)^{-1}X_j'$, and let $M_j$ be a matrix,
the columns of which are the first q eigenvectors of $(I_n - P_j)A(I_n - P_j)$. Denote
the prior precision matrix as $Q_{sj} = M_j'QM_j$, and let $R_{sj}$ be the upper Cholesky
triangle of $Q_{sj}$ so that $R_{sj}'R_{sj} = Q_{sj}$. Then the MSAMM can be specified as
$$g_j(\mu_j) = g_j\!\left(E\left[y_j \mid \beta_j, \delta_{sj}\right]\right) = X_j\beta_j + M_j\delta_{sj}$$
$$p(\Delta \mid \Sigma) \propto \exp\!\left\{-\frac{1}{2}\,\Delta'R'\left(\Sigma^{-1} \otimes I_q\right)R\,\Delta\right\},$$
where $\Delta = (\delta_{s1}', \dots, \delta_{sJ}')'$ and $R = \mathrm{bdiag}(R_{s1}, \dots, R_{sJ})$.
In either case the precision matrix is invertible, and so the prior distribution is proper.
Although using a truncated Moran basis dramatically reduces the time required to
draw samples from the posterior, and the space required to store those samples, this
approach does incur the substantial up-front burden of computing and eigendecom-
posing (In −Pj )A(In −Pj ). The efficiency of the former can be increased by storing
A in a sparse format and parallelizing the matrix multiplications. And we can more
efficiently obtain the desired basis vectors by computing only the first q eigenvectors rather than the full eigendecomposition.
In recent years, many novel finite mixture models have been developed to incorporate
spatial dependencies. Alfó et al. (2009) used finite mixture models to analyze
multiple spatially correlated counts, where the dependence among outcomes is
modeled using a set of correlated random effects. Green and Richardson (2002)
developed a class of hidden Markov models in the spatial domain to analyze spatial
heterogeneity of count data on a rare phenomenon. Neelon et al. (2015) developed a
broad class of Bayesian two-part models for the spatial analysis of semicontinuous
data. Torabi (2016) proposed a hierarchical multivariate mixture generalized linear
model to simultaneously analyze spatial normal and non-normal outcomes. Zero-
inflated count models are often applied in non-spatial settings, e.g., in manufacturing
(Lambert 1992), where defective materials are rare and the number of defects is
assumed to follow a Poisson distribution, and in the hunger-for-bonus phenomenon
that occurs in risk assessment for filed insurance claims (Boucher et al. 2009).
Spatial zero-inflated count models have been applied to various types of data,
including animal sightings (Agarwal et al. 2002; Ver Hoef and Jansen 2007; Recta
et al. 2012), plant distribution (Rathbun and Fei 2006), tornado reports (Wikle and
Anderson 2003), and emergency room visits (Neelon et al. 2013).
Two common approaches to modeling zero-inflated counts are the hurdle model
and the zero-inflated-Poisson (ZIP) model (Lambert 1992). For a hurdle model, the
outcome is 0 with probability 1 − π , and with probability π the outcome arose from
a zero-truncated Poisson (ZTP) distribution (Cohen 1960; Singh 1978). Formally,
the hurdle model is of the form
$$P(y = 0) = 1 - \pi$$
$$P(y = k) = \pi\,\frac{\exp(-\lambda)\,\lambda^k}{\left\{1 - \exp(-\lambda)\right\}k!} \qquad (k \in \mathbb{N},\ k \ge 1).$$
The ZIP model, by contrast, is of the form
$$P(y = 0) = (1 - \pi) + \pi\exp(-\lambda)$$
$$P(y = k) = \pi\,\exp(-\lambda)\,\frac{\lambda^k}{k!} \qquad (k \in \mathbb{N},\ k \ge 1).$$
Let us compare and contrast the hurdle and ZIP models informally. Each model
can be viewed as comprising a binary process (the incidence process) and a counting
process (the prevalence process). For the hurdle model there is only one source of
zeros, namely the binary process. If the binary outcome is 0, no count is observed.
If the binary outcome is 1, a nonzero count is observed. The ZIP model differs in
that it posits two sources of zeros. If the binary outcome is 0, no count is observed.
$$\pi_i = \frac{\exp(\eta_{i1})}{1 + \exp(\eta_{i1})}, \qquad \lambda_i = \exp(\eta_{i2}),$$
where πi is the probability of incidence for the ith areal unit, and λi is the ZTP rate
for the ith areal unit. The within-unit covariance matrix is of course 2 × 2 for this
model. Clearly, this model accommodates (1) spatial dependence among areal units,
and (2) dependence between the incidence process and the prevalence process. For
the case of a positive off-diagonal value in Σ, the latter source of dependence implies
that a higher probability of incidence is associated with a higher prevalence rate. As
we will see in the next section, our MSAMM hurdle model’s ability to accommodate
consequential dependence between πi and λi permits improved inference and fit.
To assess the performance of our areal hurdle model, we carried out a simulation
study. We simulated data for the 2600 census block groups of the US state of Iowa
(Fig. 2). We included an intercept term and, as a covariate, the percentage of housing
units occupied by renters (see Sect. 4.4).
Fig. 2 A single simulated zero-inflated dataset for the census block groups of Iowa. The proportion of renters in each block group was used as a covariate. Panel (a) displays the probabilities of incidence. Panel (b) displays the ZTP rates, where a given rate is nonzero only if the underlying binary outcome is equal to 1
We used β1 = (−1, 1)′ and β2 = (2, −1)′.
These values for β 1 indicate that block groups with a high percentage of renters
will be more likely to take nonzero values than block groups with a low percentage
of renters. The values of β 2 indicate that as the proportion of renters increases, the
ZTP rate decreases. We used 4 and 8 as the diagonal elements of Σ, and we used
five different values for the within-unit correlation: ρ = 0, 0.2, 0.4, 0.6, 0.8.
We constructed Qs as detailed in Sect. 2.3 and used the same Qs for both
processes. Eigendecomposition of the Moran operator yielded 1000 basis vectors
exhibiting patterns of positive spatial dependence. We used the first q = 250
eigenvectors to construct M. This choice of q allowed the responses to exhibit
both small- and large-scale spatial variation. We then simulated spatial effects
$\Delta = (\delta_{s1}', \delta_{s2}')'$ from a zero-mean Gaussian distribution with covariance $\Sigma \otimes Q_s^{-1}$.
For the ith block group we simulated $y_{i1}$ from the Bernoulli distribution with
success probability $\pi_i = \mathrm{logit}^{-1}(x_i'\beta_1 + m_i'\delta_{s1})$. Conditional on $y_{i1} = 1$, we drew
$y_{i2}$ from the zero-truncated Poisson distribution with rate $\lambda_i = \exp(x_i'\beta_2 + m_i'\delta_{s2})$.
Finally, we let yi = 0 if yi1 = 0, or yi = yi2 if yi1 = 1.
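As a rough illustration of this data-generating scheme, the R sketch below simulates one such dataset. The inputs are toy stand-ins; in particular, M here is a random matrix rather than an actual Moran basis, and the spatial effects are drawn independently rather than from Σ ⊗ Qs^{-1}.

set.seed(1)
n <- 2600; q <- 250
X <- cbind(1, runif(n))                      # intercept + renter proportion
M <- matrix(rnorm(n * q), n, q) / sqrt(q)    # stand-in for the Moran basis
beta1 <- c(-1, 1); beta2 <- c(2, -1)
delta1 <- rnorm(q, 0, 0.1); delta2 <- rnorm(q, 0, 0.1)
pi_i <- plogis(X %*% beta1 + M %*% delta1)   # incidence probabilities
y1 <- rbinom(n, 1, pi_i)                     # binary incidence
lam <- exp(X %*% beta2 + M %*% delta2)       # ZTP rates
# zero-truncated Poisson draws via the inverse-CDF trick (forces y2 >= 1)
y2 <- qpois(runif(n, dpois(0, lam), 1), lam)
y <- ifelse(y1 == 1, y2, 0)                  # hurdle outcome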
We analyzed 1000 simulated datasets for each of the five correlations. To assess
the importance of modeling the dependence within areal units, we applied both
the MSAMM and independent SAMMs to each dataset. Key results are shown in
Table 1. Extended results, including credible interval coverage rates, are included in
an appendix. We see that neglecting within-unit dependence leads to larger biases
and, for some parameters, larger mean squared errors, especially for larger values
of ρ.
Note that we did not compare the performance of our model to the performance
of one or more hurdle MCAR models, for three reasons. First, to the best of our
knowledge, no hurdle MCAR model has been implemented in software. Second, any
MCAR model is spatially confounded for the same reason that the univariate CAR
models are spatially confounded. And third, fitting any MCAR model would be
terribly burdensome computationally (with respect to both running time and storage
requirement) for the same reason that fitting any univariate CAR model would be
burdensome. Hence, pitting a hurdle MCAR model against our hurdle MSAMM
would have led to no new knowledge.
In this section we apply our areal hurdle model to address deletes from the 2010 US
Census within the state of Iowa. Recall that a delete is defined as an address that was
deleted from the base count because it did not correspond to a valid housing unit.
Fig. 3 Histogram of the number of address deletes (horizontal axis: number of deletes, 0–10; vertical axis: observed percent) for the 2600 block groups of Iowa from the 2010 Census. Approximately 75% of the outcomes are zeros and 4% of the outcomes are greater than 10
Table 2 Iowa address delete results for the MSAMM versus independent SAMMs
                               MSAMM                      Independent SAMMs
Predictor/parameter   Posterior mean  95% CI             Posterior mean  95% CI
Intercept                       7.23  (4.43, 9.83)                 7.48  (4.93, 10.53)
RURAL_POP                       2.72  (2.18, 3.30)                 2.70  (2.13, 3.29)
OCCP_HU                         0.68  (−2.68, 4.60)                0.25  (−3.21, 3.61)
RENTER_OCCP_HU                 −3.51  (−5.73, −1.41)              −3.60  (−6.01, −1.16)
TEA_MAIL                       −6.95  (−7.54, −6.37)              −6.98  (−7.60, −6.40)
FIRST_FRMS                     −7.10  (−9.93, −4.28)              −6.86  (−9.82, −3.86)
σ1²                             0.67  (0.30, 2.04)                 0.94  (0.21, 2.52)
Intercept                       3.99  (3.50, 4.50)                 3.92  (3.07, 4.69)
RURAL_POP                      −0.00  (−0.18, 0.15)                0.02  (−0.14, 0.18)
OCCP_HU                         2.52  (1.07, 3.59)                 2.00  (0.48, 3.17)
RENTER_OCCP_HU                 −1.47  (−2.28, −0.68)              −1.16  (−1.99, −0.16)
TEA_MAIL                       −2.03  (−2.22, −1.83)              −1.98  (−2.24, −1.73)
FIRST_FRMS                     −6.11  (−7.39, −4.65)              −5.50  (−7.00, −2.93)
σ2²                             7.74  (4.99, 11.47)                8.24  (5.37, 12.01)
ρ                               0.73  (0.25, 0.91)                 –
pD                            258.85                             172.31
DIC                           −7745                              −7735
Results for the binary components of the models are shown in the top portion of the table. Results
for the count components are shown in the bottom portion. CI denotes credible interval
RENTER_OCCP_HU, TEA_MAIL, and FIRST_FRMS offer a "protective effect" against the number of
deletes, but now OCCP_HU is associated with a greater number of deletes. Evidently,
the spatial process is not as smooth for the count component, since $\hat{\sigma}_2^2 = 7.74$
(95% CI: 4.99, 11.47).
The results obtained using our areal hurdle model could be valuable to the
Census Bureau. Using the most recent covariate values available (such as from
official government surveys or administrative records), predictions from our model
could help to characterize deletes, which indicate demographic change
or stability in an area. The spatial component of our model can assist in designing
efficient and cost-effective address updating operations. For example, it can inform
Census Bureau personnel as to clusters of block groups that are candidates for
updating in a non-decennial Census setting. Focusing on adjacent regions within
a cluster will be advantageous over assessing sets of block groups that might have
only “stable” block groups as neighbors. Such clustering will not always be captured
accurately by non-spatial zero-inflated models.
Our proposed methods for handling multivariate and zero-inflated areal data
offer improved regression inference while greatly reducing computing time and
storage requirements. Our simulation study illustrates the benefit of accounting for
dependence within areal units as well as among areal units. This is not surprising:
in general, multivariate data call for multivariate methods.
The count distribution used for our model is the Poisson, which requires the
assumption of equi-dispersion. Of course, the data could be heavily over- or under-
dispersed, in which case other distributions could be developed in our MSAMM
setup, such as the negative binomial or the Conway–Maxwell–Poisson distribution.
Both of these distributions have an additional parameter that characterizes the
dispersion, which could possibly depend on spatially varying covariates. These
different models would be novel, but would require additional numerical work to
demonstrate how well they improve the fits.
Application of our areal Poisson hurdle model to zero-inflated Census data
provided a superior fit relative to that provided by independent univariate models,
at no extra computational cost. Most importantly, our methodology provides a
compelling framework for understanding dynamic features of the USA, which could
aid the planning of various Census Bureau operations. Moreover, our methodology
could be extended to handle additional data challenges faced by the Census Bureau.
For example, a spatiotemporal extension of our MSAMM could be useful for
analyzing data from historical databases being developed by the Census Bureau.
In such a model, time-dependent covariates could be viewed as driving the deletion
of housing units.
For the multivariate sparse areal mixed model (MSAMM), when the design matrices
are the same across multivariate outcomes, i.e., X1 = X2 = · · · = XJ , the first and
second stages can be written as
$$g_j\!\left(E\left[y_j \mid \beta_j, \delta_{sj}\right]\right) = X\beta_j + M\delta_{sj} \qquad (j = 1, \dots, J)$$
$$p(\Delta \mid \Sigma) = \mathcal{N}\!\left(0,\ \Sigma \otimes Q_s^{-1}\right),$$
where $\Delta = (\delta_{s1}', \dots, \delta_{sJ}')'$, each $\delta_{sj}$ is $q \times 1$, $\Sigma$ is the $J \times J$ covariance matrix, and
$Q_s$ is the $q \times q$ spatial precision matrix.
Computation can be eased considerably as follows. Let $R_s$ be the upper Cholesky
triangle of $Q_s$, and let $W_s = R_s^{-1}$ so that $W_sW_s' = Q_s^{-1}$. Then, for $\Psi = (\psi_{s1}', \dots, \psi_{sJ}')'$, where each $\psi_{sj}$ is $q \times 1$ and $\Psi \mid \Sigma \sim \mathcal{N}(0, \Sigma \otimes I_q)$, we have that
$(I_J \otimes W_s)\Psi$ and $\Delta$ have the same distribution conditional on $\Sigma$. This is easy to see
since $E\{(I_J \otimes W_s)\Psi\} = (I_J \otimes W_s)E(\Psi) = 0$ and
$$\operatorname{cov}\{(I_J \otimes W_s)\Psi\} = (I_J \otimes W_s)\left(\Sigma \otimes I_q\right)(I_J \otimes W_s)' = \Sigma \otimes Q_s^{-1}.$$
Hence, the model’s first and second stages can now be written as
% &
gj E y j | β j , ψ sj = Xβ j + MWs ψ sj (j = 1, . . . , J )
p ( | ) = N 0, ⊗ Iq .
When the design matrices differ across outcomes, the prior on the spatial effects has
precision $R'(\Sigma^{-1} \otimes I_q)R$, where $\Delta = (\delta_{s1}', \dots, \delta_{sJ}')'$, $R = \mathrm{bdiag}(R_{s1}, \dots, R_{sJ})$, and $R_{sj}'R_{sj} = Q_{sj}$, where
$R_{sj}$ is the upper Cholesky triangle of $Q_{sj}$. For ease of exposition, let $J = 2$ (the
following easily extends to the case when $J > 2$). The prior distribution of the
spatial effects can be written
$$\begin{pmatrix} \delta_{s1} \\ \delta_{s2} \end{pmatrix} \Bigg|\, \Sigma \sim \mathcal{N}\!\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \begin{pmatrix} R_{s1} & 0 \\ 0 & R_{s2} \end{pmatrix}^{-1}\!\left(\Sigma \otimes I_q\right)\!\left\{\begin{pmatrix} R_{s1} & 0 \\ 0 & R_{s2} \end{pmatrix}'\right\}^{-1}\right]$$
$$= \mathcal{N}\!\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \begin{pmatrix} W_{s1} & 0 \\ 0 & W_{s2} \end{pmatrix}\!\left(\Sigma \otimes I_q\right)\!\begin{pmatrix} W_{s1} & 0 \\ 0 & W_{s2} \end{pmatrix}'\right],$$
where $W_{sj} = R_{sj}^{-1}$ $(j = 1, 2)$, and we have used the fact that $(R_{sj}')^{-1} = (R_{sj}^{-1})'$.
Now, suppose we have
$$\begin{pmatrix} \psi_{s1} \\ \psi_{s2} \end{pmatrix} \Bigg|\, \Sigma \sim \mathcal{N}\!\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \Sigma \otimes I_q\right].$$
Then, since
$$\begin{pmatrix} W_{s1} & 0 \\ 0 & W_{s2} \end{pmatrix}\begin{pmatrix} \psi_{s1} \\ \psi_{s2} \end{pmatrix} = \begin{pmatrix} W_{s1}\psi_{s1} \\ W_{s2}\psi_{s2} \end{pmatrix},$$
we can apply a reparameterization similar to the case where the design matrices are
equivalent across the outcomes. Thus we can specify the first and second stages of
the model as
$$g_j\!\left(E\left[y_j \mid \beta_j, \psi_{sj}\right]\right) = X_j\beta_j + M_jW_{sj}\psi_{sj}$$
$$p(\Psi \mid \Sigma) = \mathcal{N}\!\left(0,\ \Sigma \otimes I_q\right).$$
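A small numerical check of this reparameterization can be written directly in R. The sketch below uses toy dimensions and an arbitrary symmetric positive-definite precision Qs.

set.seed(1)
q <- 3; J <- 2
Qs <- crossprod(matrix(rnorm(q^2), q)) + diag(q)   # toy SPD spatial precision
Rs <- chol(Qs)                                     # upper Cholesky: t(Rs) %*% Rs = Qs
Ws <- solve(Rs)                                    # so Ws %*% t(Ws) = solve(Qs)
Sigma <- matrix(c(1, 0.5, 0.5, 2), 2)
# Psi ~ N(0, Sigma (x) I_q); then Delta = (I_J (x) Ws) Psi ~ N(0, Sigma (x) Qs^{-1})
Psi <- t(chol(Sigma %x% diag(q))) %*% rnorm(J * q)
Delta <- (diag(J) %x% Ws) %*% Psi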
References
Agarwal, D.K., Gelfand, A.E., Citron-Pousty, S.: Zero-inflated models with application to spatial
count data. Environ. Ecol. Stat. 9(4), 341–355 (2002)
Alfó, M., Nieddu, L., Vicari, D.: Finite mixture models for mapping spatially dependent disease
counts. Biom. J. 51(1), 84–97 (2009). http://dx.doi.org/10.1002/bimj.200810494
Assunção, R., Krainski, E.: Neighborhood dependence in Bayesian spatial models. Biom. J. 51(5),
851–869 (2009)
Barnard, J., McCulloch, R., Meng, X.L.: Modeling covariance matrices in terms of standard
deviations and correlations, with application to shrinkage. Stat. Sin. 10(4), 1281–1312 (2000)
Besag, J., Kooperberg, C.: On conditional and intrinsic autoregression. Biometrika 82(4), 733–746
(1995)
Boots, B., Tiefelsdorf, M.: Global and local spatial autocorrelation in bounded regular tessellations.
J. Geogr. Syst. 2(4), 319 (2000)
Boucher, J.P., Denuit, M., Guillen, M.: Number of accidents or number of claims? An approach
with zero-inflated Poisson models for panel data. J. Risk Insur. 76(4), 821–846 (2009)
Bradley, J.R., Holan, S.H., Wikle, C.K.: Multivariate spatio-temporal models for high-dimensional
areal data with application to longitudinal employer-household dynamics. Ann. Appl. Stat. 9(4),
1761–1791 (2015)
Burnham, K.P., Anderson, D.R., Huyvaert, K.P.: AIC model selection and multimodel inference in
behavioral ecology: some background, observations, and comparisons. Behav. Ecol. Sociobiol.
65(1), 23–35 (2011)
Carlin, B.P., Banerjee, S.: Hierarchical multivariate CAR models for spatio-temporally correlated
survival data (with discussion). In: Bayarri, M., Berger, J., Bernardo, J., Dawid, A., Heckerman,
D., Smith, A., West, M. (eds.), Bayesian Statistics 7, pp. 45–63. Oxford University Press, New
York (2003)
Clayton, D., Bernardinelli, L., Montomoli, C.: Spatial correlation in ecological analysis. Int. J.
Epidemiol. 22(6), 1193–1202 (1993)
Cohen, A.C.: Estimating the parameter in a conditional Poisson distribution. Biometrics 16(2),
203–211 (1960)
Cook, T., Norwood, J., Cork, D., Panel to Review the 2010 Census, Committee on National
Statistics, Division of Behavioral and Social Sciences and Education, National Research
Council: Change and the 2020 Census: Not Whether But How. National Academies Press,
Washington, D.C. (2011)
Donoho, D.L., Elad, M.: Optimally sparse representation in general (nonorthogonal) dictionaries
via ℓ1 minimization. Proc. Natl. Acad. Sci. 100(5), 2197–2202 (2003)
Eddelbuettel, D., Francois, R.: Rcpp: Seamless R and C++ integration. J. Stat. Softw. 40(8), 1–18
(2011)
Eddelbuettel, D., Sanderson, C.: RcppArmadillo: Accelerating R with high-performance C++
linear algebra. Comput. Stat. Data Anal. 71, 1054–1063 (2014)
Flegal, J.M., Haran, M., Jones, G.L.: Markov chain Monte Carlo: can we trust the third significant
figure? Stat. Sci. 23(2), 250–260 (2008)
Gelfand, A.E., Vounatsou, P.: Proper multivariate conditional autoregressive models for spatial data
analysis. Biostatistics 4(1), 11–15 (2003)
Green, P.J., Richardson, S.: Hidden Markov models and disease mapping. J. Am. Stat. Assoc.
97(460), 1055–1070 (2002). https://doi.org/10.1198/016214502388618870
Griffith, D.A.: Spatial Autocorrelation and Spatial Filtering: Gaining Understanding Through
Theory and Scientific Visualization. Springer, Berlin (2003)
Haran, M., Hughes, J.: batchmeans: consistent batch means estimation of Monte Carlo standard
errors. Denver (2016)
Haran, M., Hodges, J., Carlin, B.: Accelerating computation in Markov random field models for
spatial data via structured MCMC. J. Comput. Graph. Stat. 12(2), 249–264 (2003)
Haran, M., Tierney, L.: On automating Markov chain Monte Carlo for a class of spatial models.
Preprint (2012). arXiv:1205.0499
Hodges, J., Reich, B.: Adding spatially-correlated errors can mess up the fixed effect you love.
Am. Stat. 64(4), 325–334 (2010)
Huang, A., Wand, M.: Simple marginally noninformative prior distributions for covariance
matrices. Bayesian Anal. 8(2), 439–452 (2013)
Hughes, J., Haran, M.: Dimension reduction and alleviation of confounding for spatial generalized
linear mixed models. J. R. Stat. Soc. Ser. B Stat. Methodol. 75(1), 139–159 (2013)
Ihaka, R., Gentleman, R.: R: a language for data analysis and graphics. J. Comput. Graph. Stat. 5,
299–314 (1996)
Jin, X., Carlin, B.P., Banerjee, S.: Generalized hierarchical multivariate CAR models for areal data.
Biometrics 61(4), 950–961 (2005)
Knorr-Held, L., Rue, H.: On block updating in Markov random field models for disease mapping.
Scand. J. Stat. 29(4), 597–614 (2002)
Lambert, D.: Zero-inflated Poisson regression, with an application to defects in manufacturing.
Technometrics 34(1), 1–14 (1992)
Leroux, B.G., Lei, X., Breslow, N.: Estimation of disease rates in small areas: a new mixed model
for spatial dependence. Inst. Math. Appl. 116, 179–191 (2000)
Lewandowski, D., Kurowicka, D., Joe, H.: Generating random correlation matrices based on vines
and extended onion method. J. Multivar. Anal. 100(9), 1989–2001 (2009)
Martinez-Beneito, M.A.: A general modelling framework for multivariate disease mapping.
Biometrika 100(3), 539–553 (2013)
Moran, P.: Notes on continuous stochastic phenomena. Biometrika 37(1/2), 17–23 (1950)
Neelon, B., Ghosh, P., Loebs, P.F.: A spatial Poisson hurdle model for exploring geographic
variation in emergency department visits. J. R. Stat. Soc. Ser. A Stat. Soc. 176(2), 389–413
(2013)
Neelon, B., Zhu, L., Neelon, S.E.B.: Bayesian two-part spatial models for semicontinuous data
with application to emergency department expenditures. Biostatistics 16(3), 465–479 (2015)
Qiu, Y.: Spectra: sparse eigenvalue computation toolkit as a redesigned ARPACK. http://spectralib.
org (2017)
Rathbun, S.L., Fei, S.: A spatial zero-inflated Poisson regression model for oak regeneration.
Environ. Ecol. Stat. 13(4):409–426 (2006)
Recta, V., Haran, M., Rosenberger, J.L.: A two-stage model for incidence and prevalence in point-
level spatial count data. Environmetrics 23(2), 162–174 (2012)
Reich, B., Hodges, J., Zadnik, V.: Effects of residual smoothing on the posterior of the fixed effects
in disease-mapping models. Biometrics 62(4), 1197–1206 (2006)
Sanderson, C.: Armadillo: an open source C++ linear algebra library for fast prototyping and
computationally intensive experiments. Technical Report; NICTA (2010)
Singh, J.: A characterization of positive Poisson distribution and its statistical application. SIAM
J. Appl. Math. 34(3), 545–548 (1978)
Spiegelhalter, D.J., Best, N.G., Carlin, B.P., Van Der Linde, A.: Bayesian measures of model
complexity and fit. J. R. Stat. Soc. Ser. B Stat. Methodol. 64(4), 583–639 (2002)
Stroustrup, B.: The C++ Programming Language. Pearson Education, New Jersey (2013)
Tiefelsdorf, M., Griffith, D.A.: Semiparametric filtering of spatial autocorrelation: the eigenvector
approach. Environ. Plan. A 39(5), 1193 (2007)
Torabi, M.: Hierarchical multivariate mixture generalized linear models for the analysis of spatial
data: an application to disease mapping. Biom. J. 58(5), 1138–1150 (2016)
U.S. Census Bureau: 2020 Census operational plan: a new design for the 21st century (2015)
Ver Hoef, J.M., Jansen, J.K.: Space-time zero-inflated count models of harbor seals. Environ-
metrics 18(7), 697–712 (2007)
Wall, M.: A close look at the spatial structure implied by the CAR and SAR models. J. Stat. Plan.
Inference 121(2), 311–324 (2004)
Wikle, C.K., Anderson, C.J.: Climatological analysis of tornado report counts using a hierarchical
Bayesian spatiotemporal model. J. Geophys. Res. Atmos. (1984–2012) 108(D24), 1–15 (2003).
https://doi.org/10.1029/2002JD002806
Young, D.S., Raim, A.M., Johnson, N.R.: Zero-inflated modelling for characterizing coverage
errors of extracts from the US Census Bureau’s Master Address File. J. R. Stat. Soc. Ser. A
Stat. Soc. 180(1), 73–97 (2017)
Wavelet Kernels for Support Matrix
Machines
Edgard M. Maboudou-Tchao
1 Introduction
E. M. Maboudou-Tchao ()
Department of Statistics, University of Central Florida, Orlando, FL, USA
e-mail: edgard.maboudou@ucf.edu
The standard support vector machine (SVM) aims at finding the optimal hyperplane
that maximizes the margin between two classes. SVMs are solved using
quadratic programming methods and are easily extended to accept matrices as
input. For the two-class classification problem, let the training set be D =
((X1, y1), . . . , (XN, yN)) ∈ (X × {−1, 1})^N, where N is the number of matrices
and X ⊆ R^n ⊗ R^p is an original matrix input space; R^n and R^p are two vector
spaces. The support matrix machine (SMM) consists in solving the following primal problem:
$$
\begin{aligned}
\min_{\mathbf{W},\,b,\,\xi}\quad & \frac{1}{2}\operatorname{tr}(\mathbf{W}'\mathbf{W}) + C\sum_{j=1}^{N}\xi_j,\\
\text{subject to}\quad & y_j\bigl(\operatorname{tr}(\mathbf{W}'\varphi(\mathbf{X}_j)) + b\bigr) \ge 1 - \xi_j,\quad j = 1, 2, \dots, N,\\
& \xi_j \ge 0,\quad j = 1, 2, \dots, N,
\end{aligned}
\tag{1}
$$

where $\mathbf{W} \in \mathbb{R}^{n\times p}$ is the matrix of regression coefficients, $\operatorname{tr}(\cdot)$ is the trace operator,
ξj are the slack variables, the parameter C > 0 is introduced to control the influence
of the slack variables, and ϕ is a function mapping data to a higher dimensional
Hilbert space.
To construct the dual of (1), write the objective and constraint functions as

$$f_0(\mathbf{W}, b, \xi) = \frac{1}{2}\operatorname{tr}(\mathbf{W}'\mathbf{W}) + C\sum_{j=1}^{N}\xi_j$$

and

$$f_j(\mathbf{W}, b, \xi) = 1 - \xi_j - y_j\bigl(\operatorname{tr}(\mathbf{W}'\varphi(\mathbf{X}_j)) + b\bigr) \le 0,\quad j = 1, \dots, N.$$

The Lagrangian is

$$
\begin{aligned}
L(\mathbf{W}, b, \alpha, \xi) &= \frac{1}{2}\operatorname{tr}(\mathbf{W}'\mathbf{W}) + C\sum_{j=1}^{N}\xi_j - \sum_{j=1}^{N}\alpha_j\Bigl[y_j\bigl(\operatorname{tr}(\mathbf{W}'\varphi(\mathbf{X}_j)) + b\bigr) - 1 + \xi_j\Bigr] - \sum_{j=1}^{N}\gamma_j\xi_j\\
&= \frac{1}{2}\operatorname{tr}(\mathbf{W}'\mathbf{W}) + \sum_{j=1}^{N}\alpha_j\Bigl[1 - y_j\bigl(\operatorname{tr}(\mathbf{W}'\varphi(\mathbf{X}_j)) + b\bigr)\Bigr] + \sum_{j=1}^{N}\xi_j\bigl(C - \alpha_j - \gamma_j\bigr),
\end{aligned}
\tag{2}
$$
where αj and γj are positive Lagrange multipliers. To construct the dual problem,
we need to determine the optimal W, ξ , and b in terms of the dual variables. We
achieve this by differentiating the Lagrangian with respect to the primal variables.
$$\frac{\partial L}{\partial \mathbf{W}} = 0 \implies \mathbf{W} - \sum_{j=1}^{N}\alpha_j y_j \varphi(\mathbf{X}_j) = 0,$$

$$\frac{\partial L}{\partial b} = 0 \implies \sum_{j=1}^{N}\alpha_j y_j = 0,$$

$$\frac{\partial L}{\partial \xi_j} = 0 \implies C - \alpha_j - \gamma_j = 0,$$

together with the feasibility and complementary slackness conditions $\xi_j \ge 0$, $\alpha_j \ge 0$, $\gamma_j \ge 0$, and $\gamma_j\xi_j = 0$.
Now, from the first equation of the complementary slackness condition, the
objects for which αj = 0 are not on the margin and do not impact the value of W.
On the other hand, the objects for which αj > 0 do impact the value of W. These
matrices Xj corresponding to αj > 0 are the support matrices. The support matrices
that correspond to matrices located on the decision boundary, with 0 < αj < C,
are the margin support matrices. The other support matrices, with αj = C, are the
non-margin support matrices.
The next step is to maximize the dual problem. Plugging $\mathbf{W}$ into the Lagrangian $L$ and taking into account that $\gamma_j = C - \alpha_j$, the dual problem becomes
$$
\begin{aligned}
\max_{\alpha}\quad & \sum_{j=1}^{N}\alpha_j - \frac{1}{2}\sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j \operatorname{tr}\bigl(\varphi(\mathbf{X}_i)'\varphi(\mathbf{X}_j)\bigr),\\
\text{subject to}\quad & 0 \le \alpha_j \le C,\quad j = 1, \dots, N,\\
& \sum_{j=1}^{N}\alpha_j y_j = 0.
\end{aligned}
$$
Now, if we let K(X, Y) represent the inner product tr(ϕ(X) ϕ(Y)) in a higher
dimensional space, the dual problem becomes
$$
\begin{aligned}
\max_{\alpha}\quad & \sum_{j=1}^{N}\alpha_j - \frac{1}{2}\sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j K(\mathbf{X}_i, \mathbf{X}_j),\\
\text{subject to}\quad & 0 \le \alpha_j \le C,\quad j = 1, \dots, N,\\
& \sum_{j=1}^{N}\alpha_j y_j = 0.
\end{aligned}
$$

From the stationarity condition on $\mathbf{W}$, the optimal weight matrix is

$$\mathbf{W}^{*} = \sum_{j=1}^{N}\alpha_j^{*} y_j \varphi(\mathbf{X}_j). \tag{5}$$
The next step is to evaluate the offset $b$. It can be found by using any support matrix $\mathbf{X}_j$ and the complementary slackness condition. Alternatively, a numerically more stable estimate averages over the set $S$ of all $N_s$ support matrices:

$$b^{*} = \frac{1}{N_s}\sum_{i\in S}\Bigl(y_i - \sum_{j\in S}\alpha_j y_j K(\mathbf{X}_i, \mathbf{X}_j)\Bigr). \tag{6}$$
A new observation $\mathbf{X}_0$ is then classified as

$$y_0 = \operatorname{sgn}\Bigl(\sum_{i=1}^{N}\alpha_i y_i \operatorname{tr}\bigl(\varphi(\mathbf{X}_i)'\varphi(\mathbf{X}_0)\bigr) + b\Bigr) = \operatorname{sgn}\Bigl(\sum_{i=1}^{N}\alpha_i y_i K(\mathbf{X}_i, \mathbf{X}_0) + b\Bigr). \tag{7}$$
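The classification rule (7) depends on the training data only through kernel evaluations against the support matrices. The following minimal Python sketch illustrates (7) with the plain trace inner product as the kernel; the dual coefficients used here are made-up placeholders, since in practice they come from solving the dual quadratic program above.

```python
import numpy as np

def trace_kernel(X, Y):
    """Linear matrix kernel: the trace inner product tr(X'Y)."""
    return float(np.trace(X.T @ Y))

def smm_classify(X0, support_matrices, labels, alphas, b, kernel=trace_kernel):
    """Classification rule (7): sign of the kernel expansion plus the offset b."""
    score = sum(a * y * kernel(Xi, X0)
                for a, y, Xi in zip(alphas, labels, support_matrices))
    return int(np.sign(score + b))

# Toy usage with placeholder dual coefficients; in practice alphas and b
# come from solving the dual quadratic program derived above.
rng = np.random.default_rng(0)
Xs = [rng.standard_normal((4, 3)) for _ in range(5)]
ys = [1, -1, 1, -1, 1]
alphas = [0.5, 0.5, 0.2, 0.2, 0.0]
print(smm_classify(rng.standard_normal((4, 3)), Xs, ys, alphas, b=0.1))
```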
data is unknown, and the ε-insensitive loss function. The first three loss functions do not produce sparseness in the support vectors. To address that issue, Vapnik (1995) suggested the ε-insensitive loss function as an approximation to Huber's loss function that enables a sparse set of support vectors to be obtained. The ε-insensitive loss function is defined as

$$L_{\varepsilon}(y) = \begin{cases} 0 & \text{if } |f(x) - y| < \varepsilon,\\ |f(x) - y| - \varepsilon & \text{otherwise.} \end{cases} \tag{8}$$
$$
\begin{aligned}
\min_{\mathbf{W},\,b,\,\xi^{+},\,\xi^{-}}\quad & \frac{1}{2}\operatorname{tr}(\mathbf{W}'\mathbf{W}) + C\sum_{j=1}^{N}\bigl(\xi_j^{+} + \xi_j^{-}\bigr),\\
\text{subject to}\quad & y_j - \bigl(\operatorname{tr}(\mathbf{W}'\varphi(\mathbf{X}_j)) + b\bigr) \le \varepsilon + \xi_j^{+},\\
& \bigl(\operatorname{tr}(\mathbf{W}'\varphi(\mathbf{X}_j)) + b\bigr) - y_j \le \varepsilon + \xi_j^{-},\\
& \xi_j^{+},\,\xi_j^{-} \ge 0,\quad j = 1, 2, \dots, N,
\end{aligned}
\tag{10}
$$

where $C > 0$ is a pre-specified value that determines the trade-off between the flatness of $f$ and the amount up to which deviations larger than $\varepsilon$ are tolerated, and $\xi_j^{+}, \xi_j^{-}$ are slack variables representing upper and lower constraints on the outputs of the system.
Using similar arguments to the previous section, Slater’s conditions are met and thus
strong duality holds. It follows that the duality gap is zero and the optimal values
of the primal and dual problems are equal. Consequently, it ensures that the original
problem (primal) can be solved through the Lagrange dual problem, which is usually
easier to solve than the primal. To solve the primal problem (10), we construct the
Lagrangian by using Lagrange multipliers. The Lagrangian is
$$
\begin{aligned}
L_p ={}& \frac{1}{2}\operatorname{tr}(\mathbf{W}'\mathbf{W}) + C\sum_{j=1}^{N}\bigl(\xi_j^{+} + \xi_j^{-}\bigr) - \sum_{j=1}^{N}\alpha_j^{+}\Bigl[\varepsilon + \xi_j^{+} - y_j + \operatorname{tr}(\mathbf{W}'\varphi(\mathbf{X}_j)) + b\Bigr]\\
&- \sum_{j=1}^{N}\alpha_j^{-}\Bigl[\varepsilon + \xi_j^{-} + y_j - \operatorname{tr}(\mathbf{W}'\varphi(\mathbf{X}_j)) - b\Bigr] - \sum_{j=1}^{N}\bigl(\eta_j^{+}\xi_j^{+} + \eta_j^{-}\xi_j^{-}\bigr).
\end{aligned}
\tag{11}
$$
It follows from the saddle point condition that the partial derivatives of Lp with
respect to the primal variables (W, b, ξj+ , ξj− ) have to vanish for optimality.
$$\frac{\partial L_p}{\partial \mathbf{W}} = 0 \implies \mathbf{W} = \sum_{j=1}^{N}\bigl(\alpha_j^{+} - \alpha_j^{-}\bigr)\varphi(\mathbf{X}_j), \tag{12}$$

$$\frac{\partial L_p}{\partial b} = 0 \implies \sum_{j=1}^{N}\bigl(\alpha_j^{+} - \alpha_j^{-}\bigr) = 0, \tag{13}$$

$$\frac{\partial L_p}{\partial \xi_j^{+}} = 0 \implies C - \alpha_j^{+} - \eta_j^{+} = 0, \tag{14}$$

$$\frac{\partial L_p}{\partial \xi_j^{-}} = 0 \implies C - \alpha_j^{-} - \eta_j^{-} = 0. \tag{15}$$
Substituting (12)–(15) into (11) yields the dual problem

$$
\begin{aligned}
\max_{\alpha^{+},\,\alpha^{-}}\quad & -\frac{1}{2}\sum_{i,j=1}^{N}\bigl(\alpha_i^{+} - \alpha_i^{-}\bigr)\bigl(\alpha_j^{+} - \alpha_j^{-}\bigr)\operatorname{tr}\bigl(\varphi(\mathbf{X}_i)'\varphi(\mathbf{X}_j)\bigr)\\
& + \sum_{j=1}^{N}\alpha_j^{+}\bigl(y_j - \varepsilon\bigr) - \sum_{j=1}^{N}\alpha_j^{-}\bigl(y_j + \varepsilon\bigr),\\
\text{subject to}\quad & 0 \le \alpha_j^{+},\,\alpha_j^{-} \le C\quad \forall j,\\
& \sum_{j=1}^{N}\bigl(\alpha_j^{+} - \alpha_j^{-}\bigr) = 0.
\end{aligned}
\tag{16}
$$
The kernel function used in the SMM methods proposed by Luo et al. (2015) and Zheng et al. (2017) is not clearly specified. We propose one alternative choice for
the kernel function. The optimization problem (16) only involves the patterns ϕ(X)
through the computation of inner products in feature space. There is no need to
compute the features ϕ(X) when one knows how to compute the dot products
directly. Instead of actually mapping each instance to a higher dimensional space
using a mapping function ϕ, Boser et al. (1992) propose to directly choose a
kernel function K(X, Y) that represents an inner product tr(ϕ(X) ϕ(Y)) in some
unspecified high dimensional space.
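As a concrete illustration of this point, a kernel machine on matrix data can be trained from a precomputed Gram matrix of trace inner products, without ever forming ϕ(X) explicitly. The following is only a sketch of the idea using scikit-learn's generic SVM solver on synthetic data, not the specific SMM solvers of Luo et al. (2015) or Zheng et al. (2017).

```python
import numpy as np
from sklearn.svm import SVC

def gram_matrix(A_list, B_list, kernel):
    """Gram matrix with entries K[i, l] = kernel(A_list[i], B_list[l])."""
    return np.array([[kernel(A, B) for B in B_list] for A in A_list])

trace_kernel = lambda X, Y: float(np.trace(X.T @ Y))  # tr(X'Y)

rng = np.random.default_rng(1)
train = [rng.standard_normal((8, 5)) for _ in range(20)]
y_train = np.array([1] * 10 + [-1] * 10)
test = [rng.standard_normal((8, 5)) for _ in range(4)]

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(gram_matrix(train, train, trace_kernel), y_train)   # N x N Gram matrix
print(clf.predict(gram_matrix(test, train, trace_kernel)))  # rows: test, cols: train
```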
The key idea of the kernel technique, or the so-called kernel trick, is to invert the
chain of arguments, i.e., choose a kernel K rather than a mapping before applying
a learning algorithm. It is clear that not every symmetric function $K$ can serve as a kernel. Necessary and sufficient conditions for $K : \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ to be a kernel are given by Mercer's theorem.
Theorem 4.1 (Mercer’s Theorem) Suppose K is a symmetric function such that
the integral operator
!
(TK f )(.) = K(., x)f (x) dx
X
Then
1. λi ∈ 1 , i ∈ N
2. ψi ∈ L∞ (X )
3. K can be expanded in a uniformly convergent series, i.e.,
∞
Mercer’s theorem not only gives necessary and sufficient conditions for K to be
a kernel, but also suggests a constructive way of obtaining features φi from a given
kernel K.
Theorem 4.2 (Mercer Kernels) The function $K : \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is a Mercer kernel if, and only if, for each $l \in \mathbb{N}$ and $X = (x_1, \dots, x_l) \in \mathcal{X}^l$, the $l\times l$ matrix $\mathbf{K} = \bigl(K(x_i, x_j)\bigr)_{i,j=1}^{l}$ is positive semidefinite.
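Theorem 4.2 is easy to probe numerically: on any finite sample of points, the Gram matrix of a Mercer kernel must be symmetric positive semidefinite. A minimal sketch of such a check is given below; it verifies a necessary condition on one sample and is not a proof of admissibility.

```python
import numpy as np

def gram_is_psd(points, kernel, tol=1e-10):
    """Numerical check of Theorem 4.2 on one finite sample: the Gram matrix
    of a Mercer kernel must be symmetric positive semidefinite."""
    K = np.array([[kernel(a, b) for b in points] for a in points])
    symmetric = np.allclose(K, K.T)
    return symmetric and np.linalg.eigvalsh(K).min() >= -tol

rng = np.random.default_rng(2)
pts = [rng.standard_normal(3) for _ in range(10)]
gaussian = lambda x, y: float(np.exp(-np.sum((x - y) ** 2)))
print(gram_is_psd(pts, gaussian))  # True: the Gaussian kernel is a Mercer kernel
```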
Theorem 4.3 (Mercer Condition, Mercer 1909) The symmetric function $K(x, y)$ is a valid kernel function if and only if, for every function $f \ne 0$ satisfying $\int_{\mathcal{X}} f^2(x)\, dx < \infty$,

$$\int_{\mathcal{X}}\int_{\mathcal{X}} K(y, x) f(x) f(y)\, dx\, dy \ge 0. \tag{28}$$
If we can find a Mercer kernel $K$ that computes an inner product in the feature space $F$ we are interested in, we can use the kernel evaluations $K(\mathbf{X}, \mathbf{Y})$ to replace the inner products $\operatorname{tr}\bigl(\varphi(\mathbf{X})'\varphi(\mathbf{Y})\bigr)$ in the LS-SMM algorithm. Note that obtaining a Mercer kernel in a matrix space is not as easy as in a vector space.
We will construct wavelet kernels that are admissible support matrix kernels, i.e., that satisfy the Mercer conditions. The support matrix kernel function can be described as the inner product of two matrices, $K(\mathbf{X}, \mathbf{Y}) = \operatorname{tr}\bigl(\varphi(\mathbf{X})'\varphi(\mathbf{Y})\bigr)$. Instead of working with matrices, we will work with vectors. The support vector kernel function can also be described by translation-invariant kernels of the form $K(x, y) = K(x - y)$ (Burges 1999). A function is an admissible support vector kernel function if it satisfies the Mercer condition (Theorem 4.3). However, it is very challenging to decompose translation-invariant functions as the product of two functions and then prove that they satisfy the Mercer condition. So the next result gives us necessary and sufficient conditions for translation-invariant kernels to be admissible support vector kernels.
Theorem 4.4 (Smola et al. 1998; Burges 1999) The translation-invariant kernel function $k(x - y)$ is an admissible support vector kernel function if and only if the Fourier transform of $k(x)$ satisfies

$$F[k](\omega) = (2\pi)^{-m/2}\int_{\mathcal{X}^m}\exp\bigl(-j\,(\omega' x)\bigr)\, k(x)\, dx \ge 0. \tag{29}$$
$$\Psi(\mathbf{X}) = \prod_{i=1}^{n}\prod_{j=1}^{p}\psi(x_{ij}) \tag{31}$$

where $\mathbf{X} = \bigl(x_{ij}\bigr)_{i,j} \in \mathbb{R}^n \otimes \mathbb{R}^p$ is a matrix with entries $x_{ij}$.
We can build an admissible kernel function for matrices as follows.

Theorem 4.5 Let $\psi$ be a base wavelet or mother wavelet, let $a \ge 0$ be the dilation, and let $c \in \mathbb{R}$ be the translation. If $\mathbf{X}, \mathbf{Y} \in \mathbb{R}^n \otimes \mathbb{R}^p$ are two matrices, then the wavelet kernels for matrices are

$$K(\mathbf{X}, \mathbf{Y}) = \prod_{i=1}^{n}\prod_{j=1}^{p}\psi\!\left(\frac{x_{ij} - c}{a}\right)\psi\!\left(\frac{y_{ij} - c}{a}\right). \tag{32}$$
Proof We just need to show that the wavelet kernels for matrices satisfy the Mercer condition, i.e., are admissible support matrix kernels.
First, let $\operatorname{vec}(\mathbf{X}) = [x_{11}, \dots, x_{np}]'$ and $\operatorname{vec}(\mathbf{Y}) = [y_{11}, \dots, y_{np}]'$, and set $x = [x_1, \dots, x_N]' = \operatorname{vec}(\mathbf{X})$ and $y = [y_1, \dots, y_N]' = \operatorname{vec}(\mathbf{Y})$, where $N = n\times p$. Then

$$K(\mathbf{X}, \mathbf{Y}) = \prod_{i=1}^{n}\prod_{j=1}^{p}\psi\!\left(\frac{x_{ij} - c}{a}\right)\psi\!\left(\frac{y_{ij} - c}{a}\right) = \prod_{i=1}^{N}\psi\!\left(\frac{x_i - c}{a}\right)\psi\!\left(\frac{y_i - c}{a}\right) = K(x, y).$$
Now, $\forall f \in L_2(\mathbb{R}^N)$,

$$\int_{\mathbb{R}^N}\!\!\int_{\mathbb{R}^N} K(x, y) f(x) f(y)\, dx\, dy = \int_{\mathbb{R}^N}\prod_{i=1}^{N}\psi\!\left(\frac{x_i - c}{a}\right) f(x)\, dx \int_{\mathbb{R}^N}\prod_{i=1}^{N}\psi\!\left(\frac{y_i - c}{a}\right) f(y)\, dy = \left(\int_{\mathbb{R}^N}\prod_{i=1}^{N}\psi\!\left(\frac{x_i - c}{a}\right) f(x)\, dx\right)^{2} \ge 0.$$
Therefore, $K(\mathbf{X}, \mathbf{Y})$ satisfies the Mercer condition and is an admissible support matrix kernel. Consequently, it follows that we can build translation-invariant kernels as follows:

$$K(\mathbf{X}, \mathbf{Y}) = \prod_{i=1}^{n}\prod_{j=1}^{p}\psi\!\left(\frac{x_{ij} - y_{ij}}{a}\right). \tag{33}$$
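Equation (33) is straightforward to compute for any choice of mother wavelet. A minimal Python sketch is given below, with the Mexican hat mother wavelet (introduced next as Eq. (34)) plugged in as one possible ψ; the matrix shapes and the dilation value are arbitrary illustrative choices.

```python
import numpy as np

def wavelet_matrix_kernel(X, Y, mother, a=1.0):
    """Translation-invariant wavelet kernel for matrices, Eq. (33):
    K(X, Y) = product over i, j of psi((x_ij - y_ij) / a)."""
    return float(np.prod(mother((X - Y) / a)))

# One possible mother wavelet: the Mexican hat of Eq. (34), given next.
mexican_hat = lambda x: (1.0 - x ** 2) * np.exp(-x ** 2 / 2.0)

rng = np.random.default_rng(3)
X, Y = rng.standard_normal((4, 3)), rng.standard_normal((4, 3))
print(wavelet_matrix_kernel(X, Y, mexican_hat, a=2.0))
```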
Now, we give an existing wavelet kernel function, the Mexican hat wavelet or Sombrero wavelet, that can be used to construct translation-invariant wavelet kernels. Note that the Mexican hat wavelet is sometimes called the Marr wavelet, and its mother wavelet is

$$\psi(x) = (1 - x^2)\exp\!\left(-\frac{x^2}{2}\right). \tag{34}$$
Figure 1 represents a 2-D plot of the Mexican Hat wavelet kernel function.
Theorem 4.6 Let $a \ge 0$. If $\mathbf{X}, \mathbf{Y} \in \mathbb{R}^n \otimes \mathbb{R}^p$ are two matrices, then the Mexican hat wavelet kernel function for matrices is defined as

$$K(\mathbf{X}, \mathbf{Y}) = \prod_{i=1}^{n}\prod_{j=1}^{p}\left(1 - \left(\frac{x_{ij} - y_{ij}}{a}\right)^{2}\right)\exp\!\left(-\frac{(x_{ij} - y_{ij})^2}{2a^2}\right). \tag{35}$$
Proof According to Theorem 4.4, one just needs to show that the Fourier transform of the Mexican hat wavelet kernel is nonnegative. Let $x = [x_1, \dots, x_N]' = \operatorname{vec}(\mathbf{X})$ with $N = n\times p$; then it follows that

$$K(x) = \prod_{i=1}^{N}\psi\!\left(\frac{x_i}{a}\right) = \prod_{i=1}^{N}\left(1 - \left(\frac{x_i}{a}\right)^{2}\right)\exp\!\left(-\frac{1}{2}\left(\frac{x_i}{a}\right)^{2}\right).$$

The Fourier integral $I = \int_{\mathbb{R}^N} K(x)\exp(-j\,\omega' x)\, dx$ factorizes over the coordinates; substituting $x_i \to a x_i$ and completing the square gives

$$I = a^{N}\exp\!\left(-\frac{1}{2}\sum_{i=1}^{N}a^2\omega_i^2\right)\prod_{i=1}^{N}\bigl\{G_1(\omega) - G_2(\omega)\bigr\},$$

where $G_1(\omega) = \int_{-\infty}^{\infty}\exp\bigl(-\frac{1}{2}(x_i + ja\omega_i)^2\bigr)\, dx_i$ and $G_2(\omega) = \int_{-\infty}^{\infty} x_i^2\exp\bigl(-\frac{1}{2}(x_i + ja\omega_i)^2\bigr)\, dx_i$.
Then

$$\prod_{i=1}^{N}\bigl\{G_1(\omega) - G_2(\omega)\bigr\} = (2\pi)^{N/2} a^{2N}\prod_{i=1}^{N}\omega_i^{2}, \tag{37}$$

and so

$$I = (2\pi)^{N/2} a^{3N}\exp\!\left(-\frac{1}{2}\sum_{i=1}^{N}a^2\omega_i^2\right)\prod_{i=1}^{N}\omega_i^{2} \ge 0. \tag{38}$$

Hence the Fourier transform is nonnegative and, by Theorem 4.4, the Mexican hat wavelet kernel is admissible.
Another wavelet kernel function that can be used is the Morlet wavelet kernel function. Its mother wavelet is defined as

$$\psi(x) = \cos(\omega_0 x)\exp\!\left(-\frac{x^2}{2}\right). \tag{40}$$
Fig. 2 2-D plot of the Morlet wavelet kernel function
Proof Using Theorem 4.4, we need to show that the Fourier transform of the Morlet wavelet kernel is nonnegative. Let $x = [x_1, \dots, x_N]' = \operatorname{vec}(\mathbf{X})$ with $N = n\times p$; then it follows that

$$K(x) = \prod_{i=1}^{N}\psi\!\left(\frac{x_i}{a}\right) = \prod_{i=1}^{N}\cos\!\left(\omega_0\frac{x_i}{a}\right)\exp\!\left(-\frac{1}{2}\left(\frac{x_i}{a}\right)^{2}\right).$$
$$= a^{N}\prod_{i=1}^{N}\frac{(2\pi)^{1/2}}{2}\left[\exp\!\left(-\frac{(\omega_0 + a\omega_i)^2}{2}\right) + \exp\!\left(-\frac{(\omega_0 - a\omega_i)^2}{2}\right)\right] \ge 0,$$

which is nonnegative, so the Morlet wavelet kernel is admissible as well.
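Both matrix kernels are simple elementwise products and can be implemented in a few lines. The sketch below assumes equal-shaped numpy arrays; the dilation a and the Morlet frequency ω0 are free parameters, and the default ω0 = 5 is only a common illustrative choice, not one prescribed by the chapter.

```python
import numpy as np

def mexican_hat_kernel(X, Y, a=1.0):
    """Mexican hat wavelet kernel for matrices, Eq. (35)."""
    d = (X - Y) / a
    return float(np.prod((1.0 - d ** 2) * np.exp(-(d ** 2) / 2.0)))

def morlet_kernel(X, Y, a=1.0, omega0=5.0):
    """Morlet wavelet kernel for matrices, obtained by plugging the mother
    wavelet of Eq. (40) into the construction of Eq. (33)."""
    d = (X - Y) / a
    return float(np.prod(np.cos(omega0 * d) * np.exp(-(d ** 2) / 2.0)))

rng = np.random.default_rng(4)
X, Y = rng.standard_normal((3, 2)), rng.standard_normal((3, 2))
print(mexican_hat_kernel(X, Y, a=2.0), morlet_kernel(X, Y, a=2.0))
```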
Fig. 3 3-D plot of wavelet kernels. (left) Mexican Hat wavelet kernel, (right) Morlet wavelet
kernel
5 Applications
We illustrate the use of the two wavelet kernel functions on two datasets. We apply
WSMM to EEG and image classification problems. The EEG alcoholism dataset
is concerned with the relationship between genetic predisposition and tendency
for alcoholism. The study involved two groups of subjects: an alcoholic and a
control group. Each subject was exposed to a stimulus while voltage values were
measured from 64 channels of electrodes placed on the subject’s scalp for 256 time
points. So each subject has measurements of electrical scalp activity, which form a
256 × 64 matrix. There are 77 subjects from the alcoholic group and 45 subjects
from the control group. In our application, we used 10 subjects from each of the alcoholic and control groups. The performance comparison was assessed in terms
of the classification accuracy. Both the Mexican Hat wavelet kernel function for
matrices (Eq. (35)) and Morlet wavelet kernel function for matrices (Eq. (41)) yield
an accuracy of 95%.
The second dataset used is the INRIA person dataset. This dataset was created for detecting whether or not people are present in an image. Each color image is converted
into a 160 × 96 gray level image and the pixel values are used as an input matrix
without any advanced feature extraction technique. We use a small subset of the
dataset. The training set has 60 positives and 30 negatives for a total of 90 matrices.
The test set consists of 55 positives and 25 negatives for a total of 80 matrices. The
Mexican Hat wavelet kernel function for matrices (Eq. (35)) gives a classification
accuracy of 87.5% while Morlet wavelet kernel function for matrices, (Eq. (41))
yields a classification accuracy of 88.5%.
6 Conclusion
This article proposed new kernel functions for support matrix machines: the Mexican hat and Morlet wavelet kernel functions. These kernel functions were used to map matrices from the low dimensional matrix space to some high dimensional space. We established that these kernels are valid, or admissible, kernels. The method was successfully applied to EEG and INRIA image classification with good performance.
References
Boser, B.E., Guyon, I., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In:
Proceedings of the Fifth Annual Workshop of Computational Learning Theory, vol. 5, pp. 144–
152. ACM, Pittsburgh (1992)
Burges, C.J.C.: Geometry and invariance in kernel based methods. In: Scholkopf, B., Burges,
C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods–Support Vector Learning, pp. 89–116.
MIT, Cambridge (1999)
Cortes, C., Vapnik, V.: Support vector networks. Mach. Learn. 20, 273–297 (1995)
Luo, L., Xie, Y., Zhang, Z., Li, W.-J.: Support matrix machines. In: Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 928–947 (2015)
Maboudou-Tchao, E.M.: Kernel methods for changes detection in covariance matrices. Commun.
Stat. Simul. Comput. (2017). http://dx.doi.org/10.1080/03610918.2017.1322701
Mercer, J.: Functions of positive and negative type and their connection with the theory of integral
equations. Philos. Trans. R. Soc. Lond. A 209, 415–446 (1909)
Shi, W., Zhang, D.: Support matrix machine for large-scale data set. In: International Conference
on Information Engineering and Computer Science, 2009. ICIECS 2009. 20, pp. 1191–1199
(2009)
Smola, A.J.: Regression estimation with support vector learning machines. Master’s thesis,
Technische Universitat Munchen (1996)
Smola, A., Scholkopf, B., Muller, K.-R.: The connection between regularization operators and
support vector kernels. Neural Netw. 11, 637–649 (1998)
Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995). ISBN 0-387-
94559-8
Wu, F., Zhao, Y.: Least squares support vector machine on Morlet wavelet kernel function and its
application to nonlinear system identification. Inf. Technol. J. 5(3), 439–444 (2006)
Xia, W., Fan, L.: Least squares support matrix machines based on bilevel programming. Int. J.
Appl. Math. Mach. Learn. 4(1), 1–18 (2016)
Zhang, Q., Benveniste, A.: Wavelet networks. IEEE Trans. Neural Netw. 3(6), 889–898 (1992)
Zhang, L., Zhou, W., Jiao, L.: Wavelet support vector machine. IEEE Trans. Syst. Man Cybern. B
Cybern. 34, 34–39 (2004)
Zheng, Q., Zhu, F., Qin, J., Chen, B., Heng, P.H.: Sparse support matrix machine. Pattern Recogn.
76, 1–12 (2017)
Properties of the Number of Iterations of a Feasible Solutions Algorithm

S. A. Janse
Center for Biostatistics, The Ohio State University, Columbus, OH, USA

K. L. Thompson
Department of Statistics, University of Kentucky, Lexington, KY, USA
e-mail: katherine.thompson@uky.edu

1 Introduction
Many approaches exist to identify interaction effects in data sets with small
to moderate size. Classical statistical methods suggest considering all pairwise
combinations of possible explanatory variables in the proposed logistic regression or
linear regression model, and selecting a set of variables based on either hypothesis
tests or a model selection criterion. Although the theory supporting these techniques
is developed, often the data sets of interest have an inordinate number of possible
explanatory variables to consider in higher-order interactions using the conventional
implementations of classical methods.
For example, genomic data is also unique in its complexity due to intricate
dependencies among genes and traits, often in the form of information from external
influences or genetic makeup that are unaccounted for during analysis. In particular,
interaction effects among genes contribute to epistasis, which is especially difficult
to identify in genomic data (Moore and Williams 2009). These interactions have
been targets of recent analyses of genomic data although development of methods
to identify higher-order interaction effects has been more limited due to compu-
tational concerns (Gemperline 1999). For example, even using the second largest
2 Background
Issues from the complex nature of interaction effects, coupled with the size of
data, cause theoretical and computational problems when classical methods are
applied using standard implementations. To address these limitations, some recent
work has been focused on revisiting versions of the Feasible solutions algorithm
3 Methods
In FSA, each random start begins with an arbitrary model with a fixed number of predictors and proceeds by taking steps to better models based on some optimization criterion, e.g., $R^2$. The algorithm proceeds until it reaches an optimal model for a given random start. Thus, each random start, or replication of FSA, will have at least one step, but often will have several more. FSA is not guaranteed to identify
the optimal solution, but as the analyst increases the number of random starts, FSA
is more likely to do so. Thus, we need enough random starts to obtain the optimal
solution with some probability. However, the larger the number of random starts,
the longer the time it will take FSA to run. Therefore, it would be highly useful to
have information regarding how many random starts to choose in order to obtain
the optimal solution with some probability while still maintaining computational
efficiency.
As the number of explanatory variables, p, in a data set increases, it is more
difficult to identify the optimal solution and will require more random starts. We
propose choosing the number of random starts as a function of p. As p goes to
infinity, the probability that the optimal solution is identified by FSA is bounded
below. The limit described in Theorem 1 holds for FSA in the case of considering
m-way interactions.
Theorem 1 In the case of using FSA to find a statistically significant m-way
interaction in a predictive model, as the number of potential explanatory variables,
p, goes to infinity, a lower bound on the probability of identifying the statistically
optimal model in $cp$ random starts, where $0 < c < 1$, is $1 - e^{-cm^2}$.
Lemma

$$\lim_{x\to\infty}\left(1 + \frac{k}{x}\right)^{tx} = e^{tk}$$
and so the probability of obtaining the optimal solution in the first step of a given
random start is
$$1 - \frac{\binom{p-m}{m}}{\binom{p}{m}}. \tag{2}$$
For a given random start, FSA completes at least one step, and often more than
one step, before reaching a feasible solution. Since we are only considering finding
the statistically optimal solution after the first step and not considering the cases
where we could find the optimal solution in later steps, Eq. (2) will be a lower bound
on the probability of identifying the statistically optimal solution in a single random
start. So, the probability of obtaining the statistically
p−m optimal solution
p−m in cpat least
cp
( m ) ( m )
one of the cp random starts is greater than 1 − , where is the
(mp ) (mp )
probability that none of the random starts identify the optimal solution in the first
step of FSA. So we consider
$$
\begin{aligned}
\lim_{p\to\infty}\left[1 - \left(\frac{\binom{p-m}{m}}{\binom{p}{m}}\right)^{cp}\right]
&= 1 - \lim_{p\to\infty}\left(\frac{\binom{p-m}{m}}{\binom{p}{m}}\right)^{cp}\\
&= 1 - \lim_{p\to\infty}\left(\frac{(p-m)!}{m!\,(p-2m)!}\cdot\frac{m!\,(p-m)!}{p!}\right)^{cp}\\
&= 1 - \lim_{p\to\infty}\left(\frac{(p-m)!\,(p-m)!}{p!\,(p-2m)!}\right)^{cp}\\
&= 1 - \lim_{p\to\infty}\left(\frac{(p-m)!}{(p-2m)!\;p(p-1)\cdots(p-m+1)}\right)^{cp}\\
&= 1 - \lim_{p\to\infty}\left(\frac{(p-m)(p-m-1)\cdots(p-2m+1)}{p(p-1)\cdots(p-m+1)}\right)^{cp}.
\end{aligned}
$$

Notice that both the numerator and denominator in the limit statement contain $m$ factors. Thus we can write the last line above as

$$
\begin{aligned}
&= 1 - \lim_{p\to\infty}\left(\frac{p-m}{p}\right)^{cp}\left(\frac{p-m-1}{p-1}\right)^{cp}\cdots\left(\frac{p-2m+1}{p-m+1}\right)^{cp}\\
&= 1 - \lim_{p\to\infty}\left(\frac{p-m}{p}\right)^{cp}\lim_{p\to\infty}\left(\frac{p-m-1}{p-1}\right)^{cp}\cdots\lim_{p\to\infty}\left(\frac{p-2m+1}{p-m+1}\right)^{cp}\\
&= 1 - \lim_{p\to\infty}\left(1 - \frac{m}{p}\right)^{cp}\lim_{p\to\infty}\left(1 - \frac{m}{p-1}\right)^{cp}\cdots\lim_{p\to\infty}\left(1 - \frac{m}{p-m+1}\right)^{cp}.
\end{aligned}
$$
Then we have

$$\lim_{p\to\infty}\left(1 - \frac{m}{p}\right)^{cp} = e^{-cm}
\qquad\text{and}\qquad
\lim_{p\to\infty}\left(1 - \frac{m}{p-1}\right)^{cp} = e^{-cm}.$$

Next,

$$\lim_{p\to\infty}\left(1 - \frac{m}{p-m+1}\right)^{cp}
= \lim_{p\to\infty}\left(1 - \frac{m}{p-m+1}\right)^{c(p-m+1)}\left(1 - \frac{m}{p-m+1}\right)^{c(m-1)}.$$

Since $\lim_{p\to\infty}\left(1 - \frac{m}{p-m+1}\right)^{c(p-m+1)} = e^{-cm}$ by the lemma with $t = c$ and $k = -m$, and $\lim_{p\to\infty}\left(1 - \frac{m}{p-m+1}\right)^{c(m-1)} = 1$, we have

$$\lim_{p\to\infty}\left(1 - \frac{m}{p-m+1}\right)^{cp} = e^{-cm}.$$

So,

$$1 - \lim_{p\to\infty}\left(\frac{\binom{p-m}{m}}{\binom{p}{m}}\right)^{cp} = 1 - \underbrace{e^{-cm}\times e^{-cm}\times\cdots\times e^{-cm}}_{m\ \text{times}} = 1 - e^{-cm^2}.$$
Fig. 1 The points show the calculated probability of obtaining the optimal solution for several values of c (e.g., c = 0.20 and c = 0.40), and the lines show the asymptotic lower bound on the probability of getting the statistically optimal solution. The asymptotic lower bound is attained very quickly, and the probability of obtaining the statistically optimal solution increases as the number of random starts increases, as expected
Figure 1 shows how the calculated probability of obtaining the optimal solution
approaches the lower bound derived above for 5 values of c with m = 2. It can
be seen that the lower bound is attained very quickly and thus is appropriate when
considering data sets with a large number of explanatory variables, p. It is also clear
that the probability of obtaining the statistically optimal solution increases as the
number of starts increases, as is expected.
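For a quick numerical look at this behavior, one can compare the exact first-step probability with the asymptotic bound. The Python sketch below does so for m = 2 and c = 0.2, both values chosen only for illustration.

```python
from math import comb, exp

def exact_first_step_prob(p, c, m=2):
    """1 - [C(p-m, m) / C(p, m)]^(cp): chance that at least one of cp random
    starts finds the optimal m-set in its first step (plotted in Fig. 1)."""
    fail = comb(p - m, m) / comb(p, m)
    return 1.0 - fail ** (c * p)

def asymptotic_lower_bound(c, m=2):
    """Limit from Theorem 1: 1 - exp(-c * m^2)."""
    return 1.0 - exp(-c * m * m)

for p in (50, 100, 1000):
    print(p, round(exact_first_step_prob(p, c=0.2), 4),
          round(asymptotic_lower_bound(c=0.2), 4))
```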
4 Results
Simulation studies were performed for both quantitative and binary response
variables to examine the outcomes of utilizing the lower bound derived above. These
simulations were followed by a real data analysis to demonstrate the use of the lower
bound in practice.
4.1 Simulations
Quantitative trait data were simulated as the sum of two covariates and their
interaction under the typical regression model for values of p of 50, 100, 1000, and
2500. One hundred data sets were simulated for each value of p. Binary trait data
were simulated in an analogous manner. Simulation parameters are as follows:

• Quantitative response variable (lmFSA)
  – $X_{ij} \sim U(0, 1)$ for $i = 1, \dots, n$ and $j = 1, \dots, p$
  – $Y_i = 5 + X_{i1} + X_{i2} + 2X_{i1}X_{i2} + \epsilon_i$, where $\epsilon_i \sim N(0, 1)$
• Binary response variable (glmFSA)
  – $X_{ij} \sim U(0, 1)$ for $i = 1, \dots, n$ and $j = 1, \dots, p$
  – $\pi_i = \dfrac{e^{X_{i1} + X_{i2} + 2X_{i1}X_{i2}}}{1 + e^{X_{i1} + X_{i2} + 2X_{i1}X_{i2}}}$
  – $Y_i = \begin{cases}1 & \text{with probability } \pi_i\\ 0 & \text{with probability } 1 - \pi_i\end{cases}$
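For readers who want to reproduce the setup outside of R, a minimal Python sketch of the same generating model is shown below. The sample size, p, and seed are arbitrary choices here; the original simulations used the rFSA implementation in R.

```python
import numpy as np

rng = np.random.default_rng(2020)  # seed chosen arbitrarily
n, p = 250, 50                     # sample size and number of predictors

X = rng.uniform(0.0, 1.0, size=(n, p))                        # X_ij ~ U(0, 1)
# Quantitative response: Y = 5 + X1 + X2 + 2 X1 X2 + eps, eps ~ N(0, 1)
y_quant = (5 + X[:, 0] + X[:, 1] + 2 * X[:, 0] * X[:, 1]
           + rng.standard_normal(n))
# Binary response: Y ~ Bernoulli(pi) with logit(pi) = X1 + X2 + 2 X1 X2
eta = X[:, 0] + X[:, 1] + 2 * X[:, 0] * X[:, 1]
y_bin = rng.binomial(1, np.exp(eta) / (1.0 + np.exp(eta)))
```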
FSA was used to provide a set of feasible solutions for every simulated data
set via the implementation in Lambert et al. (2018). Exhaustive search was then performed to find the statistically optimal solution, using $R^2$ and AIC as the criterion functions for the quantitative and binary response variables, respectively. The numbers of random starts chosen for FSA corresponded to values of c of 0.01, 0.02, 0.1, 0.2, and 0.4 for each value of p. Then, for each simulation setting, the
percentage of simulated data sets producing the statistically optimal solution using
FSA was calculated. These percentages, along with the lower bound from Sect. 3,
are plotted in Fig. 2.
Figure 2 shows the results from 100 simulated data sets for both methods in FSA
for four values of p and five values of c. Note that the asymptotic lower bound
proposed here depends only on m and c. The red dots represent these lower bounds
for each value of c. The yellow diamonds, blue dots, green squares, and black
triangles represent the percentage of 100 simulations with p = 50, 100, 1000, and
2500, respectively, when m = 2 [where FSA was able to identify the statistically
optimal solution]. It is clear from Fig. 2 that the lower bound is often much lower than the observed probability and is thus very conservative, but it does provide good guidance as to the number of random starts needed to produce at least one feasible solution containing the statistically optimal solution.
Data were collected in a genome-wide association study using 288 outbred mice in
a study that aimed to identify, or map, locations along the genome called SNPs that
influence HDL cholesterol, systolic blood pressure, triglyceride levels, glucose, or
Fig. 2 Simulation results for the probability of getting the optimal solution in 100 simulations
with a quantitative response variable (left) and a binary response variable (right): for each value
of c, the lower bounds are represented by the red dots and the probability of identifying the
statistically optimal solution for the four values of p is represented by yellow diamonds (p = 50),
blue dots (p = 100), green squares (p = 1000), and black triangles (p = 2500). Both the left and
right plots show that the lower bound is valid for all values of p in the simulation study
urinary albumin-to-creatinine ratios (Zhang et al. 2012). Our goal was to determine
if SNPs or interactions of SNPs were associated with HDL levels. Information from
3045 SNPs on chromosome 11 were analyzed for this real data analysis.
Using the lower bound in Theorem 1, if we want the probability of obtaining the
statistically optimal solution including a 2-way interaction to be at least 95%, then
we need to solve the following equation for c:
$$1 - e^{-cm^2} = 1 - e^{-4c} = 0.95 \iff c = 0.7489331.$$
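This computation is easy to script. A small Python check using only the standard library reproduces the value of c and the implied number of random starts for the 3045 SNPs analyzed here.

```python
from math import ceil, log

def c_for_target(target, m):
    """Solve 1 - exp(-c * m**2) = target for c (lower bound of Theorem 1)."""
    return -log(1.0 - target) / (m * m)

c = c_for_target(0.95, m=2)
print(round(c, 7))         # 0.7489331, matching the value above
print(ceil(0.75 * 3045))   # rounding c up to 0.75 gives 2284 random starts
```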
Table 1 The exhaustive search produced the single statistically optimal solution with R² = 0.1256308 (column 3)
Variable 1   Variable 2   R²
mb2863979 mb87344525 0.1256308
Columns 1 and 2 show the SNPs that were identified in this model
Table 2 FSA produced 33 feasible solutions and a subset of those are shown here, including the statistically optimal solution denoted in bold with R² = 0.1256308 (column 3)
Variable 1   Variable 2   Times chosen by FSA   R²
mb104327194 mb91638370 42 0.0957401
mb13136127 mb31255782 898 0.1245719
mb28636979 mb87344525 107 0.1256308
mb111935889 mb43233761 25 0.1065257
mb62443411 mb99541026 23 0.1088855
mb112250554 mb96331482 56 0.1123864
Columns 1 and 2 show the SNPs that were identified in each of the models
be found in the supplemental materials.) Out of the 2284 replications of FSA, the
statistically optimal solution was identified in 107 of the replications, showing that
the number of random starts used here was sufficient.
times to run FSA will improve the computational usability of FSA by allowing the
user to choose fewer random starts based on the desired likelihood of obtaining the
statistically optimal solution while still being computationally feasible, and continue
providing a valid alternative to exhaustive search methods.
References
Friedman, J., Hastie, T., Tibshirani, R.: glmnet: Lasso and elastic-net regularized generalized linear
models. R package version 1(4) (2009)
Gemperline, P.J.: Computation of the range of feasible solutions in self-modeling curve resolution
algorithms. Anal. Chem. 71(23), 5398–5404 (1999)
Goudey, B., Abedini, M., Hopper, J.L., Inouye, M., Makalic, E., Schmidt, D.F., Wagner, J., Zhou,
Z., Zobel, J., Reumann, M.: High performance computing enabling exhaustive analysis of
higher order single nucleotide polymorphism interaction in genome wide association studies.
Health Inf. Sci. Syst. 3(1), 1 (2015)
Hawkins, D.M.: The feasible set algorithm for least median of squares regression. Comput. Stat.
Data Anal. 16(1), 81–101 (1993)
Hawkins, D.M.: A feasible solution algorithm for the minimum volume ellipsoid estimator in
multivariate data. Comput. Stat. 8, 95–95 (1993)
Hawkins, D.M.: The feasible solution algorithm for least trimmed squares regression. Comput.
Stat. Data Anal. 17(2), 185–196 (1994)
Hawkins, D.M.: The feasible solution algorithm for the minimum covariance determinant estimator
in multivariate data. Comput. Stat. Data Anal. 17(2), 197–210 (1994)
Hawkins, D.M., Olive, D.J.: Improved feasible solution algorithms for high breakdown estimation.
Comput. Stat. Data Anal. 30(1), 1–11 (1999)
Lambert, J., Gong, L., Elliot, C.F., Thompson, K., Stromberg, A.: rFSA: an R package for finding
best subsets and interactions. R J. 10(2), 295–308 (2018)
Lumley, T., Miller, A.: Leaps: regression subset selection. R package version 2 (2004)
Miller, A.J.: Selection of subsets of regression variables. J. R. Stat. Soc. Ser. A Gen. 147(3), 389–
425 (1984)
Moore, J.H., Williams, S.M.: Epistasis and its implications for personal genetics. Am. J. Hum.
Genet. 85(3), 309–320 (2009)
Zhang, W., Korstanje, R., Thaisz, Staedtler, F., Harttman, N., Xu, L., Feng, M., Yanas, L., Yang,
H., Valdar, W., et al.: Genome-wide association mapping of quantitative traits in outbred mice.
G3: Genes Genomes Genetics 2(2), 167–174 (2012)
A Primer of Statistical Methods for Classification

R. Dey and M. S. Mulekar

1 Introduction
(Scatterplots of the iris data, with points labeled by class of Iris: Iris-setosa, Iris-versicolor, Iris-virginica.)
shows a scatterplot of iris petal-length versus petal-width. Once again setosa are
clearly separated from the other two varieties, but there is still some overlap between
virginica and versicolor possibly leading to misclassification.
In this article, we aim to provide a basic description of the most well-known and commonly used classification methods, which are used to develop classifiers (or classification rules) based on the relation between the response variable $Y$ and explanatory variables $X$; these rules are then used to assign new objects to the known groups based on an observed $x_0$. Two soft classifiers (logistic regression and the naïve Bayes estimator) and four hard classifiers (linear discriminant analysis, support vector machines, K nearest neighbor, and classification trees) are described in Sects. 2 and 3, along with their strengths and weaknesses. Some
discussion assessing performance of these classifiers for five different datasets, three
real and two simulated, is provided in Sect. 4. Some concluding remarks about
choice of classifiers in practice are provided in Sect. 5.
2 Soft Classifiers
Intuitively, a soft classifier should appeal to anyone who likes to incorporate the
uncertainty of outcome provided by classifiers because it also shows the likelihood
of a new observation being a member of different classes. Here the two most commonly used soft classifiers, namely logistic regression and naïve Bayes, are discussed.
$$\operatorname{logit}(\pi_{i1}) = \log\!\left(\frac{\pi_{i1}}{1 - \pi_{i1}}\right) = \beta_0 + \sum_{j=1}^{p}\beta_j x_{ji} \tag{2}$$
$$L\bigl(\beta_0, \beta_1, \dots, \beta_p\bigr) = \prod_{i=1}^{n}\pi_i^{y_i}(1 - \pi_i)^{1 - y_i}.$$
Since there are no closed-form solutions available for maximizing this likelihood
function, iterative algorithms are used to obtain the ML estimates of regression
parameters. According to Agresti (2013), the most popular choices for iterative
algorithms are either the Newton–Raphson algorithm (Tjalling 1995) or itera-
tively reweighted least square (IRWLS) algorithm (Burrus et al. 1994). However,
sometimes due to the use of too many explanatory variables or highly correlated
explanatory variables, these algorithms fail to converge resulting in failure to
estimate parameters. Another counter-intuitive situation sometimes occurs when
there is a complete separation between two classes using some linear combination
of explanatory variables. More information on estimating parameters of logistic
regression is available in Menard (2002).
A simple extension of logistic regression from binary to multiclass classification
is known as multinomial logistic regression. The multinomial logistic model is
given by,
$$\pi_{ig} = \frac{\exp\!\left(\beta_{0g} + \sum_{j=1}^{p}\beta_{jg}x_{ji}\right)}{1 + \sum_{g'=1}^{G-1}\exp\!\left(\beta_{0g'} + \sum_{j=1}^{p}\beta_{jg'}x_{ji}\right)},\quad g = 1, 2, \dots, G-1\ \text{and}\ i = 1, 2, \dots, n. \tag{3}$$
Here the $G$th class serves as the baseline, and $\pi_{iG}$ can be obtained using the fact that $\pi_{iG} = 1 - \sum_{g=1}^{G-1}\pi_{ig}$. From the point of estimation, there are $(p+1)(G-1)$ model parameters to be estimated.
For estimating these parameters, the ML estimation or the maximum a posteriori
(MAP) methods are commonly used (Murphy 2012). Estimation method MAP is
similar to ML in the sense that it chooses that value of parameter which maximizes
the value of a mathematical function, in this case the posterior distribution of the
parameter itself. Most of the times, a closed-form solution is not available, hence
different algorithms are used for estimation and IRWLS is a popular choice among
practitioners.
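As a brief illustration in Python, scikit-learn's LogisticRegression fits the multinomial model (3) by iterative maximization of the likelihood; its lbfgs solver plays the role of the Newton–Raphson/IRWLS algorithms mentioned above. The data below are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.standard_normal((300, 4))
y = rng.integers(0, 3, size=300)   # G = 3 classes; labels are synthetic

# lbfgs iteratively maximizes the multinomial likelihood, in the spirit of
# the Newton-Raphson / IRWLS algorithms discussed above.
clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X, y)
print(clf.predict(X[:5]))          # hard class assignments
print(clf.predict_proba(X[:5]))    # soft probabilities, the pi_ig of (3)
```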
If the G ≥ 2 classes are ordered using an ordinal response variable, an alternative
popular model often used in practice is the proportional-odds cumulative logit
model. For example, consider a typical Likert scale question where the responders
are asked to grade certain experience on a scale of 1 to 5 with 1 being the worst
rating and 5 being the best. It might be of interest to determine if there exist some
explanatory variables that can explain how the responders rate their experience. First
developed by Snell (1964), this model is given by,
$$L_{ig} = \log\!\left(\frac{\sum_{c=1}^{g}\pi_{ic}}{\sum_{c=g+1}^{G}\pi_{ic}}\right) = \beta_{0g} + \sum_{j=1}^{p}\beta_j x_{ji},\quad \text{for } g = 1, 2, \dots, G-1 \text{ and } i = 1, 2, \dots, n. \tag{4}$$
Naïve Bayes (NB) is a family of soft classifiers that uses the Bayes theorem (Bayes
1763) along with a very strong assumption of independence among explanatory
variables which is often unrealistic. However, this classifier works very well in
the presence of dependencies among many categorical explanatory variables (Rish
2001) and is quite fast to execute even with large datasets.
NB classifier differs from the logistic regression classifier in terms of how
the probability π ig is modeled. When using a logistic regression classifier,
π ig = P(yi = g| xi ) is modeled directly from data. On the other hand, when using
a Naïve Bayes classifier, first the estimates for P(xi | yi = g) are obtained from
data and then assuming independence among explanatory variables, π ig is modeled
using Bayes theorem as,
$$\pi_{ig} \propto P(y_i = g)\prod_{j=1}^{p} P\bigl(x_{ji}\mid y_i = g\bigr),\quad g = 1, 2, \dots, G\ \text{and}\ i = 1, 2, \dots, n. \tag{5}$$
The estimate for P(yi = g) can be obtained from the training set as the proportion
of training set observations that belong to class g (g = 1, 2, . . . , G). The estimates
for P(xji | yi = g) are typically obtained via ML estimation technique. A new
observation is assigned to a group for which probability π ig is maximum among
all G groups.
Estimating parameters from the likelihood function depends on how the likeli-
hood, P(xji | yi = g), i = 1, 2, . . . , n, j = 1, 2, . . . , p, and g = 1, 2, . . . , G
is modeled parametrically. If Xj is a continuous random variable, then the popular
choice of distribution is normal (Gaussian) such that (Xj | Y = g)∼N(μg , g ). If Xj
is a categorical random variable with m categories, then the most commonly used
distribution is multinomial, i.e., (Xj | Y = g)∼Multinomial(1, φ 1g , . . . , φ mg ) for one
φmlg , l = 1, 2, . . . m is the probability associated with the l category
trial where th
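A small illustration with synthetic data and Gaussian class-conditionals assumed: scikit-learn's GaussianNB estimates P(x_j | y = g) as normal densities and returns the posterior probabilities π_ig.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0.0, 1.0, (100, 3)),      # class 0
               rng.normal(1.5, 1.0, (100, 3))])     # class 1
y = np.repeat([0, 1], 100)

nb = GaussianNB().fit(X, y)      # estimates P(x_j | y = g) as normal densities
print(nb.predict_proba(X[:3]))   # posterior probabilities pi_ig via (5)
print(nb.predict(X[:3]))         # class with the largest posterior
```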
3 Hard Classifiers
$$P(Y = g\mid X) = \frac{P(X\mid Y = g)}{P(X)}\,P(Y = g)\quad\text{for } g = 1, 2,$$
where pg = P(Y = g) for g = 1, 2 is the overall class probability and can be estimated
from the training data. Under the assumption that the explanatory variables are
multivariate normal, the hyperplane can be found by solving the following equation
for r,
$$\log\!\left(\frac{p_1}{p_2}\right) + r'\Sigma^{-1}(\mu_1 - \mu_2) - \frac{1}{2}\left(\mu_1'\Sigma^{-1}\mu_1 - \mu_2'\Sigma^{-1}\mu_2\right) = 0. \tag{7}$$
Solution to (7) leads to a linear classifier (or a linear boundary between two groups) because Eq. (6) is a linear function of explanatory variables. The first step in LDA is to estimate the mean vectors ($\mu_1$ and $\mu_2$) and the variance–covariance matrix ($\Sigma$) from the training dataset. For any new observation, $x_0$, one can estimate the discriminant score from (8) as

$$\hat{\ell}(x_0) = \log\!\left(\frac{\hat p_1}{\hat p_2}\right) + x_0'\hat\Sigma^{-1}\bigl(\hat\mu_1 - \hat\mu_2\bigr) - \frac{1}{2}\left(\hat\mu_1'\hat\Sigma^{-1}\hat\mu_1 - \hat\mu_2'\hat\Sigma^{-1}\hat\mu_2\right). \tag{8}$$
As can be seen from (7) and (9), a QDA requires more parameters to be estimated
from the training dataset, precisely (2 + 2p + 2p2 ) parameters for QDA compared
to (2 + 2p) for LDA. That can lead to a serious issue if the training dataset is small.
To overcome this issue, Srivastava et al. (2007) proposed an effective Bayesian
solution.
A simpler method under the assumption of homoscedasticity of variance–
covariance matrices is to use Mahalanobis distance (Mahalanobis 1936) for clas-
sification. For any new observation, x0 , a linear discriminant function LDFg
is computed for each group (see (10)) under the assumption that μg and ,
respectively, are the unknown mean vector and variance–covariance matrix of X.
$$LDF_g(x_0) = x_0'\hat\Sigma^{-1}\hat\mu_g - \frac{1}{2}\hat\mu_g'\hat\Sigma^{-1}\hat\mu_g + \log\hat p_g \tag{10}$$
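A brief sketch of LDA in practice on synthetic two-class Gaussian data: scikit-learn pools the class covariances into a single Σ̂ and classifies by the largest discriminant score, in the spirit of (10).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(7)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 100),
               rng.multivariate_normal([2, 2], np.eye(2), 100)])
y = np.repeat([1, 2], 100)

# A pooled covariance estimate plays the role of the common Sigma-hat, and a
# new point is assigned to the group with the largest discriminant score.
lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict([[1.0, 1.0]]), lda.predict([[2.5, 2.0]]))
```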
Assumed to have originated long ago, the history of $K$ ($1 < K < n$) nearest neighbor (KNN) classification is not really well known. In modern times, Sebestyen (1962) described this method as a proximity algorithm and Nilsson (1965) called it the minimum distance classifier. Cover and Hart (1967) were the first to name this algorithm the nearest neighbor, and that name became popular.
Although mostly used as a hard classifier, KNN can be used as a soft classifier
too. The idea behind KNN is quite simple and no parametric assumption is required.
Given a training dataset of size n(n > K), this classification algorithm starts when
a new observation, x0 = (x01 , . . . , x0p ) , is recorded with known values for all
explanatory variables but unknown class. The first step is to calculate the K nearest
neighbors in terms of the explanatory variables. Using some well-defined distance
measure, distance di = d(x0 , xi ), i = 1, 2, . . . , n between this new observation
and each observation from the training dataset is calculated and these distances
are ordered as d(1) ≤ d(2) ≤ . . . ≤ d(n) . Considering the lowest K distances,
{d(1) , d(2) , . . . , d(K) }, the class membership of these closest K neighbors in the
training dataset is determined. Then the new observation is placed in the class that
has the largest number of these $K$ neighbors. For example, suppose $k_g$ of the nearest $K$ neighbors belong to group $g$ ($g = 1, 2, \dots, G$) such that $\sum_{g=1}^{G} k_g = K$; then the new observation is placed in the group $c$ if $k_c = \max\{k_g,\ g = 1, 2, \dots, G\}$.
Note that there is a possibility that no such unique maximum exists for a given
new observation and a chosen K, thus resulting in ties. Although not exactly a
group inclusion probability, these nearest neighbors can be used to provide a group
membership indicator of the new observation using relative fractions (kg /K), g = 1,
2, . . . , G.
Now the question is: how to choose value of K, the number of nearest neighbors
to be used? Given a large dataset one can always use cross-validation and choose the
K value corresponding to the lowest misclassification rate in the validation dataset.
Note that choice of a too small value for K indicates that the space generated
by the explanatory variables is divided into many small subspaces and the class
membership of a new observation depends on which subspace the new observation
belongs to. In that case outliers in the original dataset can create problems in
predicting the class membership of a new observation that is close to the outlier
resulting in a higher variance in prediction. However, choice of a large value for K
basically leads to division of the training data space into $G$ smooth subspaces, which in turn creates the problem of misclassification of any outlier of these subspaces and subsequently higher bias in prediction. As a rule of thumb, $K = \sqrt{n}$ is considered to be a sensible choice for the number of neighbors in practice. If the number of groups in the data is 2 (i.e., $G = 2$), then $K$ should be an odd number to avoid the possibility of ties in group membership indicators.
The most popular choice for a distance measure is the Euclidean distance, which for a new observation, $x_0 = (x_{01}, \dots, x_{0p})'$, is calculated as

$$d_i = \sqrt{\sum_{j=1}^{p}\bigl(x_{0j} - x_{ji}\bigr)^{2}},\quad i = 1, 2, \dots, n. \tag{11}$$
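A short Python illustration on synthetic data that follows the rule of thumb above: K is set near √n and made odd for the two-class case, and the estimated group membership fractions k_g/K are available alongside the hard labels.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(8)
X = rng.uniform(0.0, 1.0, (225, 2))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # synthetic two-class labels

K = int(round(np.sqrt(len(X))))             # rule of thumb: K near sqrt(n)
if K % 2 == 0:
    K += 1                                  # odd K avoids ties when G = 2

knn = KNeighborsClassifier(n_neighbors=K, metric="euclidean").fit(X, y)
x0 = [[0.3, 0.4]]
print(knn.predict(x0), knn.predict_proba(x0))  # fractions k_g / K
```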
Some other distance measures used commonly in practice are Hamming distance
(Hamming 1950) and Chebyshev distance (Grabusts 2011). Chomboon et al.
(2015) looked at eleven different distance measures and found that the Euclidean,
Chebyshev, and Mahalanobis distance measures perform well. For a synopsis of
different distance measures, please refer to Mulekar and Brown (2014). Different
explanatory variables tend to have different range of possible values and some
distance measures such as the Euclidean distance tend to be affected by the
range of measurements. Hence in practice, datasets are typically normalized before
classification to reduce the influence of explanatory variables with larger range of
measurements. When using a dataset with a large number of explanatory variables,
to reduce the computation time, a dimension reduction technique such as principal
component analysis (PCA) is used. To overcome the problem of choosing a value
for $K$, Samworth (2012) suggested the use of a weighted nearest neighbor algorithm in which, instead of choosing $K$ nearest neighbors (i.e., essentially assigning a weight of $1/K$ to each of the $K$ nearest neighbors and zero to the rest), a decreasing weight is assigned to each neighbor according to its distance rank.
Support vector machine (SVM) is a class of hard classifiers. For a binary classification with $p$ explanatory variables, an SVM classifier constructs a $(p-1)$-dimensional hyperplane in the $p$-dimensional space to maximize the margin. Here margin refers to the distance between the observation closest to the boundary of a group and the remaining groups. Points on or closest to the boundary of the decision surface
and the remaining groups. Points on or closest to the boundary of decision surface
are called support vectors and they are used in learning models associated with
the classification algorithm. The idea behind SVM is to find that hyperplane which
provides the maximum margin from support vectors among infinitely many possible
hyperplanes that can separate two groups provided the two groups are completely
separable. For a binary classification, one hyperplane known as the maximum
margin hyperplane is constructed. For G > 2 groups, more than one such maximum
margin hyperplanes need to be created to separate groups and a combination of these
hyperplanes is used for the classification of a new observation.
Consider the case of binary classification, and assume that there actually exists a
linear hyperplane of the form
$$W(X) = w_0 + \sum_{j=1}^{p} w_j X_j \tag{12}$$
that can perfectly differentiate between two classes. Then a method described by
Vapnik and Lerner (1963) can be used to find a maximum margin hyperplane.
Maximum margin hyperplane is a hyperplane for which W(X) = 0. In SVM, only
support vectors obtained using the training data are used to estimate the coefficients
of explanatory variables in (12). Since the decision surface differentiates the classes
completely, the linear function in (12) should be positive for one group and negative
for another. Without any loss of generality, assume that for support vector(s) in
group 1, $\hat W(X) = -1$, and for those in group 2, $\hat W(X) = 1$. In order to maximize the margin, it is sufficient to minimize $\sum_{j=1}^{p} w_j^2$ subject to $v_i\hat W(x_i) \ge 1$, $i = 1, 2, \dots, n$, where $v_i = -1$ if $y_i = 1$ and $v_i = 1$ if $y_i = 2$. Thus this hyperplane can be obtained by minimizing the Lagrangian formulation

$$L = \sum_{j=1}^{p} w_j^2 - \sum_{i=1}^{n} a_i\bigl(v_i W(x_i) - 1\bigr)$$
$$\theta\sum_{j=1}^{p} w_j^2 + \frac{1}{n}\sum_{i=1}^{n}\max\Bigl(0,\ 1 - v_i\hat W(x_i)\Bigr) \tag{13}$$
$$\kappa(x_i, x_l) = \exp\!\left(-\xi\sum_{j=1}^{p}\bigl(x_{ij} - x_{lj}\bigr)^{2}\right).$$
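A short scikit-learn illustration on synthetic data with a circular class boundary (echoing the bivariate uniform example in Sect. 4): the gamma parameter of SVC plays the role of ξ in the Gaussian kernel displayed above.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(9)
X = rng.uniform(-1.25, 1.25, (400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 <= 1.0).astype(int)  # circular boundary

# gamma plays the role of xi in the Gaussian kernel displayed above.
svm = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)
print(svm.score(X, y))       # training accuracy
print(svm.support_.size)     # number of support vectors found
```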
To develop a multiclass SVM classifier (i.e., for G > 2), there are a few options available. In a method known as one-against-all, G SVM classifiers are obtained for each class separately, and a new observation is assigned to the class chosen by the maximum number of these classifiers (Bottou et al. 1994). In another method, known as one-against-one (Kressel 1998), G(G − 1)/2 SVM classifiers, each separating a pair of classes, are obtained, and a new observation is assigned to the class that is predicted by the most classifiers. Hsu and Lin (2002), who compared their performances, concluded that one-against-one performs better
than one-against-all in most of the situations that they studied. There are many
modifications of SVM proposed by researchers from different fields that work better
in certain specific situations. Typically, SVM works really well if there exists a good
separation between classes or when the number of explanatory variables is large
compared to the sample size of the training dataset. SVM is not computationally
effective when using a very large training dataset.
Classification trees (CT) are methods used to partition the space of explanatory
variables into disjoint subsets and assign a class to each subset by minimizing some
measure of misclassification, also known as impurity. It is a visually appealing method and can be easily and effectively explained to non-scientific audiences. CT can handle large datasets as well as missing data, and it can easily
ignore bad explanatory variables. However, sometimes depending on the dataset CT
can produce a really bad partition of the space of explanatory variables leading to
high misclassification rates.
CT produces a flowchart or tree structure starting with a root node (one
explanatory variable) and then, proceeds with splits (internal nodes) until no split
is deemed necessary (leaf nodes). Each leaf node is assigned to a class. There are
many algorithms on how to select a root node, how to split a node, how many splits
of each node are needed, and when to stop splitting a node to make it a leaf node.
An example of classification tree is shown in Fig. 3. It shows classification of a
random sample of n = 78 from iris data by R.A. Fisher (Anderson 1935) using JMP
12 into one of the three classes using two explanatory variables petal-width and
petal-length.
The root node is typically chosen with an explanatory variable that provides
the lowest rate of misclassification. This is easily achieved when the number of
explanatory variables is small. For example, let there be two classes (G = 2) and
one explanatory variable (p = 1). Consider the rule for using two complementary
subgroups Ag created by a split with the explanatory variable such that, yi = g if
x1i ∈ Ag , g = 1, 2. For each split a Gini impurity measure is computed as,
$$I(CT) = \sum_{g=0}^{1}\left(1 - \sum_{j=0}^{1} q_g(j)^{2}\right)$$

where

$$q_g(j) = \frac{\sum_{i=1}^{n} I\bigl(y_i = j;\ x_{1i}\in A_g\bigr)}{\sum_{i=1}^{n} I\bigl(x_{1i}\in A_g\bigr)}$$
is the misclassification rate for group $g$, and the splits are chosen such that the Gini impurity measure is minimized (Witten et al. 2011). As shown in Fig. 3, LogWorth, defined for each model as $-\log_{10}(p\text{-value})$, is another measure used to decide where to split. The $p$-value can be based on a chi-square test for a split that maximizes the LogWorth value.
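For illustration, the following Python sketch grows a small Gini-based classification tree on the same iris variables used in Fig. 3 (petal length and petal width); the depth limit is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:4]   # petal length and petal width, as in Fig. 3

# criterion="gini" grows the tree by greedily minimizing Gini impurity;
# the depth limit is an arbitrary illustrative choice.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, iris.target)
print(tree.score(X, iris.target))   # resubstitution accuracy
```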
The first classification tree algorithm was proposed by Messenger and Mandell
(1972). However, Breiman et al. (1984) have provided what became the most
popular classification tree algorithm, namely classification and regression trees
(CART). Several improved versions of it were proposed later and are still used in
practice. Many modifications of the CART method have been proposed for various
reasons, but mainly because CART produces biased and high-variance trees, i.e.,
changing the training set can drastically change the tree diagram. A few Bayesian
versions of this algorithm are also available in the literature (Chipman et al. 1998;
Denison et al. 1998). Loh (2009) provides a classification algorithm called GUIDE, which is computationally faster and incorporates the nearest neighbor algorithm to improve the CT.
To reduce the variance in CT, two new methods were proposed, namely the
bagging method (Breiman 1996) and the random forests method (Breiman 2001).
Note that these methods do not produce a tree diagram but they focus on obtaining
many classification trees from the training data so that a new observation that needs
to be classified is put into the class suggested by majority of these trees. Bagging is
simply achieved by obtaining bootstrap samples with replacement from the training
data. Random forests are similar to bagging, but in each tree only a randomly chosen subset of typically $\sqrt{p}$ explanatory variables is considered when determining the nodes.
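A minimal sketch of the idea with scikit-learn (the iris data are used purely as a convenient example): each tree is grown on a bootstrap sample, roughly √p variables are considered at each split, and a new observation goes to the majority-vote class.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
# Each of the 200 trees is grown on a bootstrap sample, and max_features="sqrt"
# considers roughly sqrt(p) randomly chosen variables at each split.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=0).fit(iris.data, iris.target)
print(rf.predict(iris.data[:3]))   # majority vote across the trees
```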
Freund and Schapire (1997) introduced the concept of boosting which aims to
reduce bias in a CT. This is achieved by refitting the data into trees with higher
weights for misclassified data points. In the initial calculation of the first CT, all
data points are given equal weight. However, the weights are updated after each
iteration and the impurity measure is updated by assigning higher weights to the
misclassified observations. Then the final classifier is selected via weighted average
of the trees.
$$H = -\sum_{g=1}^{G} P(Y = g)\log\bigl(P(Y = g)\bigr)$$

and

$$H_c = -\sum_{g=1}^{G}\sum_{l=1}^{G} P\bigl(Y = g,\ \hat Y = l\bigr)\log\Bigl(P\bigl(Y = g\mid \hat Y = l\bigr)\Bigr),$$

both of which can be estimated from the training data; the uncertainty coefficient compares these two entropies as $U = (H - H_c)/H$. Larger values of $U$ indicate a better
classifier. If the accuracy of classification for only one class is very important, then
one can calculate the sensitivity (also known in medicine as the true positive rate
or in machine learning as the recall rate) for that class. Sensitivity also takes values
between 0 and 1 but a good classifier is expected to have a higher sensitivity. The
sensitivity for class g can be calculated as,
$$sen_g = \frac{\sum_{i=1}^{n} I\bigl(y_i = g,\ \hat y_i = g\bigr)}{\sum_{i=1}^{n} I\bigl(y_i = g,\ \hat y_i = g\bigr) + \sum_{i=1}^{n} I\bigl(y_i = g,\ \hat y_i \ne g\bigr)}$$
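These performance measures are easy to compute from the true and predicted labels. A hedged Python sketch follows, using empirical plug-in estimates, with U taken as (H − Hc)/H as discussed above.

```python
import numpy as np

def sensitivity(y, y_hat, g):
    """Recall for class g: correct class-g calls over all true class-g cases."""
    tp = np.sum((y == g) & (y_hat == g))
    fn = np.sum((y == g) & (y_hat != g))
    return tp / (tp + fn)

def uncertainty_coefficient(y, y_hat):
    """U = (H - Hc) / H with the entropies estimated by plug-in frequencies."""
    classes = np.unique(np.concatenate([y, y_hat]))
    n = len(y)
    pg = np.array([np.mean(y == g) for g in classes])
    H = -np.sum(pg[pg > 0] * np.log(pg[pg > 0]))
    Hc = 0.0
    for l in classes:
        mask = y_hat == l
        if not mask.any():
            continue
        for g in classes:
            joint = np.sum((y == g) & mask) / n       # P(Y = g, Yhat = l)
            if joint > 0:
                # conditional probability P(Y = g | Yhat = l) = count / column total
                Hc -= joint * np.log(joint * n / mask.sum())
    return (H - Hc) / H

y = np.array([0, 0, 1, 1, 1, 2, 2, 2])
y_hat = np.array([0, 1, 1, 1, 2, 2, 2, 0])
print(sensitivity(y, y_hat, 1), round(uncertainty_coefficient(y, y_hat), 4))
```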
Table 1 Comparison of misclassification rates for different classifiers for five datasets
Iris Skin Glass Bivariate normal Bivariate uniform
Logistic 0.09804 0.0401 NA 0.288 0.495
LDA 0.03922 0.0505 0.2364 0.300 0.495
NB 0.03922 0.0245 0.4545 0.304 0.077
KNN 0.03922 0.0048 0.1818 0.002 0.000
SVM 0.05882 0.0051 0.3273 0.312 0.008
CT 0.07843 0.0232 0.2545 0.320 0.096
Table 2 Comparison of uncertainty measure for different classifiers for five datasets
Iris Skin Glass Bivariate normal Bivariate uniform
Logistic 0.71463 0.72119 NA 0.12654 0.00116
LDA 0.86424 0.71310 0.52570 0.11903 0.00116
NB 0.88589 0.80570 0.43788 0.11408 0.66483
KNN 0.88589 0.94258 0.62021 0.96238 1.00000
SVM 0.79531 0.93916 0.25449 0.06982 0.93602
CT 0.78020 0.80141 0.52453 0.10053 0.54546
not separable using any reasonable curve and provides a challenge in terms of
classification (see Fig. 4). Both datasets have two explanatory variables as presented
in scatterplots in Fig. 4 where each class is represented by separate point type and
color. For the NB classifier, a Gaussian prior was used. For the KNN classifier, the next larger odd integer to $\sqrt{n}$ was used as the value of $K$, except in one real data example (skin data), where this value was too large due to the large dataset. The observed
misclassification rates for six methods and five examples are listed in Table 1 and
the uncertainty coefficients are listed in Table 2.
Example 1 (Iris) Consider the famous iris dataset by R.A. Fisher (Anderson 1935)
described in the Introduction. Fifty observations are available for each type of
iris. Of the 150 measurements available, a training dataset of 99 observations was
created with 33 observations each from three groups. The misclassification rate
was estimated based on the remaining 51 observations that constituted a validation
sample. Misclassification rates lower than 0.10 (Table 1) and uncertainty measures
over 0.70 (Table 2) show that all the methods did a commendable job of correct
classification for this data. However, NB, KNN, and LDA are slightly better than
other classifiers.
Example 2 (Skin) Refer to the skin segmentation dataset from the UCI machine
learning repository (Bhatt et al. 2009). This dataset contains 245,057 observations
randomly sampled from photos of faces of people of different age group, gender,
and color. Of those, 50,589 observations are for samples of skin while the rest are
for samples of non-skin parts of the face. The three explanatory variables in this
example are RGB triplet, i.e., red, green, and blue colors used in displaying images.
RGB values are typically given as an integer value in the range of 0–255, and
combined together they determine the color of the image which in this case is part
sampled. Distribution of RGB pixels for skin data is presented in a 3-dimensional
plot in Fig. 5. Without rotating the plot around three axes it is difficult to tell if
there is clear distinction or some overlap between two groups, skin and non-skin. A
training sample of size 150,000 was used, out of which 30,000 were skin samples.
Tables 1 and 2 show that KNN and SVM perform best for this data, followed by NB
and CT. To save computation time, K = 19 was used for KNN algorithm.
Example 3 (Glass) Consider the glass dataset from UCI machine learning repos-
itory (Lichman 2013). The original dataset describes six types of glass samples
along with the refractive index and weight percent of oxides formed with sodium,
magnesium, aluminum, silicon, potassium, calcium, barium, and iron in the sample.
Since some of the classes have small sample sizes, only three types of glass
were used for classification purpose in this example. They are float-processed
building window glasses (70 measurements), nonfloat-processed building window
glasses (76 measurements), and non-window headlamp glasses (29 measurements).
Fifty samples each from two classes of building window glasses and 20 samples
from headlamp were used as the training data. Simulations to obtain parameters
for a multinomial logistic regression failed due to non-convergence of iterations.
Outcomes in Tables 1 and 2 indicate that KNN performs the best followed by CT
and LDA.
Example 4 (Bivariate Uniform) Consider two independent univariate uniform
distributions, namely Xi ∼ Uniform(−1.25, 1.25) for i = 1, 2. A sample of 3000
observations was generated with seed 1234. With the unit circle providing the class boundary, the $i$-th observation is assigned to group 1 if $x_{1i}^2 + x_{2i}^2 \le 1$ and to group 2
otherwise. The first 2000 observations generated were used as a training data and the
remaining 1000 as a validation sample. In this training dataset, 1012 observations
were from group 1 and the remaining 998 from group 2. In the validation dataset,
520 observations were from group 1 and the remaining 480 from group 2. Note that
in this situation a linear classifier is not supposed to perform well because of non-
normal distributions and that is reflected in the misclassification rates listed in Table
1 and uncertainty measures listed in Table 2. Logistic regression and LDA seems
to be only as good as a coin toss in this situation whereas KNN and SVM perform
admirably well.
Example 5 (Bivariate Normal) Now consider the bivariate normal populations. A
sample of 1500 observations from two homoscedastic bivariate normal distributions
that differ only in mean vector was generated using the mvtnorm package in R
with seed 5678, resulting in a total sample of size 3000. The difference in the
mean vectors and the variance–covariance matrix used in the simulation were,
respectively,
$$\mu_1 - \mu_2 = \begin{pmatrix}1.0\\ 1.0\end{pmatrix}\qquad\text{and}\qquad \Sigma = \begin{pmatrix}1.0 & 0.5\\ 0.5 & 1.0\end{pmatrix}.$$
The first 1000 rows were used as a training sample for both groups (resulting in a
total sample size of 2000) and the remaining 500 as the validation sample (resulting
total sample size 1000). Tables 1 and 2 show KNN as a clear winner while the other
classifiers are almost equally bad. Although we expected LDA to perform better, to
our surprise the results say otherwise.
5 Concluding Remarks
In this paper, the basic ideas that dominate the world of statistical classification were
described. Detailed discussions of them are scattered in different textbooks, but none
discusses them all together. For example, logistic regression is discussed in detail
by Kleinbaum and Klein (2010), LDA by McLachlan (2004), SVM by Steinwart
and Christmann (2008), classification trees by Breiman et al. (1984), and different
classification methods by Izenman (2008) and James et al. (2013).
For data with highly correlated explanatory variables or a large number of
explanatory variables, the use of some dimension reduction technique such as
principal component analysis, low variance filter, and high correlation filter before
classification is recommended (Farcomeni and Greco 2015). In cases where p > n,
dimension reduction becomes necessary. Alternatively, although the random forests
method is not a dimension reduction technique for explanatory variables, in cases
where $\sqrt{p} < n$ it can be used effectively without reducing the dimension of the
explanatory variables.
A basic question on this topic is whether any particular classification method
should be preferred over the others. The answer depends on the circumstances.
There is no single method that stands out as the best. Typically for complex
problems in which the misclassification rate is higher among all classifiers, the
use of soft classifiers is recommended. However, hard classifiers remain popular
as their outcomes are easier to interpret in practice. Also hard classifiers like SVM
and KNN generally provide good outcomes as seen from the situations discussed
here. In this age of computation, the most recent research emphasis is on effective
ways of implementing bagging and random forests (James et al. 2013) which can
be computationally more effective than other classifiers like KNN. Liu et al. (2011)
describe a large-margin unified machine that combines margin-based hard and
soft classifiers, and show that hard classifiers tend to perform better than soft
classifiers either when the classes are easily separable or when the training sample
size is relatively small compared to the number of explanatory variables.
Research over the years has led to the development of many classifiers. As a
result, the toolbox from which a classifier can be chosen provides an extensive list of
options, which to some extent depends on the software used by, and the computing
power available to, the researcher. The comparative performance of different
classifiers also changes with technology, and past studies might lead to different
conclusions if repeated with current technology. Thus, one can entertain the idea
of using all available classifiers and assigning a new observation to the class chosen
by the majority of them.
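A minimal sketch of that majority-vote idea in base R; the classifier names in the commented call are placeholders for whatever predicted-label vectors are available.

```r
## Assign each observation to the class chosen by most classifiers.
majority_vote <- function(...) {
  votes <- cbind(...)   # one column of predicted labels per classifier
  apply(votes, 1, function(v) names(which.max(table(v))))
}

## e.g., with three equal-length vectors of predicted labels:
## combined <- majority_vote(knn = as.character(knn_pred),
##                           lda = as.character(lda_pred),
##                           svm = as.character(svm_pred))
```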
References
Hamming, R.W.: Error detecting and error correcting codes. Bell Syst. Tech. J. 29(2), 147–160
(1950)
Hand, D.J.: Classifier technology and the illusion of progress. Stat. Sci. 21, 1–14 (2006)
Hand, D.J.: Assessing the performance of classification methods. Int. Stat. Rev. 80, 400–414 (2012)
Hand, D.J., Yu, K.: Idiot’s Bayes - not so stupid after all? Int. Stat. Rev. 69(3), 385–399 (2001)
Har-Peled, S., Indyk, P., Motwani, R.: Approximate nearest neighbor: towards removing the curse
of dimensionality. Theory Comput. 8, 321–350 (2012)
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining,
Inference and Prediction. Springer, New York (2001)
Hosmer, T., Hosmer, D.W., Fisher, L.L.: A comparison of the maximum likelihood and dis-
criminant function estimators of the coefficients of the logistic regression model for mixed
continuous and discrete variables. Commun. Stat. 12, 577–593 (1983)
Hsu, C.W., Lin, C.J.: A comparison of methods for multi-class support vector machines. IEEE
Trans. Neural Netw. 13(2), 415–425 (2002)
Izenman, A.J.: Modern Multivariate Statistical Techniques: Regression, Classification and Mani-
fold Learning. Springer, New York (2008)
James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning: With
Applications in R. Springer, New York (2013)
Kiang, M.: A comparative assessment of classification methods. Decis. Support. Syst. 35, 441–454
(2003)
Kleinbaum, D.G., Klein, M.: Logistic Regression: A Self-learning Text, 3rd edn. Springer, New
York (2010)
Kressel, U.H.G.: Pairwise classification and support vector machines. In: Advances in Kernel
Methods: Support Vector Learning, pp. 255–268. MIT Press, Cambridge (1998)
Lange, K.: MM Optimization Algorithms. SIAM, Philadelphia (2016)
Lichman, M.: UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. University of
California, Irvine (2013)
Liu, W.Z., White, A.P.: A comparison of nearest neighbor and tree-based methods of non-
parametric discriminant analysis. J. Stat. Comput. Simul. 53, 41–50 (1995)
Liu, Y., Zhang, H.H., Wu, Y.: Hard or soft classification? Large-margin unified machines. J. Am.
Stat. Assoc. 106(493), 166–177 (2011)
Loh, W.Y.: Improving the precision of classification trees. Ann. Appl. Stat. 3, 1710–1737 (2009)
Mahalanobis, P.C.: On the generalized distance in statistics. Proceedings of the National Institute
of Science in India, 2(1), 49–55, (1936)
Mai, Q., Zou, H.: Semiparametric sparse discriminant analysis in ultra-high dimensions. J.
Multivar. Anal. 135, 175–188 (2015)
Mantel, N.: Models for complex contingency tables and polychotomous response curves. Biomet-
rics. 22, 83–110 (1966)
McLachlan, G.J.: Discriminant Analysis and Statistical Pattern Recognition. Wiley-Interscience,
New York (2004)
McLachlan, G.J., Byth, K.: Expected error rates for logistic regression versus normal discriminant
analysis. Biom. J. 21, 47–56 (1979)
Menard, S.: Applied Logistic Regression Analysis, 2nd edn. Sage Publications, Thousand Oaks
(2002)
Meshbane, A., Morris, J.D.: A method for selecting between linear and quadratic classification
models in discriminant analysis. J. Exp. Educ. 63(3), 263–273 (1996)
Messenger, R., Mandell, L.: A modal search technique for predictive nominal scale multivariate
analysis. J. Am. Stat. Assoc. 67, 768–772 (1972)
Mills, P.: Efficient statistical classification of satellite measurements. Int. J. Remote Sens. 32, 6109–
6132 (2011)
Mulekar, M.S., Brown, C.S.: Distance and similarity measures. In: Alhaji, R., Rekne, J. (eds.)
Encyclopedia of Social Network and Mining (ESNAM), pp. 385–400. Springer, New York
(2014)
Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge (2012)
Nigam, K., McCallum, A., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled
documents using EM. Mach. Learn. 39(2-3), 103–134 (2000)
Nilsson, N.: Learning Machines: Foundations of Trainable Pattern-Classifying Systems. McGraw-
Hill, New York (1965)
Quenouille, M.H.: Problems in plane sampling. Ann. Math. Stat. 20(3), 355–375 (1949)
Quenouille, M.H.: Notes on bias in estimation. Biometrika. 43(3-4), 353–360 (1956)
Rao, R.C.: The utilization of multiple measurements in problems of biological classification. J. R.
Stat. Soc. Ser. B. 10(2), 159–203 (1948)
Rish, I.: An empirical study of the naive Bayes classifier. In: IJCAI Workshop on Empirical
Methods in AI, Sicily, Italy (2001)
Samworth, R.J.: Optimal weighted nearest neighbour classifiers. Ann. Stat. 40(5), 2733–2763
(2012)
Schölkopf, B., Sung, K., Burges, C., Girosi, F., Niyogi, P., Poggio, T., Vapnik, V.: Comparing
support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Trans.
Signal Process. 45, 2758–2765 (1997)
Sebestyen, G.S.: Decision-making Process in Pattern Recognition. McMillan, New York (1962)
Snell, E.J.: A scaling procedure for ordered categorical data. Biometrics. 20, 592–607 (1964)
Soria, D., Garibaldi, J.M., Ambrogi, F., Biganzoli, E.M., Ellis, I.O.: A non-parametric version of
the naive Bayes classifier. Knowl.-Based Syst. 24(6), 775–784 (2011)
Srivastava, S., Gupta, M.R., Frigyik, B.A.: Bayesian quadratic discriminant analysis. J. Mach.
Learn. Res. 8, 1287–1314 (2007)
Steel, S.J., Louw, N., Leroux, N.J.: A comparison of the post selection error rate behavior of the
normal and quadratic linear discriminant rules. J. Stat. Comput. Simul. 65, 157–172 (2000)
Steinwart, I., Christmann, A.: Support Vector Machines. Springer, New York (2008)
Theil, H.: A multinomial extension of the linear logit model. Int. Econ. Rev. 10(3), 251–259 (1969)
Tjalling, J.Y.: Historical development of the Newton-Raphson method. SIAM Rev. 37(4), 531–551
(1995)
Tukey, J.W.: Bias and confidence in not quite large samples. Ann. Math. Stat. 29(2), 614–623
(1958)
Vapnik, V., Lerner, A.: Pattern recognition using generalized portrait method. Autom. Remote.
Control. 24, 774–780 (1963)
Wahba, G.: Soft and hard classification by reproducing Kernel Hilbert space methods. Proc. Natl.
Acad. Sci. 99, 16524–16530 (2002)
Witten, I., Frank, E., Hall, M.: Data Mining. Morgan Kaufmann, Burlington (2011)
A Doubly-Inflated Poisson Distribution
and Regression Model
1 Introduction
The Poisson distribution is the standard choice for modeling probabilities of count
data and can be modeled against covariates using the generalized linear models
framework (McCulloch and Searle 2001). A common extension of these models
is to account for over-dispersion, or situations where the count variability exceeds
the count mean, see Cameron and Trivedi (1998) for a summary of approaches to
verify and analyze over-dispersed count data. Another extension of the basic Poisson
model is to account for excess zeros. Both Cohen (1963) and Johnson and Kotz
(1969) describe the zero-inflated Poisson (ZIP) distribution, and Lambert (1992)
extended this distribution to account for covariates that could simultaneously model
the counts as well as the probability of particular zeros being in excess. More details
on Poisson regression models and zero-inflated models are provided by Hall (2000),
Bae et al. (2005), Coxe et al. (2009), and Hall and Shen (2010).
One example of count data is a patient’s length of stay (LOS) before hospital
discharge, which is often used as a simple count measure of the cost of care
(Marazzi et al. 1998). The use of LOS as a measure of cost assumes that a patient
with a larger LOS is of more cost to a hospital than a patient with a shorter LOS.
This measure is a natural candidate for a ZIP model, as many patients are served in
outpatient scenarios for which LOS is recorded as zero (LOS = 0). However, other
frequencies can be inflated as well. Consider the data presented in Table 1, which
shows sample LOS data. While the frequency of zero-valued LOS is certainly
high, we notice that LOS = 3 is also inflated relative to all other values. This may
be due to patients staying three nights at the hospital to receive inpatient treatment
followed by recovery time. As another example, consider the dental epidemiology
data presented in Bohning et al. (1997, 1999). Here, the decayed, missing, filled teeth
(DMFT)-index is used to measure the dental status of 1013 school children of age
7, which is a count of the number of decayed, missing, or filled teeth. This study
focused on the eight deciduous molars, and the DMFT was measured at baseline
and at 1 year. The goal of the study was to compare the absolute change in DMFT,
$\delta(DMFT) = |DMFT_1 - DMFT_2|$, between four types of decay prevention, their
combination, and a control, in the presence of other covariates. In this case zero was
inflated as most of the children exhibited no change in dental status, and as such the
zero count—corresponding to no improvement and/or consistent dental care—was
inflated. However, $\delta(DMFT)$ scores of one were also inflated; a count of one
corresponds to children who showed improvement in only one cavity.
In both the length of stay and dental epidemiology examples, there were inflated
instances of count k > 0 in addition to an inflation of zeros. As ZIP models can
only account for excess zeros, here we introduce a doubly-inflated Poisson (DIP)
distribution. This distribution can probabilistically model excessive zeros as well as
excess counts at some positive integer value k. This model can also be adapted to
the generalized linear models framework to account for covariates in a regression-
like setting in a similar manner as the ZIP model. The added advantage of the DIP
is that it can model covariates against the preponderance of both excess zeros and
k values in addition to the count outcome. Lin and Tsai (2013) studied excessive
zero and k responses in the context of health survey data. However, they did not
consider method of moments estimation and comparisons with maximum likelihood
using asymptotic relative efficiency criteria. They also did not derive analytically the
elements of the Fisher information matrix, as we do in this chapter.
This chapter is organized as follows. The DIP distribution is presented in
Sect. 2. In addition to the distributional form and parameterization of the DIP
model, likelihood and moment-based estimation processes are outlined and their
asymptotic efficiencies are compared. The hospital LOS data are used to exemplify
this distribution. The DIP model is then extended to the generalized linear model
framework to incorporate covariates.
2.1 Parameterization
The log-likelihood function of the $DIP(p, \lambda)$ model given in (1) can be written as
$$\ell(p, \lambda \mid y) = n_0 \log[f(0; p, \lambda)] + n_k \log[f(k; p, \lambda)] + \sum_{\{i:\, y_i \neq 0, k\}} \log[f(y_i; p, \lambda)], \qquad (2)$$
where $n_0$ represents the number of zero counts and $n_k$ represents the number of
$k$ counts. Differentiating the log-likelihood function (2) with respect to $p$ and
$\lambda$ (assuming $k$ is known) and solving the resulting score functions for those
parameters yields the maximum likelihood (ML) estimates $(\hat{p}, \hat{\lambda})$. The Newton-Raphson
algorithm can be used to find numerical solutions, as the score equations are
not in closed form. The asymptotic variances $\sigma^2(\hat{p})$ and $\sigma^2(\hat{\lambda})$ of the maximum
likelihood estimates $\hat{p}$ and $\hat{\lambda}$, respectively, are obtained by taking the diagonal
elements of the inverse of the Fisher information matrix. The analytical formulas for
the elements of the Fisher information matrix are given in Appendix "Information
Matrix".
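As a hedged illustration, the maximization can also be done with a general-purpose optimizer in R rather than hand-coded Newton-Raphson. The sketch below reads the DIP mass off the terms appearing inside the log-likelihood (with $q = 1 - p$): $P(Y=0) = p^2 + q^2 e^{-\lambda}$, $P(Y=k) = 2pq + q^2 e^{-\lambda}\lambda^k/k!$, and $q^2$ times the Poisson mass otherwise; the function and starting values are illustrative choices.

```r
## Log-likelihood (2) for the DIP(p, lambda) model, k assumed known.
dip_loglik <- function(par, y, k) {
  p <- par[1]; lam <- par[2]; q <- 1 - p
  f0 <- p^2 + q^2 * dpois(0, lam)          # mass at zero
  fk <- 2 * p * q + q^2 * dpois(k, lam)    # mass at the inflated count k
  sum(ifelse(y == 0, log(f0),
       ifelse(y == k, log(fk),
              2 * log(q) + dpois(y, lam, log = TRUE))))
}

## ML estimates via box-constrained quasi-Newton; the inverse of the
## negated Hessian approximates the asymptotic covariance matrix.
dip_mle <- function(y, k) {
  start <- c(0.2, mean(y[y != 0 & y != k]))
  optim(start, dip_loglik, y = y, k = k, method = "L-BFGS-B",
        lower = c(1e-6, 1e-6), upper = c(1 - 1e-6, Inf),
        control = list(fnscale = -1), hessian = TRUE)
}
```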
Method of Moments
To obtain moment estimates for $p$ and $\lambda$, we note that the first two population
moments are $\mu_1 = E(Y) = 2pqk + q^2\lambda$ and $\mu_2 = E(Y^2) = 2pqk^2 + q^2(\lambda + \lambda^2)$.
We equate these to their corresponding sample moments $\bar{y}_1 = \frac{1}{n}\sum_{i=1}^{n} y_i$ and
$\bar{y}_2 = \frac{1}{n}\sum_{i=1}^{n} y_i^2$, which yields the following two equations:
$$\bar{y}_1 = 2pqk + q^2\lambda, \qquad \bar{y}_2 = 2pqk^2 + q^2(\lambda + \lambda^2). \qquad (3)$$
Solving Eq. (3) numerically for $p$ and $\lambda$ using Newton-Raphson yields moment
estimators $(\tilde{p}, \tilde{\lambda})$ for $p$ and $\lambda$, respectively. If we let
$$D = \begin{pmatrix} \dfrac{\partial\mu_1}{\partial p} & \dfrac{\partial\mu_1}{\partial\lambda} \\[2ex] \dfrac{\partial\mu_2}{\partial p} & \dfrac{\partial\mu_2}{\partial\lambda} \end{pmatrix},$$
then the asymptotic covariance matrix for the moment estimators $(\tilde{p}, \tilde{\lambda})$ is given
by $A = D^{-1}\,\Sigma\,(D^T)^{-1}$, where $\Sigma$ is the covariance matrix of $(Y, Y^2)$ with elements given
in the Appendix (see Theorem A.1 in Chaganty and Shi (2004)). The
diagonal elements of the covariance matrix $A$ are the asymptotic variances $\sigma^2(\tilde{p})$ and
$\sigma^2(\tilde{\lambda})$ of the moment estimators $\tilde{p}$ and $\tilde{\lambda}$, respectively.
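A small Newton-Raphson sketch for the moment equations in (3), assuming $k$ is known; the Jacobian entries are the partial derivatives that make up $D$ above, and the starting values are an illustrative assumption.

```r
dip_mom <- function(y, k, start = c(0.2, mean(y)), tol = 1e-8) {
  m1 <- mean(y); m2 <- mean(y^2)
  th <- start                               # th = (p, lambda)
  for (it in 1:100) {
    p <- th[1]; lam <- th[2]; q <- 1 - p
    g <- c(2 * p * q * k   + q^2 * lam           - m1,    # moment equations
           2 * p * q * k^2 + q^2 * (lam + lam^2) - m2)
    J <- matrix(c(2 * k   * (1 - 2 * p) - 2 * q * lam,
                  q^2,
                  2 * k^2 * (1 - 2 * p) - 2 * q * (lam + lam^2),
                  q^2 * (1 + 2 * lam)),
                nrow = 2, byrow = TRUE)     # D evaluated at (p, lambda)
    step <- solve(J, g)
    th <- th - step
    if (max(abs(step)) < tol) break
  }
  setNames(th, c("p", "lambda"))
}
```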
Table 2 Asymptotic relative efficiencies for DIP (p, λ) model for inflated 0’s and 3’s
       $e(\tilde{p}, \hat{p})$                              $e(\tilde{\lambda}, \hat{\lambda})$
p λ=3 λ=5 λ=7 λ=9 λ=3 λ=5 λ=7 λ=9
0.1 0.0766 0.0859 0.4343 0.6629 0.7709 0.2219 0.5807 0.6988
0.2 0.2926 0.2592 0.6384 0.8174 0.2930 0.3895 0.6172 0.6973
0.3 0.0819 0.4404 0.7524 0.8854 0.2784 0.4869 0.6215 0.6712
0.4 0.2594 0.5853 0.8205 0.9223 0.4875 0.5419 0.6089 0.6302
0.5 0.4475 0.6953 0.8659 0.9454 0.5946 0.5746 0.5892 0.5829
0.6 0.6106 0.7808 0.9007 0.9614 0.6555 0.5951 0.5678 0.5352
0.7 0.7427 0.8499 0.9297 0.9735 0.6938 0.6086 0.5469 0.4907
0.8 0.8478 0.9075 0.9551 0.9835 0.7197 0.6180 0.5276 0.4495
0.9 0.9319 0.9569 0.9784 0.9921 0.7384 0.6248 0.5101 0.4132
Table 3 Asymptotic relative efficiencies for DIP (p, λ) model for inflated 0’s and 6’s
       $e(\tilde{p}, \hat{p})$                              $e(\tilde{\lambda}, \hat{\lambda})$
p λ=3 λ=5 λ=7 λ=9 λ=3 λ=5 λ=7 λ=9
0.1 0.1877 0.0152 0.0023 0.1468 0.3659 0.6813 0.0178 0.3197
0.2 0.3803 0.0274 0.1170 0.3544 0.5086 0.6890 0.3013 0.4399
0.3 0.5413 0.1692 0.2978 0.5081 0.4549 0.6916 0.4411 0.4854
0.4 0.6653 0.3464 0.4619 0.6211 0.3350 0.6928 0.5071 0.5055
0.5 0.7617 0.5090 0.5969 0.7105 0.2449 0.6934 0.5442 0.5158
0.6 0.8363 0.6461 0.7074 0.7851 0.1866 0.6939 0.5676 0.5216
0.7 0.8939 0.7595 0.7991 0.8491 0.1486 0.6942 0.5837 0.5251
0.8 0.9382 0.8537 0.8764 0.9053 0.1229 0.6944 0.5953 0.5273
0.9 0.9727 0.9328 0.9426 0.9552 0.1047 0.6945 0.6041 0.5288
Fig. 1 Log-likelihood of the LOS data using the DIP model with parameters (p, λ)
the target parameters. However, in this case we see that the efficiencies for λ here
are never larger than 0.7, implying that the moment estimators are not nearly as
efficient as the ML estimators.
Returning to the LOS data from Table 1, recall that there were inflated counts
of patients received in outpatient settings ($n_0 = 55$), as well as inflated
counts of patients receiving inpatient care for 3 days ($n_3 = 75$). This phenomenon
of double inflation makes the $DIP(p, \lambda)$ a natural choice to model the inflation
and count parameters. The log-likelihood for this model is presented in Fig. 1.
$$\ell(p_i, \lambda_i \mid y) = \sum_{\{i:\, y_i = 0\}} \log\left[p_i^2 + q_i^2 \exp(-\lambda_i)\right] + \sum_{\{i:\, y_i = k\}} \log\left[2 p_i q_i + q_i^2\, \frac{\exp(-\lambda_i)\,\lambda_i^k}{k!}\right] + \sum_{\{i:\, y_i \neq 0, k\}} \log\left[q_i^2\, \frac{\exp(-\lambda_i)\,\lambda_i^{y_i}}{y_i!}\right], \qquad (6)$$
Expressed in terms of the regression parameters $\gamma$ and $\beta$ through the link functions, the log-likelihood becomes
$$\ell(\gamma, \beta) = -2\sum_{i=1}^{n} \log\left[1 + \exp(x_i^{p\prime}\gamma)\right] + \sum_{\{i:\, y_i = 0\}} \log\left[\exp(2 x_i^{p\prime}\gamma) + \exp\left(-\exp(x_i^{\lambda\prime}\beta)\right)\right] + \sum_{\{i:\, y_i = k\}} \log\left[2\exp(x_i^{p\prime}\gamma) + \frac{\exp\left(k x_i^{\lambda\prime}\beta - \exp(x_i^{\lambda\prime}\beta)\right)}{k!}\right] + \sum_{\{i:\, y_i \neq 0, k\}} \left[y_i\, x_i^{\lambda\prime}\beta - \exp(x_i^{\lambda\prime}\beta) - \log(y_i!)\right].$$
{i:yi =0,k}
The maximum likelihood estimate $(\hat{\gamma}, \hat{\beta})$ is the solution to the equations
$$\frac{\partial \ell(\gamma, \beta)}{\partial \gamma} = 0, \qquad \frac{\partial \ell(\gamma, \beta)}{\partial \beta} = 0.$$
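A hedged sketch of maximizing this log-likelihood numerically follows; Xp and Xl stand for assumed design matrices for $p$ and $\lambda$, and $k$ is the known second inflation point. The sketch exploits the fact that the $y_i = k$ and $y_i \neq 0, k$ terms reduce to Poisson masses.

```r
dip_reg_loglik <- function(theta, y, Xp, Xl, k) {
  g <- theta[seq_len(ncol(Xp))]            # gamma
  b <- theta[-seq_len(ncol(Xp))]           # beta
  eg  <- drop(Xp %*% g)                    # linear predictor for p
  lam <- exp(drop(Xl %*% b))               # Poisson mean per observation
  ll <- -2 * sum(log1p(exp(eg)))
  ll <- ll + sum(log(exp(2 * eg[y == 0]) + exp(-lam[y == 0])))
  ll <- ll + sum(log(2 * exp(eg[y == k]) + dpois(k, lam[y == k])))
  oth <- y != 0 & y != k
  ll + sum(dpois(y[oth], lam[oth], log = TRUE))
}

## e.g., fit <- optim(rep(0, ncol(Xp) + ncol(Xl)), dip_reg_loglik,
##                    y = y, Xp = Xp, Xl = Xl, k = 3,
##                    control = list(fnscale = -1), hessian = TRUE)
```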
covariate patterns (e.g., factorial experiments). As shown in Table 5, here our data
consist of the frequencies $n_{0l}, n_{1l}, \ldots, n_{ml}$, where $n_{jl}$ represents the frequency
of count $j$ ($j = 0, \ldots, m$), $m$ is the largest observed count, and
$n = \sum_{l=1}^{s}\sum_{j=0}^{m} n_{jl}$.
The binomial proportion pl and Poisson mean λl can be parameterized by using
the logit and log link functions similar to that provided in (5) as follows:
$$\mathrm{logit}(p_l) = x_l^{p\prime}\gamma \quad \text{and} \quad \log(\lambda_l) = x_l^{\lambda\prime}\beta, \qquad (7)$$
where $x_l^p$ is the set of covariates in the $l$th covariate pattern used to model $p$, $x_l^\lambda$ is
the set of covariates in the $l$th covariate pattern used to model $\lambda$, and where $\gamma$ and
β are again the vectors of corresponding regression coefficients. The log-likelihood
function then becomes
$$\ell(p_l, \lambda_l \mid n_{jl}) = \sum_{l} n_{0l} \log\left[p_l^2 + q_l^2 \exp(-\lambda_l)\right] + \sum_{l} n_{kl} \log\left[2 p_l q_l + q_l^2\, \frac{\exp(-\lambda_l)\,\lambda_l^k}{k!}\right] + \sum_{l} \sum_{\substack{j=1 \\ j \neq k}}^{m} n_{jl} \log\left[q_l^2\, \frac{\exp(-\lambda_l)\,\lambda_l^j}{j!}\right],$$
$$\ell(\gamma, \beta) = -2 \sum_{l} \sum_{j=0}^{m} n_{jl} \log\left[1 + \exp(x_l^{p\prime}\gamma)\right] + \sum_{l} n_{0l} \log\left[\exp(2 x_l^{p\prime}\gamma) + \exp\left(-\exp(x_l^{\lambda\prime}\beta)\right)\right] + \sum_{l} n_{kl} \log\left[2 \exp(x_l^{p\prime}\gamma) + \frac{\exp\left(k x_l^{\lambda\prime}\beta - \exp(x_l^{\lambda\prime}\beta)\right)}{k!}\right] + \sum_{l} \sum_{\substack{j=1 \\ j \neq k}}^{m} n_{jl} \left[j\, x_l^{\lambda\prime}\beta - \exp(x_l^{\lambda\prime}\beta) - \log(j!)\right].$$
The ML estimates $(\hat{\gamma}, \hat{\beta})$ are again found as the solution to the first-order equations
$$\frac{\partial \ell(\gamma, \beta)}{\partial \gamma} = 0, \qquad \frac{\partial \ell(\gamma, \beta)}{\partial \beta} = 0.$$
We now return to the dental epidemiology data (Bohning et al. 1997, 1999), a subset
of which is presented in Table 6. The aim of the study was to compare six methods of
school-based dental care: 1. oral health education, 2. enrichment of the school diet
with rice bran, 3. mouthwash with 0.2% of NaF solution, 4. oral hygiene, 5. all of
the four treatments, and 6. a standard care control. Gender and race/ethnicity groups
(White, Black, Others including predominantly Hispanic) were also considered.
Inspection across all 1013 child measures shows inflated counts at 0 and 1.
Thus, the DI P (p, λ) regression model is used to assess whether the δ(DMF T )
counts were associated with the treatment and covariates, accounting for possible
inflation at 0 or 1. For simplicity, both p and λ were modeled against the same
set of covariates (treatment, gender, and race/ethnicity). The regression parameter
estimates are provided in Table 7. We see that the combination of all four treatments
lowers the likelihood of inflated counts relative to the control, Black race/ethnicity
leads to a lower likelihood of inflated counts relative to children with Other or
Hispanic race/ethnicities, and there was no effect on the probability of inflation
due to gender. For the traditional Poisson distribution, we see that the education
treatment leads to lower expected counts than the control, and neither gender nor
race/ethnicity had a significant relationship with the expected δ(DMF T ) counts.
4 Conclusion
Information Matrix
Let $Y$ be distributed as $DIP(p, \lambda)$ given in (1). The Fisher information matrix for
this distribution is
$$I = \begin{pmatrix} I(p) & I(p, \lambda) \\ I(p, \lambda) & I(\lambda) \end{pmatrix}, \qquad (8)$$
and
$$I(\lambda) = -E\left[\frac{\partial^2 \log f(y; p, \lambda)}{\partial \lambda^2}\right] = \frac{-p^2 q^2 \exp(-\lambda)}{p^2 + q^2 \exp(-\lambda)} + \frac{q^2}{\lambda}\left(1 - \frac{\exp(-\lambda)\,\lambda^{k-1}}{(k-1)!}\right) + \frac{q^2 \exp(-\lambda)\,\lambda^k}{2pq\,(k!) + q^2 \exp(-\lambda)\,\lambda^k}\left(\frac{k}{\lambda} - 1\right)^2 q^2\, \frac{\exp(-\lambda)\,\lambda^k}{k!} - q^2\, \frac{\exp(-\lambda)\,\lambda^k}{k!}\left[\left(\frac{k}{\lambda} - 1\right)^2 - \frac{k}{\lambda^2}\right]. \qquad \text{(9c)}$$
where
$$\mathrm{Var}(Y) = 2pqk^2 + q^2(\lambda^2 + \lambda) - \left[2pqk + q^2\lambda\right]^2,$$
$$\mathrm{Var}(Y^2) = 2pqk^4 + q^2\left(\lambda^4 + 6\lambda^3 + 7\lambda^2 + \lambda\right) - \left[2pqk^2 + q^2(\lambda + \lambda^2)\right]^2, \quad \text{and}$$
$$\mathrm{Cov}(Y^2, Y) = \mathrm{Cov}(Y, Y^2) = 2pqk^3 + q^2\left(\lambda^3 + 3\lambda^2 + \lambda\right) - \left[2pqk + q^2\lambda\right]\left[2pqk^2 + q^2(\lambda + \lambda^2)\right].$$
The partial derivatives of the log-likelihood function for $DIP(p, \lambda)$ regression of
raw counts are
$$\frac{\partial \ell(\gamma, \beta)}{\partial \gamma} = -2\sum_{i=1}^{n} \frac{x_i^p \exp(x_i^{p\prime}\gamma)}{1 + \exp(x_i^{p\prime}\gamma)} + \sum_{\{i:\, y_i = 0\}} \frac{2 x_i^p \exp(2 x_i^{p\prime}\gamma)}{\exp(2 x_i^{p\prime}\gamma) + \exp\left(-\exp(x_i^{\lambda\prime}\beta)\right)} + \sum_{\{i:\, y_i = k\}} \frac{2 x_i^p \exp(x_i^{p\prime}\gamma)}{2\exp(x_i^{p\prime}\gamma) + \dfrac{\exp\left(k x_i^{\lambda\prime}\beta - \exp(x_i^{\lambda\prime}\beta)\right)}{k!}},$$
and
$$\frac{\partial \ell(\gamma, \beta)}{\partial \beta} = \sum_{\{i:\, y_i = 0\}} \frac{-x_i^\lambda \exp\left(x_i^{\lambda\prime}\beta - \exp(x_i^{\lambda\prime}\beta)\right)}{\exp(2 x_i^{p\prime}\gamma) + \exp\left(-\exp(x_i^{\lambda\prime}\beta)\right)} + \sum_{\{i:\, y_i = k\}} \frac{\left(k x_i^\lambda - x_i^\lambda \exp(x_i^{\lambda\prime}\beta)\right)\dfrac{\exp\left(k x_i^{\lambda\prime}\beta - \exp(x_i^{\lambda\prime}\beta)\right)}{k!}}{2\exp(x_i^{p\prime}\gamma) + \dfrac{\exp\left(k x_i^{\lambda\prime}\beta - \exp(x_i^{\lambda\prime}\beta)\right)}{k!}} + \sum_{\{i:\, y_i \neq 0, k\}} \left(y_i\, x_i^\lambda - x_i^\lambda \exp(x_i^{\lambda\prime}\beta)\right).$$
The partial derivatives of the log-likelihood function for $DIP(p, \lambda)$ regression of
grouped frequencies are
$$\frac{\partial \ell(\gamma, \beta)}{\partial \gamma} = -2\sum_{l}\sum_{j=0}^{m} n_{jl}\, \frac{x_l^p \exp(x_l^{p\prime}\gamma)}{1 + \exp(x_l^{p\prime}\gamma)} + \sum_{l} n_{0l}\, \frac{2 x_l^p \exp(2 x_l^{p\prime}\gamma)}{\exp(2 x_l^{p\prime}\gamma) + \exp\left(-\exp(x_l^{\lambda\prime}\beta)\right)} + \sum_{l} n_{kl}\, \frac{2 x_l^p \exp(x_l^{p\prime}\gamma)}{2\exp(x_l^{p\prime}\gamma) + \dfrac{\exp\left(k x_l^{\lambda\prime}\beta - \exp(x_l^{\lambda\prime}\beta)\right)}{k!}},$$
and
$$\frac{\partial \ell(\gamma, \beta)}{\partial \beta} = -\sum_{l} n_{0l}\, \frac{x_l^\lambda \exp\left(x_l^{\lambda\prime}\beta - \exp(x_l^{\lambda\prime}\beta)\right)}{\exp(2 x_l^{p\prime}\gamma) + \exp\left(-\exp(x_l^{\lambda\prime}\beta)\right)} + \sum_{l} n_{kl}\, \frac{\left(k x_l^\lambda - x_l^\lambda \exp(x_l^{\lambda\prime}\beta)\right)\dfrac{\exp\left(k x_l^{\lambda\prime}\beta - \exp(x_l^{\lambda\prime}\beta)\right)}{k!}}{2\exp(x_l^{p\prime}\gamma) + \dfrac{\exp\left(k x_l^{\lambda\prime}\beta - \exp(x_l^{\lambda\prime}\beta)\right)}{k!}} + \sum_{l}\sum_{\substack{j=1 \\ j \neq k}}^{m} n_{jl} \left(j\, x_l^\lambda - x_l^\lambda \exp(x_l^{\lambda\prime}\beta)\right).$$
References
Bae, S., Famoye, F., Wulu, J.T., Bartolucci, A.A., Singh, K.P.: A rich family of generalized Poisson
regression models with applications. Math. Comput. Simul. 69, 4–11 (2005)
Böhning, D., Dietz, E., Schlattmann, P.: Zero-inflated count models and their applications in public
health and social science. In: Applications of Latent Trait and Latent Class Models in the Social
Sciences, pp. 333–344. Wasemann, Münster (1997)
Böhning, D., Dietz, E., Schlattmann, P., Mendonça, L., Kirchner, U.: The zero-inflated Poisson
model and the decayed, missing, and filled teeth index in dental epidemiology. J. R. Stat. Soc.
162, 195–209 (1999)
Cameron, A.C., Trivedi, P.K.: Regression Analysis of Count Data. Cambridge, New York (1998)
Chaganty, N.R., Shi, G.: A note on the estimation of autocorrelation in repeated measurements.
Commun. Stat. Theory Methods 33, 1157–1170 (2004)
Cohen, A.C.: Estimation in mixtures of discrete distributions. In: Proceedings of the International
Symposium on Discrete Distributions, Montreal, pp. 373–378. Statistical Pub. Society, Calcutta
(1963)
Coxe, S., West, S., Aiken, L.S.: The analysis of count data: a gentle introduction to Poisson
regression and its alternative. J. Pers. Assess. 91, 121–136 (2009)
Hall, D.: Zero-inflated Poisson and binomial regression with random effects: a case study.
Biometrics 56, 1030–1039 (2000)
Hall, D., Shen, J.: Robust estimation for zero-inflated Poisson regression. Scand. J. Stat. 37,
237–252 (2010)
Johnson, N.L., Kotz, S.: Distributions in Statistics: Discrete Distributions. Houghton Mifflin,
Boston (1969)
Lambert, D.: Zero-inflated Poisson regression with an application to defects in manufacturing.
Technometrics 34, 1–14 (1992)
Lin, T.H., Tsai, M.-H.: Modeling health survey data with excessive zero and K responses. Stat.
Med. 32, 1572–1583 (2013)
Marazzi, A., Paccaud, F., Ruffieux, C., Beguin, C.: Fitting the distributions of length of stay by
parametric models. Med. Care 36, 915–927 (1998)
McCulloch, C.E., Searle, S.R.: Generalized, Linear, and Mixed Models. Wiley, New York (2001)
Sheth-Chandra, M.: The doubly inflated Poisson and related regression models. PhD Dissertation.
Old Dominion University, Norfolk (2011)
Multivariate Doubly-Inflated Negative
Binomial Distribution Using Gaussian
Copula
1 Introduction
$$\sum_{j_1=1}^{2}\sum_{j_2=1}^{2}\cdots\sum_{j_p=1}^{2} (-1)^{j_1 + j_2 + \cdots + j_p}\, C\left(a_{1,j_1}, a_{2,j_2}, \ldots, a_{p,j_p}\right) \geq 0.$$
$$C(u_1, u_2, \ldots, u_p \mid R(r)) = \Phi_R\left(\Phi^{-1}(u_1), \Phi^{-1}(u_2), \ldots, \Phi^{-1}(u_p)\right), \qquad (1)$$
where $\Phi^{-1}$ is the inverse CDF of a standard normal and $\Phi_R$ is the joint cumulative
distribution function of a standard multivariate normal distribution with covariance
matrix equal to the correlation matrix $R$.
Definition 3 The Gaussian copula density is defined as:
$$c(u_1, u_2, \ldots, u_p \mid R(r)) = \frac{1}{\sqrt{|R(r)|}}\exp\left(-\frac{1}{2}\, U^T \left(R(r)^{-1} - I_p\right) U\right), \qquad (2)$$
where $U = \left(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_p)\right)^T$.
(Figure: surface plot of a Gaussian copula density over $u_1$ and $u_2$)
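Equation (2) translates directly into code. Below is a minimal R sketch for evaluating the density, where U stacks the standard normal quantiles of the arguments; the example correlation matrix is an illustrative choice.

```r
dgauss_copula <- function(u, R) {
  U <- qnorm(u)                    # U = (qnorm(u1), ..., qnorm(up))
  exp(-0.5 * drop(t(U) %*% (solve(R) - diag(length(u))) %*% U)) /
    sqrt(det(R))
}

R <- matrix(c(1, 0.7, 0.7, 1), 2, 2)   # illustrative correlation matrix
dgauss_copula(c(0.3, 0.4), R)
```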
2. AR-1 structure: Under this structure, the $(i, j)$th element of $R(r)$ is given by
$r^{|i-j|}$, with $r \in (-1, 1)$. The inverse of this matrix is given below (Chaganty
1997):
$$R^{-1}(r) = \frac{1}{1 - r^2}\left(I_p - r^2 M_2 - r M_1\right), \qquad (4)$$
A copula may also be used to derive a joint distribution for discrete data. Given a
set of discrete marginal distributions $F_1(y_1 \mid \theta_1), F_2(y_2 \mid \theta_2), \ldots, F_p(y_p \mid \theta_p)$, one can
obtain the following joint probability mass function of $\mathbf{Y} = (Y_1, \ldots, Y_p)$:
$$f_C(y_1, \ldots, y_p) = \sum_{j_1=1}^{2}\sum_{j_2=1}^{2}\cdots\sum_{j_p=1}^{2} (-1)^{j_1 + j_2 + \cdots + j_p}\, C\left(u_{1,j_1}, \ldots, u_{p,j_p} \mid R(r)\right), \qquad (5)$$
where $u_{i,1} = F_i(y_i \mid \theta_i)$ and $u_{i,2} = F_i(y_i - 1 \mid \theta_i)$.
Modeling count data is a common task in statistics, and both the Poisson and negative
binomial distributions are popular choices for it. One simple approach
for introducing correlation among count variables is through common additive error
models. Kocherlakota and Kocherlakota (2001) and Johnson et al. (1997) provided
a detailed discussion of the one-factor multivariate Poisson model. Along this line
of study, Winkelmann (2000) proposed a multivariate negative binomial regression
model. In the case of zero- and doubly-inflated count data, zero-inflated and doubly-
inflated multivariate Poisson models are available (Sen et al. 2017; Agarwal et al.
2002; Lee et al. 2009). In this section, we briefly review the model construction
process. For details, refer to Sen et al. (2017).
Fig. 2 Bivariate negative binomial density using a Gaussian copula with $s_1 = 10$, $p_1 = 0.50$ and
$s_2 = 15$, $p_2 = 0.60$. (a) $r = 0.10$. (b) $r = 0.90$
where $pr_1, pr_2 \in (0, 1)$ and $q = pr_1 + pr_2 < 1$. Also, let $\mathbf{Y} = (Y_1, \ldots, Y_p)$ be a
multivariate negative binomial random variable with mass function $f_C$ constructed
using a Gaussian copula as mentioned in Eq. (5).
The distribution functions of $\mathbf{Y} \mid Z$ and $(\mathbf{Y}, Z)$ are given below:
$$f_1(\mathbf{Y} \mid Z) = \begin{cases} 1 & \text{if } z = 2,\ \mathbf{y} = (0, \ldots, 0) \\ 1 & \text{if } z = 1,\ \mathbf{y} = (k_1, \ldots, k_p) \\ f_C(y_1, \ldots, y_p) & \text{if } z = 0,\ \mathbf{y} = (y_1, \ldots, y_p) \end{cases} \qquad (8)$$
$$f_2(\mathbf{Y}, Z) = \begin{cases} pr_1 & \text{if } z = 2,\ \mathbf{y} = (0, \ldots, 0) \\ pr_2 & \text{if } z = 1,\ \mathbf{y} = (k_1, \ldots, k_p) \\ (1 - q)\, f_C(y_1, \ldots, y_p) & \text{if } z = 0,\ \mathbf{y} = (y_1, \ldots, y_p) \end{cases} \qquad (9)$$
This follows from the fact that $F_{s_1, p_1}(-1) = F_{s_2, p_2}(-1) = 0$ and the definition of
a copula. Using Eq. (10), the mass function of the bivariate doubly-inflated negative
binomial is given by:
$$f(\mathbf{y}) = \begin{cases} pr_1 + (1 - q)\, C\left((1 - p_1)^{s_1}, (1 - p_2)^{s_2}\right), & \mathbf{y} = (0, 0) \\ pr_2 + (1 - q)\, f_C(k_1, k_2), & \mathbf{y} = (k_1, k_2) \\ (1 - q)\, f_C(y_1, y_2), & \text{otherwise.} \end{cases}$$
(Figure: surface plot of the bivariate doubly-inflated negative binomial mass function, with panels titled "Marginal Distributions" showing the marginal densities of $Y_1$ and $Y_2$)
Using Eqs. (10) and (13), the expected value of the distribution can be computed
as below:
$$E(Y_i) = \sum_{y_i=0}^{\infty} y_i\, m_i(y_i) = k_i\, pr_2 + (1 - q)\sum_{y_i=0}^{\infty} y_i\, f_{NB}(y_i) = k_i\, pr_2 + (1 - q)\,\frac{p_i s_i}{1 - p_i}, \qquad (14)$$
where $N_i$ is the negative binomial random variable with parameters $p_i$ and $s_i$. Using
Eqs. (14) and (15), we can find an expression for the variance:
$$V(Y_i) = k_i^2\, pr_2 + (1 - q)\,\frac{p_i s_i (1 + p_i s_i)}{(1 - p_i)^2} - \left[k_i\, pr_2 + (1 - q)\,\frac{p_i s_i}{1 - p_i}\right]^2. \qquad (16)$$
Now we will review and use maximum likelihood techniques to estimate the
parameters of our model.
Fig. 5 Histograms of the marginal distributions for a bivariate doubly-inflated negative binomial
model with $s_1 = 10$, $s_2 = 15$, $p_1 = 0.30$, $p_2 = 0.40$ and inflation point at $k = 10$
$2p + p^2$ parameters. Hence, by (14) the likelihood function for the doubly-inflated
negative binomial using a Gaussian copula is given by:
$$L_{R(r)}(\Theta \mid Y) = \prod_{i:\, Y_i = (0, \ldots, 0)} \left[pr_1 + (1 - q)\, C\left((1 - p_1)^{s_1}, \ldots, (1 - p_p)^{s_p} \mid R(r)\right)\right] \times \prod_{i:\, Y_i = (k, \ldots, k)} \left[pr_2 + (1 - q)\, f_C(k, \ldots, k \mid R(r))\right] \times \prod_{i:\, Y_i = (y_{1i}, \ldots, y_{pi})} (1 - q)\, f_C(y_{1i}, \ldots, y_{pi} \mid R(r)). \qquad (17)$$
The maximum likelihood parameter estimates are obtained by setting the derivatives
of the log of the likelihood in (17) to zero. Applying the logarithm function to both sides
of (17), we obtain the log-likelihood.
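The log-likelihood can be evaluated directly. Below is a hedged sketch for the bivariate case ($p = 2$), with $f_C$ obtained from the rectangle sums in (5) and the Gaussian copula evaluated via mvtnorm::pmvnorm; the sketch assumes a common inflation point $(k, k)$ and a negative binomial parameterized so that $P(Y = 0) = (1 - p)^s$, matching the $(1 - p_j)^{s_j}$ terms in (17).

```r
library(mvtnorm)

## Bivariate Gaussian copula C(u1, u2 | r), with C = 0 on the boundary.
C_gauss <- function(u1, u2, r) {
  if (u1 <= 0 || u2 <= 0) return(0)
  pmvnorm(upper = qnorm(c(u1, u2)),
          corr = matrix(c(1, r, r, 1), 2, 2))[1]
}

## Copula-based bivariate NB mass via the rectangle sums in (5).
f_C <- function(y1, y2, s, p, r) {
  F1 <- pnbinom(c(y1, y1 - 1), size = s[1], prob = 1 - p[1])
  F2 <- pnbinom(c(y2, y2 - 1), size = s[2], prob = 1 - p[2])
  C_gauss(F1[1], F2[1], r) - C_gauss(F1[2], F2[1], r) -
    C_gauss(F1[1], F2[2], r) + C_gauss(F1[2], F2[2], r)
}

## Log-likelihood of (17) for an n x 2 matrix of counts y.
dinb_loglik <- function(y, k, s, p, r, pr1, pr2) {
  q <- pr1 + pr2
  sum(apply(y, 1, function(row) {
    if (all(row == 0))      log(pr1 + (1 - q) * f_C(0, 0, s, p, r))
    else if (all(row == k)) log(pr2 + (1 - q) * f_C(k, k, s, p, r))
    else                    log((1 - q) * f_C(row[1], row[2], s, p, r))
  }))
}
```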
6 Application
The DoctorAUS dataset can be found in the Ecdat package in R. The data come
from a study done at an Australian hospital from 1977 to 1978 with $n = 5190$
observations. We consider the actdays $= (0, \ldots, 14)$ and illness $= (0, \ldots, 5)$
variables. Here, actdays is the number of days of reduced activity due to illness or
injury in the previous 2 weeks for a given patient, and illness is the number of days
a given patient was sick in the previous 2 weeks. A table of counts for the two
variables is given in Table 2. From this table, there appear to be inflation points at (0, 0) and
(1, 0). In order to compute the asymptotic standard errors, we estimate $pr_1$ and $pr_2$
by the proportions of (0, 0) and (1, 0) counts rather than by maximum likelihood
estimation. The parameter estimates are given in Table 3. The results
for the three models are given in Table 4.
7 Conclusion
References
Agarwal, D.K., Gelfand, A.E., Citron-Pousty, S.: Zero-inflated models with application to spatial
count data. Environ. Ecol. Stat. 9(4), 341–355 (2002)
Antzoulakos, D.L., Philippou, A.N.: A note on the multivariate negative binomial distributions of
order k. Commun. Stat. Theory Methods 20, 1389–1399 (1991)
Bolker, B.: emdbook: Ecological Models and Data in R; R package version 1.3.9 (2016)
Brechmann, E.C., Czado, C., Kastenmeier, R., Min, A.: A mixed copula model for insurance claims
and claim sizes. Scand. Actuar. J. 2012(4), 278–305 (2011)
Chaganty, N.R.: An alternative approach to the analysis of longitudinal data via generalized
estimating equations. J. Stat. Plan. Inference 63, 39–54 (1997)
Davidon, W.C.: Variable metric method for minimization. SIAM J. Optim. 1(1), 1–17 (1991)
Doss, D.C.: Definition and characterization of multivariate negative binomial distribution. J.
Multivar. Anal. 9(3), 460–464 (1979)
Ismail, N., Faroughi, P.: Bivariate zero-inflated negative binomial regression model with applica-
tions. J. Stat. Comput. Simul. 87(3), 457–477 (2017)
Joe, H.: Asymptotic efficiency of the two-stage estimation method for copula-based models. J.
Multivar. Anal. 94, 401–419 (2005)
Joe, H.: Dependence Modeling with Copulas. Chapman & Hall/CRC Monographs on Statistics &
Applied Probability. Chapman and Hall/CRC, Boca Raton (2014)
Johnson, N.L., Kotz, S., Balakrishnan, N.: Discrete Multivariate Distributions. Wiley, New York
(1997)
Karlis, D., Ismail, N.: Bivariate Poisson and diagonal inflated bivariate Poisson regression models
in R. J. Stat. Softw. 14(10) (2005)
Karlis, D., Nikoloulopoulos, A.: Modeling multivariate count data using copulas. Commun. Stat.
Simul. Comput. 39(1), 172–187 (2009)
Kocherlakota, S., Kocherlakota, K.: Regression in the bivariate Poisson distribution. Commun.
Stat. 30(5), 815–825 (2001)
Lee, J., Jung, B.C., Jin, S.H.: Tests for zero inflation in a bivariate zero-inflated Poisson model.
Statistica Neerlandica 63, 400–417 (2009)
Nelsen, R.B.: An Introduction to Copulas. Springer Series in Statistics. Springer, Berlin (2006)
Olkin, I., Pratt, J.W.: Unbiased estimation of certain correlation coefficients. Ann. Math. Stat.
29(1), 201–211 (1958)
Sen, S., Sengupta, P., Diawara, N.: Doubly inflated Poisson model using Gaussian copula.
Commun. Stat. Theory Methods 10(0), 1–11 (2017)
Sengupta, P., Chaganty, N.R., Sabo, R.T.: Bivariate doubly inflated Poisson models with
applications. J. Stat. Theory Pract. 10(1), 202–215 (2015)
Sklar, A.: Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris 8,
229–231 (1959)
So, S., Lee, D., Jung, B.C.: An alternative bivariate zero-inflated negative binomial regression
model using a copula. Econ. Lett. 113(2), 183–185 (2011)
Song, P.X.K.: Correlated Data Analysis: Modeling, Analytics, and Applications, 1st edn., vol. 365.
Springer Series in Statistics. Springer, New York (2007)
Winkelmann, R.: Seemingly unrelated negative binomial regression. Oxford Bull. Econ. Stat.
62(4), 553–560 (2000)
Quantifying Spatio-Temporal
Characteristics via Moran’s Statistics
1 Introduction
2 Spatio-Temporal Process
where $N(\mathbb{R}^d)$ is the class of locally finite measures on $\mathbb{R}^d$ and $D$ is the domain
area. Hence $D_i^t$ denotes the area around the point $x_i^t$. Doing so, we have a locally
precise measurement of the density and neighborhoods. We then build a partition of
$D$ into subareas called the time-dependent Voronoi collection of $D$, $\{D_i^t\}$, where a finite
or "locally finite" measure is such that $\mu(D_i^t) < \infty$ for all $i \geq 1$.
Later, we consider as measure the nonhomogeneous Poisson point process with
non-constant intensity function $\lambda_D(t)$, the expected number of points in $D$ in a time
interval of length $t$, with $t \in [0, T]$, under the main assumption that points are
independent of each other. The nonhomogeneous Poisson process has the following
three important properties, as described in Baddeley et al. (2016); a small simulation
check of the last two follows the list:
• conditional property: given exactly $n$ points in a region $D_k^t$, these points are
mutually independent and each point has the same probability distribution over
$D$, with probability density $Y_{k(t)} = Y_{D_k}^t \sim f_{D_k}(t) = \lambda_{D_k}(t)/\mu_D$, where
$\mu_D = \int_D \lambda_D(t)\, dt$.
• superposition property: if $Y_1, Y_2, \ldots, Y_n$ are independent Poisson random
variables with means $E(Y_i) = \lambda_i$, $\lambda_i \in \mathbb{R}^+$, then $\sum_i Y_i \sim Poisson\left(\sum_i \lambda_i\right)$.
• random thinning property: suppose that $N \sim Poisson(\lambda)$, and that
$Y_1, Y_2, \ldots, Y_N$ are independent, identically distributed multinomial random
variables with distribution Multinomial$(p_1, p_2, \ldots, p_m)$, that is, $P\{Y_i = k\} = p_k$
for $k = 1, 2, \ldots, m$. Then the random variables $N_1, N_2, \ldots, N_m$ defined by
$N_k = \sum_{i=1}^{N} 1\{Y_i = k\}$ are independent Poisson random variables with parameters
$E(N_k) = \lambda p_k$.
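As a quick check of the superposition and thinning properties, the following base-R simulation splits Poisson counts by multinomial labels and recovers independent Poisson counts with means $\lambda p_k$; the parameter values are illustrative.

```r
set.seed(1)
lambda <- 50
p <- c(0.2, 0.3, 0.5)
N  <- rpois(1e4, lambda)                            # total counts
Nk <- t(sapply(N, function(n) rmultinom(1, n, p)))  # thinned counts per label
colMeans(Nk)        # approx lambda * p = 10, 15, 25
apply(Nk, 2, var)   # approx the same values, as Poisson implies
```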
Hence $\lambda_{D_k^t} = \lambda(\cdot, D_k^t)$ is a random variable in $[0, +\infty]$ for every $D_k^t \subseteq D$. Such
an idea is also described in Thäle and Yukich (2016) with properties of Poisson-Voronoi
generated sets.
The Voronoi cells are defined as a point process on $D_k^t \subseteq D$, each $D_k^t$ containing a
finite number of simple points; they can be viewed as a countable set of random
points with associated intensity $\lambda_D = E(N(D_i))$, the expected number of points
in $D_i$.
The cells $D_k^t$ define a sequence of original/congruent points, and for the partition
$\{D_k^t\}$ of $D$, each $D_k^t$ can be associated with the original point $x_k^t$. If $D(t)$ denotes the
collection of Voronoi cells generated at time $t$, $D(t) = \bigcup_{k(t)=1}^{n(t)} D_{k(t)}^t$, the elements of
$D(t)$ are nested within $D(t-1)$, i.e., $D(t)$ is a refinement of $D(t-1)$. They include
convex sets (Kieu et al. 2013) and differentiable manifolds with smooth boundary of
finite-dimensional Hausdorff-type measure with topology on $\mathbb{R}$.
We formalize the notion of the conditional distribution of the count given that a point
has occurred under the Palm probability/measure and define it as $P_{x_{k(t)}^t}(D_{k(t)}^t)$;
because of time dependence we consider the reduced Palm distribution $P^!_{x_{k(t)}^t}(D_{k(t)}^t)$,
denoted $Q_{x_{k(t)}^t}$, which is the conditional distribution of the count when the point at $x$ is
omitted.
The cells are such that their asymptotic mean and variance are a function of the
weighted surface (Reitzner et al. 2012).
For a random exchangeable sequence of spaces $(D_1^t, \ldots, D_{n_t}^t)$ defined at time $t \in
[t_{i-1}, t_i)$, let $Q_{k(t)}^t$ denote the empirical distribution of the sets generated from points
$x_{k(t)}^t$, $1 < k(t) < n(t)$. Let $(Y_{k(t)})$ be the sequence of counts of outcomes observed in
$(D_{k(t)}^t)$; then
$$P\left(Y_1, \ldots, Y_{n_t}\right) = \int Q_{x_i^t}\, dF(Q), \quad \text{or}$$
$$P\left(Y_{k(t)},\ 1 \leq k(t) \leq n(t)\right) = \int Q_{x(t-1)}\left(D_{k(t)}\right) dF(Q),$$
$$P\left(Y_{k(t)} \mid Y_{k(t-1)}\right) = \int \prod_{1 \leq k(t) \leq n(t)} Q_{x_{k(t)}}\left(D_k^t(t)\right) dF(Q).$$
For a time period of length $T$ partitioned into $m$ time subintervals $t_0 = 0 < t_1 <
t_2 < \cdots < t_r < \cdots < t_m$, consider $[t_{r-1}, t_r)$, $r = 0, 1, \ldots, m$, $0 \leq t_r \leq T$. The
numbers of random occurrences or points generated within $(t_{r-1}, t_r)$ in state space $D$
are tied to locations, which are sequences of nested counts of the total number of points
generated at time tr , denoted n(tr ), and sequence of points k(t1 ), k(t2 ), . . . , k(tm ),
where 1 ≤ k(t1 ) ≤ n(t1 ); 1 ≤ k(t2 ) ≤ n(t2 ); . . . ; 1 ≤ k(tr ) ≤ n(tr ); . . . ; 1 ≤
k(tm ) ≤ n(tm ), and n(tr ) represents the total number of points generated at time
tr within specific locations. The mathematics then requires notation capable of
representing outcomes across multiple time steps and subareas. The associated
count within $D_k^t$ defines a sequence $\{Y^{t_r}\}$ of numbers of occurrences within $[t_{r-1}, t_r)$
in subarea $D_{k(t)}^t$ and follows the Markov property, i.e., given the current state of the
system, we can make predictions about the future state without regard for previous
states. This is a discrete-time finite Markov chain.
To minimize complexity, we define the count process $\{Y^t, t \geq 0\}$ of occurrences
between consecutive times $t_{r-1}$ and $t_r$ to be Poisson, where $\lambda_D(t)$ denotes the local
intensity for any $t$ within $[t_{r-1}, t_r)$. Such a process
has stationary transition probabilities and a transition matrix of the counts $Q = (Q_{ij})$
constructed from the transition probabilities $P_{ij}(s)$, where $Q_{ij}$ is the $(i, j)$ count
from $P_{ij}(s)$ over consecutive time periods $t_1, t_2, \ldots, t_m$, adjusted for the subareas
containing locations $i$ and $j$, and $Q_{ii} = 0$, $\forall\, i \in S$, at each time period.
We extend the $D_k^t$ so that they form a nested sequence $D_{k(t_r)}^{t_r} \subset D_{k(t_{r-1})}^{t_{r-1}}$, the
subarea within time $(t_{r-1}, t_r)$ with count $Y_{k(t)}^t$, $1 \leq k(t) \leq n(t)$. Following Resnick
(2002), we can define, for the nested subareas $D_{k(t)}^t$ and each fixed time $t$, a sequence
$\{E_{k(t)}^t\}_{k \geq 0}$ of independent and identically distributed exponential random variables.
By discretizing time, we then define two finite sequences $\{(Y^{t_r}), (t_r)\}$, where
$Y^t = \sum_{k(t)=1}^{n(t)} Y_{k(t)}^t$ with locations $Y_{k(t)}^t$ for $t_n < t < t_{n+1}$, and $\{t_r\}$, $r = 1, \ldots, m$,
is the sequence of times when the process is observed. The sequence $\{Y^t\}$ is the
cumulative count of occurrences within time interval $[t_{r-1}, t_r)$ in a subset of the space $D$
such that $t_r - t_{r-1}$ is conditionally independent and exponential given $Y^{t_{r-1}}$. More
precisely, the sequence $\{Y_{k(t)}^t\}$ is defined as follows:
$$Y_1^{t_1}, \ldots, Y_{k(t_1)}^{t_1}, \ldots, Y_{n(t_1)}^{t_1}, \quad \text{and} \quad Y^{t_1} = \sum_{k(t_1)=1}^{n(t_1)} Y_{k(t_1)}^{t_1};$$
$$\vdots$$
$$Y_1^{t_r}, \ldots, Y_{k(t_r)}^{t_r}, \ldots, Y_{n(t_r)}^{t_r}, \quad \text{and} \quad Y^{t_r} = \sum_{k(t_r)=1}^{n(t_r)} Y_{k(t_r)}^{t_r};$$
$$\vdots$$
$$Y_1^{t_m}, \ldots, Y_{n(t_m)}^{t_m}, \quad \text{and} \quad Y^{t_m} = \sum_{k(t_m)=1}^{n(t_m)} Y_{k(t_m)}^{t_m}.$$
3 Moran’s Statistics
Using Eq. (1), Vaillant et al. (2011) modeled the propagation of sugarcane yellow
leaf virus with a focus on the spatial spread of disease over six time periods (weeks
6, 10, 14, 19, 23, and 30 in the growing season). For each pair of consecutive
observation dates $(t_{i-1}, t_i)$, $i = 1, 2, \ldots, 6$, they defined a Moran-type index based
on a nearest neighbor scheme, where $D$ denotes the discrete set of plant locations, $T_x$
denotes the date (time variable) of virus detection for plant $x$, $1_{[0, t_{i-1}]}(T_x)$ is an
indicator of whether time $T_x$ falls in the interval $[0, t_{i-1}]$, and $w_{x,y}$ denotes the
weights, which are non-zero only if $x$ and $y$ are neighbors located within the same
subarea. There are several possible weight functions, as we will describe later in this
section.
We will utilize a similar space-time definition of Moran's index. Since the
partitioned areas vary from one time to the next, we define $D_k^t$ as subarea $k \geq 1$
at time $t$, and we define a Moran-type autocorrelation statistic on each measurable
subset $D_k^t$, where $w_{u,u'}$ denotes the spatial weight between points $u$ and $u'$ of
disease/event occurrence ($w_{u,u} = 0$), $T_u$ and $T_{u'}$ denote the times of detection of
$u$ and $u'$, and $1_{[t-1,t)}(T_u, T_{u'})$ is an indicator of whether times $T_u$ and $T_{u'}$ both
fall in the interval $[t-1, t)$.
In the definition of Moran's autocorrelation index, $w_{u,u'}$ represents a spatial
weight between any two distinct event locations generated within disc $D_k^t$, which
could be a function of, e.g.,
(i) the inverse distance between two points;
(ii) the inverse distance squared between two points;
(iii) an estimate of the autocorrelation/semivariance statistic;
(iv) the "geographical" weights defined as
$$w_{ij} = \begin{cases} e^{-d_{ij}/r} & j \neq i, \\ 0 & \text{otherwise,} \end{cases}$$
where $r$ is the maximum distance in the minimum tree that spans all points and
does not have any nodes that link back to itself, as in Murakami et al. (2017).
As stated in Chen (2012), selection of the weight function objectively is an open
question. Different weight functions have different spatial effects. Later in Sect. 4,
we give reasons why we choose a particular weight function.
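As a concrete reference point, here is a minimal R sketch of the classical global Moran's I (to which the proposed statistic is linked below), using inverse-distance weights as in option (i) with zero self-weight; the function name and example data are illustrative.

```r
morans_I <- function(coords, z) {
  W <- 1 / as.matrix(dist(coords))     # inverse distance between points
  diag(W) <- 0                         # w_uu = 0: no self-weight
  zc <- z - mean(z)
  n  <- length(z); S0 <- sum(W)
  (n / S0) * drop(t(zc) %*% W %*% zc) / sum(zc^2)
}

## e.g., six random locations with an indicator of event occurrence:
set.seed(42)
xy <- cbind(runif(6), runif(6))
morans_I(xy, z = c(1, 0, 1, 1, 0, 0))  # about -1/(n-1) under independence
```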
Also, while values of the Moran-type statistic may not always be between $-1$
and 1 (the full range depends on the weights), they are in the same spirit:
larger values indicate greater autocorrelation.
We will use a modified version of Geary's C autocorrelation index (Geary 1954;
Sokal et al. 1998) as a comparison to evaluate the trend between the two measures.
We define it as $C_k^t$, with $w_{u,u'}$ as in Eq. (2) and $W_0 = \sum_{u, u' \in D^t} w_{u,u'}$, the sum
of weights for all points generated across all discs at time $t$.
Cliff and Ord (1981) derived the moments of Moran's index under the assumption
that observations are random independent drawings from normal populations. Under
this assumption, the expected value of Moran's index is given as
$$E(I) = \frac{n}{S_0}\, \frac{E\left[\sum_{i,j} w_{ij} Z_{u(t)} Z_{u'(t)}\right]}{E\left[\sum_i Z_i^2\right]} = -\frac{1}{n - 1},$$
where $S_0 = \sum_{i,j} w_{ij}$. To link our modified Moran-type autocorrelation to the global
Moran's index, it can be viewed in terms of
$$Z_{u(t)} = 1_{D_k^t}(T_u) = \begin{cases} 1 & \text{if } u \text{ appeared in } D_k^t \text{ in the time period } [t-1, t), \\ 0 & \text{otherwise.} \end{cases}$$
Then $Z_{u(t)}$ is the indicator of the occurrence of an event within the disc centered at an
event from the previous time point.
Given the assumption that observations are random independent drawings within
the same disc from a given distribution, the expected value of the proposed
spatio-temporal Moran's autocorrelation statistic can be derived in a similar manner.
4 Simulation Study
(Figure: two panels (a) and (b) showing the simulated discs labeled 1–6 at consecutive time points)
The process is described at time t = 3 with disc 1 having 10 subareas (see Fig. 3),
disc 2 having 7 subareas, etc. The non-zero Moran’s values for disc 1 at time t = 3
are listed in Table 2.
After five time intervals, the clustering of points generated is displayed in Fig. 3,
with red shades representing the most dense areas and white shades representing
areas with no instances of disease. Such a representation would not have been clearly
observed if the area was not partitioned and if the rate of spread was not tabulated
in space and time. This framework will allow us to detect clusters and estimate the
density.
5 Model Analysis
Here we investigate any trend in our Moran’s values based on the number of points
generated and area of the disc. Figure 4 shows area and number of points plotted
versus Moran’s value. The plots of Moran’s values for time t = 2 and for disc 1
at t = 3 and t = 4 show an approximate linear trend between Moran’s values and
number of points generated, even though there is no apparent relationship between
area and Moran’s values.
The 3-D plot in Fig. 5 shows that at each successive time point the Moran's values,
plotted against the number of points generated, become progressively larger. This is
because the number of points generated increases with time, while the growth in area
moderates as time increases. This results in points that are closer together, thus
increasing the sum of inverse distances and the Moran's value.
Focusing on the six points generated at t = 1 and the associated discs at t = 2,
we investigate the idea of correlation between two consecutive Poisson distributions.
Table 3 shows the number of points generated within each of the original six discs
at times t = 2, 3, 4, 5 and also includes the area of each disc. The total number of
points generated is also shown and each of these points has an associated Moran’s
statistic at the next time point.
Fig. 4 Trend analysis (a) t = 2, (b) t = 3 for disc 1, and (c) t = 4 for disc 1
(Fig. 5: 3-D plot of Moran's values against the number of points and the disc area, across times t = 1 to t = 5)
6 Conclusion
Our model has shown that global properties of spatio-temporal disease
spread can be captured by modifying the space and adding time efficiently. The
results show that long-range spread features and variability yielded spatial clustering
of occurrence events. To obtain a quantitative assessment of the performance of
the Moran's values, we compared them with adjusted Geary's C statistics. We
controlled for targeted subareas, and such an approach provides a framework for
understanding the "disorganized" or "disordered" evolution of some natural organisms
in the form of cluster analysis. The distribution of subareas can be found based on
our density estimation or concentration under the proposed spatio-temporal Moran's
index.
Extensions to dynamics other than the Poisson will be considered in future
studies, widening the scope of the spatio-temporal model by incorporating
covariates.
Acknowledgements The research for this paper was made possible by financial support from the
U.S. Naval Academy. Material contained herein is solely the responsibility of the authors and is
made available for the purpose of peer review and discussion. Its contents do not necessarily reflect
the views of the Department of the Navy or the Department of Defense.
References
Anselin, L.: Local indicators of spatial association - LISA. Geogr. Anal. 27(2), 93–115 (1995).
https://doi.org/10.1111/j.1538-4632.1995.tb00338.x
Baddeley, A., Rubak, E., Turner, R.: Spatial Point Patterns: Methodology and Applications with R.
CRC Press, Boca Raton (2016)
Chen, Y.: On the four types of weight functions for spatial contiguity matrix. Lett. Spat. Resour.
Sci. 5, 65–72 (2012). https://doi.org/10.1007/s12076-011-0076-6
Cliff, A., Ord, J.: Space-time modelling with an application to regional forecasting. Trans. Inst. Br.
Geogr. 64, 119–128 (1975). https://doi.org/10.2307/621469
Cliff, A., Ord, J.: Spatial Processes - Models and Applications. Pion Limited, London (1981)
Geary, R.C.: The contiguity ratio & statistical mapping. Inc. Stat. 5(3), 115–145 (1954). https://
doi.org/10.2307/2986645
Jones-Todd, C.M., Swallow, B., Illian, J.B., Toms, M.: A spatio-temporal multispecies model of
a semicontinuous response. J. R. Stat. Soc. Ser. C 67(3), 705–722 (2018). https://doi.org/10.
1111/rssc.12250
Kallenberg, O.: Foundations of Modern Probability, 2nd ed. Springer, New York (2001)
Kieu, K., Adamczyk-Chauvat, K., Monod, H., Stoica, R.S.: A completely random T-tessellation
model and Gibbsian extensions. Spat. Stat. 6, 118–138 (2013). http://dx.doi.org/10.1016/j.
spasta.2013.09.003
Lawson, A.: Bayesian Disease Mapping: Hierarchical Modeling in Spatial Epidemiology. Chap-
man & Hall/Crc Interdisciplinary Statistics Series. CRC Press, Boca Raton (2009)
Lee, J., Li, S.: Extending Moran’s index for measuring spatiotemporal clustering of geographic
events. Geogr. Anal. 49, 36–57 (2017). https://doi.org/10.1111/gean.12106
Martin, R.L., Oeppen, J.E.: The identification of regional forecasting models using space:time
correlation functions. Trans. Inst. Br. Geogr. 66, 95–118 (1975). https://doi.org/10.2307/621623
Meddens, A.J.H., Hicke, J.A.: Spatial and temporal patterns of Landsat-based detection of tree
mortality caused by a mountain pine beetle outbreak in Colorado, USA. For. Ecol. Manag. 322,
78–88 (2014). https://doi.org/10.1016/j.foreco.2014.02.037
Moran, P.A.P.: Notes on continuous stochastic phenomena. Biometrika 37(1–2), 17–23 (1950).
https://doi.org/10.1093/biomet/37.1-2.17
Murakami, D., Yoshida, T., Seay, H., Griffith, D.A., Yamagata, Y.: A Moran coefficient-based
mixed effects approach to investigate spatially varying relationships. Spat. Stat. 19, 68–69
(2017). https://doi.org/10.1016/j.spasta.2016.12.001
Pace, R.K., LeSage, J.P.: Omitted variable and spatially dependent variables. In: Páez, A., et al.,
(eds.) Progress in Spatial Analysis: Advances in Spatial Science, pp. 17–28. Springer, Berlin
(2009)
Reitzner, M., Spodarev, E., Zaporozhets, D.: Set reconstruction by Voronoi cells. Adv. Appl.
Probab. 44(4), 938–953 (2012). https://doi.org/10.1239/aap/1354716584
Resnick, S.I.: Adventures in Stochastic Processes. Birkhäuser, Boston (2002). https://doi.org/10.
1007/978-1-4612-0387-2
Sokal, R.R., Oden, N.L., Thomson, B.A.: Local spatial autocorrelation in a biological model.
Geogr. Anal. 30(4), 331–354 (1998). https://doi.org/10.1111/j.1538-4632.1998.tb00406.x
Thäle, C., Yukich, J.E.: Asymptotic theory for statistics of Poisson-Voronoi approximation.
Bernoulli 22(4), 2372–2400 (2016). https://doi.org/10.3150/15-BEJ732
Vaillant, J., Puggioni, G., Waller, L.A., Daugrois, J.: A spatio-temporal analysis of the spread of
sugarcane yellow leaf virus. J. Time Ser. Anal. 32, 392–406 (2011). https://doi.org/10.1111/j.
1467-9892.2011.00730.x
Wang, Y.F., He, H.L.: Spatial Data Analysis Method. Science Press, Beijing (2007)
Wang, H., Cheng, Q., Zuo, R.: Quantifying the spatial characteristics of geochemical pattern via
GIS-based geographically weighted statistics. J. Geochem. Explor. 157, 110–119 (2015)
Zhou, H., Lawson, A.B.: EWMA smoothing and Bayesian spatial modeling for health surveillance.
Stat. Med. 27, 5907–5928 (2008). https://doi.org/10.1002/sim.3409