Multivariate Statistical Functions in R
Michail Tsagris
mtsagris@uoc.gr
Version 9.7
1 June 2021
Contents
1 Some things about R 1
1.1 A few tips for faster implementations . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Parallel computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Duration of a process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Libraries required . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.2.4 Principal components regression . . . . . . . . . . . . . . . . . . . . . . 58
4.2.5 Principal components regression for binary and count data . . . . . . . 63
4.2.6 Ridge regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3 Discriminant analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3.1 Fisher’s linear discriminant function . . . . . . . . . . . . . . . . . . . . 76
4.3.2 Repeated cross validation for linear and quadratic discriminant analysis 79
4.3.3 A simple model selection procedure in discriminant analysis . . . . . . 81
4.3.4 Box-Cox transformation in discriminant analysis . . . . . . . . . . . . . 83
4.3.5 Regularised discriminant analysis . . . . . . . . . . . . . . . . . . . . . 84
4.3.6 Discriminant analysis with mixed data . . . . . . . . . . . . . . . . . . 90
4.3.7 Discriminant analysis for multinomial data . . . . . . . . . . . . . . . . 93
5 Distributions 98
5.1 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.1.1 Kullback-Leibler divergence between two multivariate normal populations . . . 98
5.1.2 Estimation of the parameters of a multivariate log-normal distribution 98
5.1.3 Estimation of the parameters of a multivariate t distribution . . . . . . 99
5.1.4 Estimation of the parameters of a multivariate Laplace distribution . . 103
5.1.5 Estimation of the parameters of an inverted Dirichlet distribution . . . 104
5.1.6 Multivariate kernel density estimation . . . . . . . . . . . . . . . . . . . 106
5.1.7 Bivariate Poisson distribution . . . . . . . . . . . . . . . . . . . . . . . . 110
5.1.8 A goodness of fit test for the bivariate Poisson . . . . . . . . . . . . . . 115
5.1.9 Estimating the parameters of a Dirichlet-Multinomial distribution . . 117
5.2 Random values generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.2.1 Random values generation from a multivariate normal distribution . . 120
5.2.2 Random values generation of covariance matrices (Wishart distribution) . . . 121
5.2.3 Random values generation from a multivariate t distribution . . . 122
5.2.4 Random values generation from a multivariate Laplace distribution . . . 122
5.2.5 Random values generation from a Dirichlet or an inverted Dirichlet distribution . . . 123
5.3 Contour plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.3.1 Contour plot of the bivariate normal, t and skew-normal distributions 124
5.3.2 Contour plot of a bivariate log-normal distribution . . . . . . . . . . . 127
5.3.3 Contour plot of a bivariate inverted Dirichlet distribution . . . . . . . 128
5.3.4 Contour plot of a kernel density estimate . . . . . . . . . . . . . . . . . 128
5.3.5 Contour plot of the bivariate Poisson distribution . . . . . . . . . . . . 129
6 Covariance, principal component analysis and singular value decomposition 131
6.1 Fast covariance and correlation matrices . . . . . . . . . . . . . . . . . . . . . . 131
6.2 Fast Mahalanobis distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.3 Fast column-wise variances or standard deviations . . . . . . . . . . . . . . . . 132
6.4 Multivariate standardization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.5 Choosing the number of principal components using SVD . . . . . . . . . . . 134
6.6 Choosing the number of principal components using probabilistic PCA . . . . 137
6.7 Confidence interval for the percentage of variance retained by the first κ components . . . 139
6.8 A metric for covariance matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.9 The Helmert matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.10 The Moore-Penrose pseudo-inverse matrix . . . . . . . . . . . . . . . . . . . . 142
6.11 A not so useful pseudo-inverse matrix . . . . . . . . . . . . . . . . . . . . . . . 143
6.12 Exponential of a square matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.4.6 Contour plot of a kernel density estimation in S2 . . . . . . . . . . . . . 193
8.5 The α-transformation for compositional data . . . . . . . . . . . . . . . . . . . 196
8.5.1 The α-transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
8.5.2 The α-distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
8.5.3 The Fréchet mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
8.5.4 Profile log-likelihood of α . . . . . . . . . . . . . . . . . . . . . . . . . . 200
8.6 Regression for compositional data . . . . . . . . . . . . . . . . . . . . . . . . . 204
8.6.1 Regression using the additive log-ratio transformation . . . . . . . . . 204
8.6.2 Simple Dirichlet regression . . . . . . . . . . . . . . . . . . . . . . . . . 206
8.6.3 Mixed Dirichlet regression . . . . . . . . . . . . . . . . . . . . . . . . . 209
8.6.4 OLS regression for compositional data . . . . . . . . . . . . . . . . . . . 212
8.6.5 Multinomial logit regression (or Kullback-Leibler divergence based regression for compositional data) . . . 215
8.6.6 ESOV (Kullback-Leibler divergence based) regression . . . . . . . . . . 218
8.6.7 The α-regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
8.6.8 Regression for compositional data with compositional covariates . . . 228
8.6.9 Univariate regression where the independent variables are compositional data using the α-transformation . . . 230
8.7 Model based clustering for compositional data . . . . . . . . . . . . . . . . . . 233
8.7.1 Fitting a mixture model . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
8.7.2 Choosing the optimal mixture model via BIC . . . . . . . . . . . . . . . 236
8.7.3 Simulation of random values from a normal mixture model . . . . . . 237
8.8 Discriminant analysis (classification) for compositional data . . . . . . . . . . 238
8.8.1 The k-NN algorithm with the power transformation for compositional data . . . 238
8.8.2 The k-NN algorithm with the α-metric . . . . . . . . . . . . . . . . . . . 246
8.8.3 Regularised discriminant analysis with the α-transformation . . . . . . 249
9.5.5 Tangential approach for testing the equality of the concentration parameters . . . 266
9.5.6 Analysis of variance without assuming equality of the concentration parameters . . . 268
9.6 Circular correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
9.6.1 Circular-circular correlation I . . . . . . . . . . . . . . . . . . . . . . . . 269
9.6.2 Circular-circular correlation II . . . . . . . . . . . . . . . . . . . . . . . . 270
9.6.3 Circular-linear correlation . . . . . . . . . . . . . . . . . . . . . . . . . . 271
9.7 Regression for circular data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
9.7.1 Regression for circular data using the von Mises distribution . . . . . . 272
9.7.2 Projected bivariate normal for circular regression . . . . . . . . . . . . 273
10.8.1 Simulation from a von Mises-Fisher distribution . . . . . . . . . . . . . 314
10.8.2 Simulation from a Bingham distribution . . . . . . . . . . . . . . . . . . 316
10.8.3 Simulation from a Fisher-Bingham distribution . . . . . . . . . . . . . 320
10.8.4 Simulation of random values from a von Mises-Fisher mixture model 323
10.9 Contour plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
10.9.1 Contour plots of the von Mises-Fisher distribution . . . . . . . . . . . . 323
10.9.2 Contour plots of the Kent distribution . . . . . . . . . . . . . . . . . . . 325
10.9.3 Contour plots of the Kent distribution fitted to spherical data . . . . . 326
10.9.4 Contour plots of a von Mises-Fisher kernel density estimate on the sphere . . . 328
10.9.5 Contour plots of a von Mises-Fisher mixture model on the sphere . . . 329
10.10 Discriminant analysis for (hyper-)spherical (and circular) data . . . . . . . . . 330
10.10.1 Discriminant analysis using the von Mises-Fisher distribution . . . . . 330
10.10.2 Discriminant analysis using the k-NN algorithm . . . . . . . . . . . . . 333
10.11 Model based clustering using mixtures of von Mises-Fisher distributions . . . 337
10.11.1 Fitting a mixture model . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
10.11.2 Choosing the number of components of the mixture model . . . . . . . 342
10.12 Lambert's equal area projection . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
Prologue
The motivation for writing these functions was to offer some form of an alternative R package with simple (and easy to modify) functions. A summary of the theory is also provided, so that one does not have to read a whole chapter or a paper to understand the key message. Most of the functions are not available in any R package, but R is a very popular statistical language and packages are uploaded very frequently, so some of the functions may already exist in other packages, may even be built-in functions, or may have been added to packages after I wrote them here. Some functions have been tested using example data sets found in the references. The others were tested numerically; for example, do the hypothesis testing procedures estimate the type I error correctly?
As I update the versions, by adding new material, I also check for mistakes and correct them, so I would suggest you keep the newest version. However, I am afraid mistakes will still be around, and they are, so corrections or any comments are most welcome. Note also that I have added a log of changes, so that anybody can track the changes from version to version; within a single version I sometimes also upload updates with corrections. The log can be found at the end of the document, just before the references. I know that even this version needs a bit of polishing and in some cases more explanation of the algorithms. These are being done, albeit slowly.
This document has somehow changed direction and now more functions for univariate data are included, for example ridge regression, kernel regression and principal components regression. The reason they are included here is their use of multivariate techniques, such as PCA.
I decided to start using R packages inside my functions. The reason for this is that they give me more flexibility and the ability to add more. For instance, the package doParallel allows me to do parallel calculations and thus speed up many functions, which is also beneficial for the readers.
Another slight change of direction is towards education. What I mean is that I have added functions which are of little use to practitioners or researchers, but which help in understanding certain things. Students will find them very useful for gaining insight. For example, the relationship between Hotelling's T2 and the James test, or the bootstrap correlation coefficient. In the second case there are two functions, one non-vectorised and one vectorised. This serves two goals. The first is to get an idea of how you can vectorise (if possible) your code (an educational reason and also a helpful tip) and the second is time. In small to moderate sample sizes the non-vectorised one is slower, but in big samples the non-vectorised one is faster(?!). For the tuning of the bandwidth in the kernel density estimate via maximization of the pseudo log-likelihood I present two functions, one using parallel computation and another one without. If you have small datasets the second version is faster. But if you have big data and many cores, then maybe the first function is faster (I have not tried it and do not know).
Feel free to contribute your own functions and you will be credited of course. If you
want the functions in a .txt or .R format please send me an e-mail. If you cannot download
this document for some reason, send me an e-mail as well.
I would like to express my gratitude to Andrew Rattray (postgraduate student at the
university of Nottingham during 2012-2013) for pointing out a mistake in the Box’s M test
code.
A very good (in my opinion) manual with R functions is written by Paul Hewson. Don’t
forget to see R Inferno.
Professor Andy Wood and Dr Simon Preston from the university of Nottingham are
highly appreciated for being my supervisors during my PhD (compositional data analysis)
and post-doc (directional data analysis).
Georgios Pappas from the university of Nottingham helped me construct the contour
plots of the von Mises-Fisher and the Kent distribution.
Christopher Fallaize and Theo Kypraios from the university of Nottingham have pro-
vided a function for simulating from the Bingham distribution using rejection sampling. So
any questions regarding this function should be addressed to them.
Kwang-Rae Kim from the university of Nottingham helped me create a front end with
Matlab.
Marco Maier from the Institute for Statistics and Mathematics in Vienna suggested that I change the format of my R functions and told me about the R package formatR. He even sent me the command with an example, so he is greatly appreciated.
Giorgos Borboudakis from the Foundation of Research and Technology (FORTH) and
member of the MXM group pointed out to me a not so clear message in the algorithm of
generating random values from the von Mises-Fisher distribution.
Panagiotis (pronounced Panayiotis) Tzirakis (master student at the computer science department in Herakleion) showed me how to perform parallel computing in R and he is greatly acknowledged and appreciated, not only by me but by all the readers of this document. He also helped me with the vectorization of some contour plot functions.
Giorgos Athineou (master student at the computer science department in Herakleion)
has taught me a few tips in R and he is greatly appreciated.
Professor John Kent from the university of Leeds is acknowledged for clarifying one
thing with the β (ovalness) parameter in his distribution.
Professor Changliang Zou from the Nankai University is greatly acknowledged for send-
ing me his function for outlier identification for high-dimensional data (function rmdp).
Manos Papadakis (undergraduate in the computer science department in Herakleion)
pointed out the need to avoid matrix multiplications. That takes time. Indeed, you can
check the functions spat.med and spat.med old to see for yourselves. He also gave inspiration
for many more functions to be implemented efficiently.
Kleio Lakiotaki (post-doc at the department of computer science in Herakleion) showed me the potential of the function outer and the amazing speed of prcomp. The second command should be used for principal component analysis when you have matrices with more than 100 variables; the more variables, the bigger the difference from doing eigen(cova(x)).
1 Some things about R
1.1 A few tips for faster implementations
I will show a few tips for faster computations. If you want more tips, have a look at this unpublished paper (Tsagris and Papadakis, 2018). In small samples and/or small dimensionalities you may see small differences, but in bigger datasets the differences show up. You might observe a time difference of only 5 seconds in the whole process. I saw a difference from 40 down to 12 seconds for example. That was very successful. In another case from 40 seconds to 22. Still successful. But you will not always see differences of this kind. Sometimes one tip gives you 1 second, then another tip another second, and so on, until you save 5 seconds. If you have 1000 simulations, then you save 5000 seconds. Even small decreases matter. Perhaps for someone who needs a simple car, a very expensive car or a jeep might not be of such use, especially if he or she does not go to the village. But for the user who needs a jeep, every bit of computational power, every second he/she can gain, matters.
The nlm function is much faster than optim for optimization purposes, but optim is more reliable and robust. Try both in your examples or cases, check whether they give the same results, and choose. Or use nlm first, followed by optim. The exponential term in the multivariate normal can be calculated either using matrices or simply with the command mahalanobis. If you have many observations, many dimensions and/or many groups, this can save you a lot of time (I have seen this).
Suppose you want to calculate the cross product X^T X of an n × p matrix X, for example. The command crossprod(X) will do the job faster than performing the matrix multiplication t(X) %*% X explicitly.
Next, suppose you want to center some data. You can try with apply, for example, or use one of the following alternatives
m <- colMeans(data)
n <- dim(data)[1] ; p <- dim(data)[2]
## each of the lines below is an alternative way of centring the data
y <- t( t(data) - m )
data <- scale(data, center = TRUE, scale = FALSE)
data <- sweep(data, 2L, m)
data <- data - rep( m, rep(n, p) ) ## looks like the fastest
See also Gaston Sanchez’s webpage for a comparison of these. Or you can compare the
times yourself.
If you want to extract the mean vector of each group you can use a loop (the for function) or the command aggregate, where ina is a numerical variable indicating the group. A faster alternative is the built-in command rowsum.
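An illustrative sketch of these alternatives (the data and the group indicator here are simulated only to show the idea; ina is assumed to take the values 1, . . . , k):
x <- matrix( rnorm(300 * 4), ncol = 4 )
ina <- rep(1:3, each = 100)                 ## three groups
k <- max(ina)
m1 <- matrix(nrow = k, ncol = 4)
for (i in 1:k)  m1[i, ] <- colMeans( x[ina == i, ] )   ## the for loop way
m2 <- aggregate(x, by = list(ina), mean)    ## slower, returns a data frame
m3 <- rowsum(x, ina) / tabulate(ina)        ## usually the fastest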
Vectorisation is another key tip: instead of looping over the elements of x to square them one by one, simply write y <- x^2. Of course, this is a very easy example, but you see my point. Vectorisation requires a lot of thinking and is not always applicable but, if it can be done, things can be super faster. See the bootstrap correlation coefficient for example, where I have two functions, boot.correl with a for loop and bootcor, which is vectorised.
Use apply or aggregate, which we saw before, whenever possible. But use colMeans or colSums instead of apply(x, 2, mean) to get the mean vector of a sample, because it is faster. For the median, though, you have to use apply(x, 2, median) instead of a for loop going over every column of the matrix. Imagine you have a dependent variable y and many independent variables x_i, and you want to perform a regression of y on every x_i and obtain the betas of every simple linear regression. You can do a for loop or you can do this
2
funa <- function(x) coef( lm(y~x) )
apply(x, 2, funa)
What if you have an array of matrices and want to calculate the sum or the mean of all the matrices? The obvious answer is to use apply(x, 1:2, mean). However, R works in a column-wise rather than a row-wise fashion. Instead of apply you can try t( colSums( aperm(x) ) ) and t( colMeans( aperm(x) ) ) for the sum and mean operations respectively.
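A small sketch of this comparison (the array here is simulated):
x <- array( rnorm(4 * 4 * 100), dim = c(4, 4, 100) )   ## 100 matrices of dimension 4 x 4
s1 <- apply(x, 1:2, sum)             ## straightforward but slower
s2 <- t( colSums( aperm(x) ) )       ## same result, exploiting column-wise operations
all.equal(s1, s2)                    ## TRUE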
If you want the full matrix of distances, with zeros on the diagonal and the upper triangular part filled in, do not use the command as.matrix(dist(x)); use dist(x, diag = TRUE, upper = TRUE) instead. Also, the package fields (Nychka et al., 2015) has a function called rdist which is faster than the built-in dist in R. Suppose you want the Euclidean distance of a single vector from many others (say thousands of them). The inefficient way is to calculate the distance matrix of all points and take the row which corresponds to your vector. The efficient way is to use the Mahalanobis distance with the identity as the covariance matrix.
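A minimal sketch of this idea, with the points stored in the rows of a matrix y and x0 the single vector (both simulated here):
y <- matrix( rnorm(10000 * 5), ncol = 5 )   ## many points, one per row
x0 <- rnorm(5)                              ## a single vector
## Euclidean distances via mahalanobis() with the identity covariance matrix
a <- sqrt( mahalanobis(y, center = x0, cov = diag(5)) )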
Can we make the above faster? The answer is yes, by avoiding the matrix multiplications. You see, the matrix multiplications are performed in C++ using a for loop. Even though it is fast, it is not as fast as you might think; FORTRAN, for example, is much faster.
z <- y - x
a <- sqrt( colSums(z^2) )
Try both ways and see. Check the spatial median section, where I have kept two functions, one with the Mahalanobis distance and one with the above trick. Use large data and check the time required by either function; you will be amazed.
Speaking of the Mahalanobis distance, check my function mahala, which is twice as fast as the built-in function mahalanobis.
As for the covariance and correlation matrices, I have found a nice way described by Andrey A. Shabalin, which is presented in Section 6.1. I have compared its time with other packages and functions, and it still does better. The difference from the standard R functions becomes more apparent as you move to more than 1,000 dimensions.
The function mean is slower than sum(x)/length(x). If you type sum you will see it is a .Primitive function, whereas crossprod and colMeans are both .Internal ones. By the way, colMeans is a really fast function. My 2 points are a) create your own functions, you will be surprised to see that you may do faster than R's built-in functions (it does not always work that way) and b) use .Internal functions whenever possible. An example of the first point is the var function: create your own and you will see it is faster. An example of the second point is the function colVars, which uses colMeans.
Search for functions that take less time. For example, the command lm.fit(x, y) is the workhorse behind lm(y ~ x), meaning that the first one is used by the second one to give you the nice output. But if you only need the coefficients, for example, then use the first one. The syntax is a bit different, as x must be the design matrix, but the speed is very different, especially in the big cases. Finally, the multinomial regression is offered in the package VGAM, but it is also offered in the package nnet; the implementation in the second package is much faster. The same is true for the implementation of the ordinal logistic regression in VGAM and in ordinal: the latter package does it much faster. Many fast functions can also be found in the package Rfast (Papadakis et al., 2019).
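A hedged sketch of the lm.fit idea with simulated data (the variable names are illustrative):
y <- rnorm(100)
x <- matrix( rnorm(100 * 3), ncol = 3 )
b1 <- coef( lm(y ~ x) )                        ## convenient, but does extra work
b2 <- lm.fit( cbind(1, x), y )$coefficients    ## needs the design matrix, but is faster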
If you have a function in which some parameters have to be positive, do not use constrained optimization; instead, put an exponential inside the function. The parameter can take any value on the whole of R, but inside the function its exponentiated form is used. In the end, simply take the exponential of the returned value. As for its variance, use the ∆ method (Casella and Berger, 2002). If you did not understand this, check the MLE of the inverted Dirichlet distribution and the Dirichlet regression (the φ parameter).
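A toy sketch of the idea (not taken from the document's functions): estimating the rate of an exponential distribution with nlm by optimising over the logarithm of the parameter.
x <- rexp(100, rate = 2)                        ## simulated data
minus.loglik <- function(t, x) {
  la <- exp(t)                                  ## la > 0 for any real t
  - sum( dexp(x, rate = la, log = TRUE) )       ## minus the log-likelihood
}
mod <- nlm(minus.loglik, p = 0, x = x)
exp(mod$estimate)                               ## back-transform; should be close to 2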
Speaking of the Dirichlet distribution, I have two functions for estimating the parameters of this distribution: one which uses nlm and optim, and another one which uses the Newton-Raphson algorithm. I did some simulations and I saw that the Newton-Raphson can be 10 times faster. The same is true for the circular regression (spml.reg), where I use the E-M algorithm. Switching to the E-M or the Newton-Raphson algorithm and not relying on the nlm command can save you a lot of time. If you want to write some code and the description of the E-M or the Newton-Raphson algorithm is available, because somebody described it in a paper for example, or you can derive it yourself, then do it.
I found this article (pages 18-20) by Douglas Bates very useful and in fact I have taken
some tips from there.
Some examples of classical computations and their more efficient counterparts are the following.
t(X) %*% Y #### classical
crossprod(X, Y) #### more efficient
t(X) %*% X #### classical
crossprod(X) #### more efficient
X %*% t(Y) #### classical
tcrossprod(X, Y) #### more efficient
Sticking with solve(X): if you only want to invert a (symmetric, positive definite) matrix, such as a covariance matrix, then you should use chol2inv( chol( X ) ), as it is faster.
Douglas Bates mentions in the same article that calculating X^T Y in R as t(X) %*% Y instead of crossprod(X, Y) causes X to be transposed twice: once in the calculation of t(X) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices.
The trace of the square of a symmetric matrix A, tr(A^2), can be evaluated either via
sum( diag( A %*% A ) )
or, faster, via
sum(A * A)
or
sum(A^2)
Moving in the same spirit, suppose you want the diagonal of the cross product of two matrices, that is, the diagonal of t(A) %*% B.
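A sketch of the slow and fast ways (A and B are simulated matrices of the same dimensions):
A <- matrix( rnorm(500 * 10), ncol = 10 )
B <- matrix( rnorm(500 * 10), ncol = 10 )
d1 <- diag( t(A) %*% B )        ## forms the whole 10 x 10 matrix first
d2 <- diag( crossprod(A, B) )   ## avoids the explicit transpose
d3 <- colSums(A * B)            ## computes only the diagonal; fastest
all.equal(d1, d3)               ## TRUE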
Suppose you have two matrices A and B and a vector x, and you want to find ABx (the dimensions must match, of course).
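A sketch of the two orderings discussed next (with simulated A, B and x):
A <- matrix( rnorm(200 * 200), ncol = 200 )
B <- matrix( rnorm(200 * 200), ncol = 200 )
x <- rnorm(200)
y1 <- A %*% B %*% x       ## first case: matrix-by-matrix, then matrix-by-vector
y2 <- A %*% (B %*% x)     ## second case: two matrix-by-vector products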
In the first case you have a matrix-by-matrix multiplication followed by a matrix-by-vector one. In the second case you have a matrix-by-vector product, which gives a vector, and then another matrix-by-vector product, so you do fewer calculations. The final tip is to avoid unnecessary and/or extra calculations and to avoid doing the same calculations more than once.
As for the eigen-value decomposition, there are two ways to do the multiplication
s <- cov(iris[, 1:4])
eig <- eigen(s, symmetric = TRUE)
vec <- eig$vectors
lam <- eig$values
vec %*% diag(lam) %*% t(vec)   ## classical way
vec %*% ( t(vec) * lam )       ## faster way
If you have an iterative algorithm, such as Newton-Raphson, E-M or a fixed point algorithm, and you stop when the vector of parameters does not change any further, do not use rbind, cbind or c() to store the whole history. Store only two values, vec.old and vec.new.
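A concrete toy sketch of the pattern (a Newton-Raphson iteration for the square root of 10, keeping only the last two values; avoid something like vecs <- rbind(vecs, vec.new) at every iteration):
vec.new <- 5
vec.old <- Inf
while ( abs(vec.new - vec.old) > 1e-10 ) {
  vec.old <- vec.new
  vec.new <- 0.5 * ( vec.old + 10 / vec.old )   ## one Newton-Raphson step
}
vec.new   ## close to sqrt(10)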
So, every time keep two vectors only, not the whole sequence of vectors. The same is true for the log-likelihood or whatever quantity you monitor. Unless you want a trace of how things change, in which case keep everything. Otherwise, apart from being faster, it also helps the computer run faster since less memory is used. See the functions spat.med and spat.med old to get an idea. This tip is due to Manos Papadakis (undergraduate student of the computer science department in Herakleion).
Avoid unnecessary calculations. In a discriminant analysis setting for example there is
no need to calculate constant parts, such as log (2π ), every time for each group and every
iteration. This only adds time and takes memory and does not affect the algorithm or the
result.
When working with arrays it is more efficient to have them transposed. For example, if you have K covariance matrices of dimension p × p, you would normally create an array of dimensions c(p, p, K); make its dimensions c(K, p, p) instead. If, for example, you want to divide each matrix by a different scalar (number), in the first case you will have to use a for loop, whereas in the transposed case you just divide the array by the vector of the numbers you have.
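A small sketch of the transposed-array trick (the array and the scalars are simulated):
K <- 3 ; p <- 4
s <- array( rnorm(K * p * p), dim = c(K, p, p) )  ## K matrices stored in the transposed layout
w <- c(2, 5, 10)                                  ## one scalar per matrix
a <- s / w                                        ## recycling divides the i-th matrix s[i, , ] by w[i]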
1.2 Parallel computing
Panagiotis Tzirakis showed me how to perform parallel computing in R; he is greatly acknowledged not only by me, but also by the readers of these notes, since they will save time as well.
The idea is to use a library that allows parallel computing. Panayiotis suggested the doParallel package (which uses the foreach package) and that is what I will use from now on. Below are some instructions on how to use the package in order to perform parallel computing. In addition, I have included parallel computing as an option in some functions, and in some others I have created a separate function for this purpose. So, if you do not understand the notes below, you can always look at the functions throughout this text.
## requires(doParallel)
## Create a set of copies of R running in parallel and communicating
## over sockets.
cl <- makePSOCKcluster(nc) ## nc is the number of clusters you
## want to use
registerDoParallel(cl) ## register the parallel backend with the
## foreach package.
## Now suppose you want to run R simulations, could be
## R=1000 for example
## Divide the number of simulations to smaller equally
## divided chunks.
## Each chunk for a core.
ba <- round( rep(R/nc, nc) )
## Then each core will receive a chunk of simulations
ww <- foreach(j = 1:nc,.combine = rbind) %dopar% {
## see the .combine = rbind. This will put the results in a matrix.
## Every results will be saved in a row.
## So if you have matrices, make them vectors. If you have lists
## you want to return,
## you have to think about it.
a <- test(arguments, R = ba[j], arguments)$results
## Instead of running your function "test" with R simulations
## you run it with R/nc simulations.
## So a stores the result of every chunk of simulations.
return(a)
}
stopCluster(cl) ## stop the cluster of the connections.
To see your outcome all you have to type is ww and you will see something like this
result.1 .....
result.2 .....
result.3 .....
result.4 .....
So, the object ww contains the results you want to see, in matrix form. If each replication returns a single number, ww will be a matrix with 1 column. We will see more cases later on. Note that if you choose to use parallel computing for something simple, the multi-core analysis might take the same or a bit more time than the single-core analysis, simply because it requires a couple of seconds to set up the cluster of cores. In addition, you might use 4 cores, yet the time might only be halved compared to 1 core. This could be because not all 4 cores work at 100% of their capacity. Of course you can always experiment with these things and see.
1.3 Duration of a process
To measure how much time a piece of code needs, you can do the following
ti <- proc.time()
## put your function here
ti <- proc.time() - ti
## ti gives you 3 numbers (all in seconds) like the ones below
user system elapsed
0.18 0.07 3.35
The elapsed is what you want. Alternatively you can download the package microbenchmark, which allows you to compare two or more functions, measuring the time even in nanoseconds.
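A minimal sketch of how microbenchmark can be used (comparing two ways of getting column means on simulated data):
library(microbenchmark)
x <- matrix( rnorm(1000 * 10), ncol = 10 )
microbenchmark( colMeans(x), apply(x, 2, mean), times = 100 )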
1.4 Libraries required
• library(nnet) for the multinomial regression.
• library(robust) for the forward search and in general for a robust covariance matrix
using MCD.
2 Hypothesis testing for mean vectors
In this section we shall see many approaches for hypotheses regarding one sample and two
sample mean vectors.
$$T^2 = \frac{n\,(n-p)}{(n-1)\,p}\,\left(\bar{X}-\mu\right)^T S^{-1}\left(\bar{X}-\mu\right) \qquad (2.1)$$
Under the null hypothesis, the above test statistic follows the $F_{p,n-p}$ distribution. The bootstrap version of the one-sample multivariate generalization of the simple t-test is also included in the function. An extra argument (R) indicates whether bootstrap calibration should be used or not. If R = 1, then the asymptotic theory applies; if R > 1, then a bootstrap p-value is computed and the number of re-samples is equal to R.
result <- list(m = m, info = info)
}
if (R > 1) {
## bootstrap calibration
tb <- numeric(R)
mm <- - m + M
y <- x + rep( mm, rep(n, p) ) ## brings the data
## under the null hypothesis, i.e. mean vector equal to M
for (i in 1:R) {
b <- sample(n, n, replace = TRUE, prob = NULL)
yb <- y[b, ]
sb <- cov(yb)
mb <- Rfast::colmeans(yb)
dmb <- mb - M
tb[i] <- dmb %*% solve(sb, dmb)
}
tb <- n * (n - p) / (n - 1) / p * tb
pvalue <- ( sum(tb > test) + 1 )/(R + 1) ## bootstrap p-value
if ( graph ) {
hist(tb, xlab = "Bootstrapped test statistic", main = " ")
abline(v = test, lty = 2, lwd = 2) ## The dotted vertical line
## is the test statistic value
}
result <- list(m = m, pvalue = pvalue)
}
result
}
11
where S is the pooled covariance matrix, calculated under the assumption of equal covariance matrices,
$$S = \frac{(n_1-1)S_1 + (n_2-1)S_2}{n_1+n_2-2}.$$
The test statistic
$$F = \frac{(n_1+n_2-p-1)\,T^2}{(n_1+n_2-2)\,p}$$
follows the F distribution with p and $n_1+n_2-p-1$ degrees of freedom. Similar to the one-sample test, an extra argument (R) indicates whether bootstrap calibration should be used or not. If R = 1, then the asymptotic theory applies; if R > 1, then a bootstrap p-value is computed and the number of re-samples is equal to R. The estimate of the common mean used in the bootstrap to transform the data under the null hypothesis is the mean vector of the combined sample, that is, of all the observations.
The built-in command manova does exactly the same thing; try it, and the asymptotic F test is what you should look at. In addition, this command allows hypothesis testing for the mean vectors of more than two groups. I noticed this command after I had written my function; nevertheless, as I mention in the introduction, this document has an educational character as well.
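A quick sketch of this check with simulated data (the variable names are illustrative):
x1 <- matrix( rnorm(50 * 4), ncol = 4 )
x2 <- matrix( rnorm(60 * 4), ncol = 4 )
x <- rbind(x1, x2)
ina <- factor( rep(1:2, c(50, 60)) )
summary( manova(x ~ ina) )   ## the approximate F test to compare with hotel2T2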
hotel2T2 <- function(x1, x2, a = 0.05, R = 999, graph = FALSE) {
## x1 and x2 are the two multivariate samples a is the level
## of significance, which by default is set to 0.05
## R is the number of bootstrap replicates
## set by default to 999
## if R=1 no bootstrap will be implemented
## Bootstrap is used for the p-value
p <- dim(x1)[2] ## dimensionality of the data
n1 <- dim(x1)[1] ## size of the first sample
n2 <- dim(x2)[1] ## size of the second sample
n <- n1 + n2 ## total sample size
xbar1 <- Rfast::colmeans(x1) ## sample mean vector of the first sample
xbar2 <- Rfast::colmeans(x2) ## sample mean vector of the second sample
dbar <- xbar2 - xbar1 ## difference of the two mean vectors
mesoi <- rbind(xbar1, xbar2)
rownames(mesoi) <- c("Sample 1", "Sample 2")
if ( is.null(colnames(x1)) ) {
colnames(mesoi) <- colnames(mesoi) <- paste("X", 1:p, sep = "")
} else colnames(mesoi) <- colnames(x1)
v <- ( (n1 - 1) * cov(x1) + (n2 - 1) * cov(x2) )/(n - 2)
## v is the pooled covariance matrix
t2 <- n1 * n2 * (dbar %*% solve(v, dbar) ) /n
test <- as.vector( (n - p - 1) * t2 / (n - 2) / p ) ## test statistic
if (R <= 1) {
crit <- qf(1 - a, p, n - p - 1) ## critical value of the F distribution
pvalue <- pf(test, p, n - p - 1, lower.tail = FALSE) ## p-value
info <- c(test, pvalue, crit, p, n - p - 1)
names(info) <- c("test", "p-value", "critical", "numer df", "denom df")
result <- list(mesoi = mesoi, info = info)
}
if (R > 1) {
## bootstrap calibration
mc <- Rfast::colmeans( rbind(x1, x2) ) ## combined sample mean vector
## the next two rows bring the mean vectors of the two sample equal
## to the combined mean and thus equal under the null hypothesis
mc1 <- - xbar1 + mc
mc2 <- - xbar2 + mc
y1 <- x1 + rep( mc1, rep(n1, p) )
y2 <- x2 + rep( mc2, rep(n2, p) )
tb <- numeric(R)
for (i in 1:R) {
b1 <- sample(1:n1, n1, replace = TRUE)
b2 <- sample(1:n2, n2, replace = TRUE)
yb1 <- y1[b1, ] ; yb2 <- y2[b2, ]
db <- Rfast::colmeans(yb1) - Rfast::colmeans(yb2) ## means difference
vb <- ( (n1 - 1) * cov(yb1) + (n2 - 1) * cov(yb2) ) / (n - 2)
## vb is the pooled covariance matrix
tb[i] <- n1 * n2 * ( db %*% solve(vb, db) ) / n
}
tb <- (n - p - 1) * tb / (n - 2) / p
pvalue <- ( sum(tb > test) + 1 )/(R + 1)
if ( graph ) {
hist(tb, xlab = "Bootstrapped test statistic", main = " ")
abline(v = test, lty = 2, lwd = 2) ## The line is the test statistic
}
result <- list(mesoi = mesoi, pvalue = pvalue)
}
result
}
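An illustrative call of the above function with simulated data (made up only to show the syntax):
x1 <- matrix( rnorm(50 * 4), ncol = 4 )
x2 <- matrix( rnorm(60 * 4, mean = 0.3), ncol = 4 )
hotel2T2(x1, x2, R = 1)     ## asymptotic F test
hotel2T2(x1, x2, R = 999)   ## bootstrap calibrated p-value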
$$W = \sum_{i=1}^{g}\left(n_i-1\right)\Sigma_i \quad \text{and} \quad B = \sum_{i=1}^{g} n_i\left(\bar{X}_i-\bar{X}\right)\left(\bar{X}_i-\bar{X}\right)^T,$$
and Wilks' Λ is defined as
$$\Lambda = \frac{|W|}{|W+B|} = \frac{1}{\left|I_p + W^{-1}B\right|}.$$
The test statistic is
$$\frac{n-p-1}{p}\,\frac{1-\Lambda}{\Lambda} \sim F_{p,\,n-p-1} \ \text{ if } g=2 \quad \text{and} \quad \frac{n-p-2}{p}\,\frac{1-\sqrt{\Lambda}}{\sqrt{\Lambda}} \sim F_{2p,\,2(n-p-2)} \ \text{ if } g=3.$$
As for the other cases, I found a paper by Todorov and Filzmoser (2010), who mention that a popular approximation to Wilks' Λ is Bartlett's χ² approximation, given by
$$-\left(n-1-\frac{p+g}{2}\right)\log\Lambda \sim \chi^2_{p(g-1)}.$$
In the function below a message will appear mentioning which approximation has been used, the F or the χ².
ni <- tabulate(ina) ## group sample sizes
n <- dim(x)[1] ## total sample size
g <- max(ina) ## number of groups
p <- dim(x)[2] ## dimensionality of the data
m <- rowsum(x, ina) / ni
me <- Rfast::colmeans(x) ## total mean vector
y <- sqrt(ni) * (m - rep(me, rep(g, p)) )
B <- crossprod(y)
Tot <- cov(x) * (n - 1)
lam <- det(Tot - B) / det(Tot)
if (g == 2 ) {
stat <- (n - p - 1 ) / p * (1 - lam)/lam
pvalue <- pf( stat, p, n - p - 1, lower.tail = FALSE )
note <- paste("F approximation has been used")
} else if (g == 3) {
stat <- (n - p - 2 )/p * (1 - sqrt(lam)) / sqrt(lam)
pvalue <- pf( stat, 2 * p, 2 * (n - p - 2), lower.tail = FALSE )
note <- paste("F approximation has been used")
} else {
stat <- -( n - 1 - (p + g)/2 ) * log(lam)
pvalue <- pchisq( stat, p * (g - 1), lower.tail = FALSE )
note <- paste("Chi-square approximation has been used")
}
result <- c(stat, pvalue)
names(result) <- c(’stat’, ’p-value’)
list(note = note, result = result)
}
$$T_u^2 = \left(\bar{X}_1-\bar{X}_2\right)^T \tilde{S}^{-1}\left(\bar{X}_1-\bar{X}_2\right), \quad \text{with} \quad \tilde{S} = \tilde{S}_1 + \tilde{S}_2 = \frac{S_1}{n_1} + \frac{S_2}{n_2}. \qquad (2.2)$$
James (1954) suggested that the test statistic is compared with $2h(\alpha)$, a corrected χ² distribution whose form is
$$2h(\alpha) = \chi^2\left(A + B\chi^2\right),$$
where
$$A = 1 + \frac{1}{2p}\sum_{i=1}^{2}\frac{\left[\mathrm{tr}\left(\tilde{S}^{-1}\tilde{S}_i\right)\right]^2}{n_i-1} \quad \text{and} \quad B = \frac{1}{p(p+2)}\left\{\sum_{i=1}^{2}\frac{\mathrm{tr}\left[\left(\tilde{S}^{-1}\tilde{S}_i\right)^2\right]}{n_i-1} + \frac{1}{2}\sum_{i=1}^{2}\frac{\left[\mathrm{tr}\left(\tilde{S}^{-1}\tilde{S}_i\right)\right]^2}{n_i-1}\right\}.$$
If you want to do bootstrap to get the p-value, then you must transform the data under the null hypothesis. The estimate of the common mean is given by (Aitchison, 2003)
$$\hat{\boldsymbol{\mu}}_c = \left(n_1 S_1^{-1} + n_2 S_2^{-1}\right)^{-1}\left(n_1 S_1^{-1}\bar{X}_1 + n_2 S_2^{-1}\bar{X}_2\right) = \left(\tilde{S}_1^{-1} + \tilde{S}_2^{-1}\right)^{-1}\left(\tilde{S}_1^{-1}\bar{X}_1 + \tilde{S}_2^{-1}\bar{X}_2\right). \qquad (2.3)$$
The modified Nel and van der Merwe (1986) test is based on the same quadratic form as that of James, but the distribution used to compare the value of the test statistic is different. It is shown in Krishnamoorthy and Yu (2004) that, approximately, $T_u^2 \sim \frac{\nu p}{\nu-p+1}F_{p,\,\nu-p+1}$, where
$$\nu = \frac{p+p^2}{\frac{1}{n_1}\left\{\mathrm{tr}\left[\left(\tilde{S}_1\tilde{S}^{-1}\right)^2\right] + \left[\mathrm{tr}\left(\tilde{S}_1\tilde{S}^{-1}\right)\right]^2\right\} + \frac{1}{n_2}\left\{\mathrm{tr}\left[\left(\tilde{S}_2\tilde{S}^{-1}\right)^2\right] + \left[\mathrm{tr}\left(\tilde{S}_2\tilde{S}^{-1}\right)\right]^2\right\}}.$$
The algorithm is taken from Krishnamoorthy and Xia (2006). The R code for both versions (with the option of a bootstrap p-value) is the following
james <- function(y1, y2, a = 0.05, R = 999, graph = FALSE) {
## y1 and y2 are the two samples
## a is the significance level and
## if R==1 the James test is performed
## if R==2 the Nel and van der Merwe test is performed
## if R>2 bootstrap calculation of the p-value is performed
## 999 bootstrap resamples are set by default
## Bootstrap is used for the p-value
## if graph is TRUE, the bootstrap statistics are plotted
p <- dim(y1)[2] ## dimensionality of the data
n1 <- dim(y1)[1] ; n2 <- dim(y2)[1] ## sample sizes
n <- n1 + n2 ## the total sample size
ybar1 <- Rfast::colmeans(y1) ## sample mean vector of the first sample
ybar2 <- Rfast::colmeans(y2) ## sample mean vector of the second sample
dbar <- ybar2 - ybar1 ## difference of the two mean vectors
mesoi <- rbind(ybar1, ybar2)
rownames(mesoi) <- c("Sample 1", "Sample 2")
if ( is.null(colnames(y1)) ) {
colnames(mesoi) <- paste("X", 1:p, sep = "")
} else colnames(mesoi) <- colnames(y1)
A1 <- cov(y1)/n1
A2 <- cov(y2)/n2
V <- A1 + A2 ## covariance matrix of the difference
Vinv <- chol2inv( chol(V) ) ## same as solve(V), but faster
test <- sum( dbar %*% Vinv * dbar )
b1 <- Vinv %*% A1
b2 <- Vinv %*% A2
trb1 <- sum( diag(b1) )
trb2 <- sum( diag(b2) )
if (R == 1) {
## James test
A <- 1 + ( trb1^2/(n1 - 1) + trb2^2/(n2 - 1) ) / (2 * p)
B <- ( sum(b1 * b1) / (n1 - 1) + sum(b2 * b2)/(n2 - 1) +
0.5 * trb1 ^ 2/ (n1 - 1) + 0.5 * trb2^2/(n2 - 1) ) / (p * (p + 2))
x2 <- qchisq(1 - a, p)
delta <- (A + B * x2)
twoha <- x2 * delta ## corrected critical value of the chi-square
pvalue <- pchisq(test/delta, p, lower.tail = FALSE) ## p-value
info <- c(test, pvalue, delta, twoha)
names(info) <- c("test", "p-value", "correction", "corrected.critical")
note <- paste("James test")
result <- list(note = note, mesoi = mesoi, info = info)
} else if (R == 2) {
## MNV test
low <- ( sum( b1^2 ) + trb1^2 ) / n1 +
( sum( b2^2 ) + trb2^2 ) / n2
v <- (p + p^2) / low
test <- as.numeric( (v - p + 1) / (v * p) * test ) ## test statistic
crit <- qf(1 - a, p, v - p + 1) ## critical value of the F distribution
pvalue <- pf(test, p, v - p + 1, lower.tail = FALSE) ## p-value
info <- c(test, pvalue, crit, p, v - p + 1)
names(info) <- c("test", "p-value", "critical", "numer df", "denom df")
note <- paste("MNV variant of James test")
result <- list(note = note, mesoi = mesoi, info = info)
} else if (R > 2) {
## bootstrap calibration
runtime <- proc.time()
a1inv <- chol2inv( chol(A1) )
a2inv <- chol2inv( chol(A2) )
mc <- solve( a1inv + a2inv ) %*% ( a1inv %*% ybar1 + a2inv %*% ybar2 )
## mc is the combined sample mean vector
## the next two rows bring the mean vectors of the two sample equal
## to the combined mean and thus equal under the null hypothesis
mc1 <- - ybar1 + mc
mc2 <- - ybar2 + mc
x1 <- y1 + rep( mc1, rep(n1, p) )
x2 <- y2 + rep( mc2, rep(n2, p) )
tb <- numeric(R)
for (i in 1:R) {
b1 <- sample(1:n1, n1, replace = TRUE)
b2 <- sample(1:n2, n2, replace = TRUE)
xb1 <- x1[b1, ] ; xb2 <- x2[b2, ]
db <- Rfast::colmeans(xb1) - Rfast::colmeans(xb2) ## means difference
Vb <- cov(xb1) / n1 + cov(xb2) / n2 ## covariance matrix of the difference
tb[i] <- sum( db %*% solve(Vb, db ) )
}
pvalue <- ( sum(tb > test) + 1 ) / (R + 1)
if ( graph ) {
hist(tb, xlab = "Bootstrapped test statistic", main = " ")
abline(v = test, lty = 2, lwd = 2) ## The line is the test statistic
}
note <- paste("Bootstrap calibration")
runtime <- proc.time() - runtime
result <- list(note = note, mesoi = mesoi, pvalue = pvalue,
runtime = runtime)
}
result
}
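An illustrative call with simulated data (made up only to show the use of the R argument):
y1 <- matrix( rnorm(40 * 3), ncol = 3 )
y2 <- matrix( rnorm(60 * 3, mean = 0.2), ncol = 3 )
james(y1, y2, R = 1)     ## James test
james(y1, y2, R = 2)     ## Nel and van der Merwe (MNV) variant
james(y1, y2, R = 999)   ## bootstrap calibrated p-value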
2.5 Relationship between Hotelling's T2 and the James test
Emerson (2009, pg. 76-81) mentioned a very nice result relating the one-sample Hotelling T2 and the James test for two mean vectors,
$$J = T_1^2\left(\hat{\boldsymbol{\mu}}_c\right) + T_2^2\left(\hat{\boldsymbol{\mu}}_c\right) = \min_{\boldsymbol{\mu}}\left[T_1^2\left(\boldsymbol{\mu}\right) + T_2^2\left(\boldsymbol{\mu}\right)\right], \qquad (2.4)$$
where J is the James test statistic (2.2) and $T_1^2(\boldsymbol{\mu})$ and $T_2^2(\boldsymbol{\mu})$ are the two one-sample Hotelling T2 test statistics (2.1) of each sample from their common mean (2.3). In fact, the James test statistic is found by minimising the right hand side of (2.4) with respect to µ; the sum is minimised when µ takes the form of (2.3). The same is true for the t-test in the univariate case.
I have created a small code illustrating this result, so this one is for educational purposes.
It calculates the James test statistic, the sum of the two T2 test statistics, the common mean
vector and the one found via numerical optimization. In the univariate case, the common
mean vector is a weighted linear combination of the two sample means. So, if we take a
segment connecting the two means, the common mean is somewhere on that segment. This
is not true however for the multivariate case.
names(tests) <- c("James test", "T1(mu) + T2(mu)")
list(tests = tests, mathematics.mean = t(mc), optimised.mean = bar$par )
}
$$J = \sum_{i=1}^{k}\left(\bar{x}_i-\bar{X}\right)^T W_i\left(\bar{x}_i-\bar{X}\right), \qquad (2.5)$$
where $\bar{x}_i$ and $n_i$ are the sample mean vector and sample size of the i-th sample respectively, $W_i = \left(S_i/n_i\right)^{-1}$, where $S_i/n_i$ is the covariance matrix of the i-th sample mean vector, and $\bar{X}$ is the estimate of the common mean, $\bar{X} = \left(\sum_{i=1}^{k}W_i\right)^{-1}\sum_{i=1}^{k}W_i\bar{x}_i$. We used the corrected χ² distribution James (1954) proposed and no bootstrap calibration.
In case you do not have access to James's paper, see page 11 of this document (or send me an e-mail). Normally one would compare the test statistic (2.5) with $\chi^2_{r,1-\alpha}$, where $r = p(k-1)$ are the degrees of freedom, with k denoting the number of groups and p the dimensionality of the data. There are r constraints (how many univariate means must be equal, so that the null hypothesis, that all the mean vectors are equal, holds true); that is where these degrees of freedom come from. James compared the test statistic (2.5) with a corrected χ² distribution instead. Let A and B be
$$A = 1 + \frac{1}{2r}\sum_{i=1}^{k}\frac{\left[\mathrm{tr}\left(I_p - W^{-1}W_i\right)\right]^2}{n_i-1} \quad \text{and} \quad B = \frac{1}{r(r+2)}\sum_{i=1}^{k}\left\{\frac{\mathrm{tr}\left[\left(I_p - W^{-1}W_i\right)^2\right]}{n_i-1} + \frac{\left[\mathrm{tr}\left(I_p - W^{-1}W_i\right)\right]^2}{2\left(n_i-1\right)}\right\},$$
where $W = \sum_{i=1}^{k}W_i$.
k <- max(ina) ## the number of groups
p <- dim(x)[2] ## the dimensionality
n <- dim(x)[1] ## the total sample size
## the objects below will be used later
me <- mi <- W <- matrix(nrow = k, ncol = p)
ta <- numeric(k)
wi <- array( dim = c(p, p, k) )
## the next for function calculates the
## mean vector and covariance matrix of each group
for (i in 1:k) {
zi <- x[ina == i, ]
mi[i, ] <- Rfast::colmeans( zi )
wi[, , i] <- ni[i] * chol2inv( chol( var( zi ) ) )
me[i, ] <- mi[i, ] %*% wi[, , i]
}
W <- t( colSums( aperm(wi) ) )
Ws <- solve(W)
ma <- Rfast::colsums(me)
mesi <- Ws %*% ma ## common mean vector
t1 <- t2 <- numeric(k)
Ip <- diag(p)
for (i in 1:k) {
ta[i] <- sum( (mi[i,] - mesi) * ( wi[, , i] %*% (mi[i, ] - mesi) ) )
exa1 <- Ip - Ws %*% wi[, , i]
t1[i] <- sum( diag(exa1) )
t2[i] <- sum( exa1^2 )
}
test <- sum(ta) ## the test statistic
r <- p * (k - 1)
A <- 1 + sum( t1^2/(ni - 1) ) / 2 / r
B <- sum( t2 / (ni - 1) + t1^2 / 2 / (ni - 1) ) / r / (r + 2)
x2 <- qchisq(1 - a, r)
delta <- (A + B * x2)
twoha <- x2 * delta ## corrected critical value of chi-square distribution
pvalue <- pchisq(test/delta, r, lower.tail = FALSE) ## p-value
result <- c(test, delta, twoha, pvalue)
names(result) <- c("test", "correction", "corr.critical", "p-value")
result
}
2.7 Relationship between James's MANOVA and Hotelling's T2
The relationship we saw for the James two-sample test (2.4) is true for the case of the MANOVA as well. The estimate of the common mean (2.3) is in general, for g groups, each of sample size $n_i$, written as
$$\hat{\boldsymbol{\mu}}_c = \left(\sum_{i=1}^{g} n_i S_i^{-1}\right)^{-1}\sum_{i=1}^{g} n_i S_i^{-1}\bar{X}_i.$$
The next R code is just a proof of the mathematics you will find in Emerson (2009, pg. 76-81) and is again intended for educational purposes.
and Y of sample size n1 and n2 respectively. Their corresponding sample mean vectors and
covariance matrices are X̄, Ȳ and S1 , S2 respectively. The assumption here is the same as
that of the Hotelling’s test we saw before.
Let us define the pooled covariance matrix at first, calculated under the assumption of
equal covariance matrices
$$S_n = \frac{(n_1-1)S_1 + (n_2-1)S_2}{n},$$
where $n = n_1 + n_2$. Then define
$$B_n = \sqrt{\frac{n^2}{(n+2)(n-1)}\left\{\mathrm{tr}\left(S_n^2\right) - \frac{1}{n}\left[\mathrm{tr}\left(S_n\right)\right]^2\right\}}.$$
The test statistic is
$$Z = \frac{\frac{n_1 n_2}{n_1+n_2}\left(\bar{X}-\bar{Y}\right)^T\left(\bar{X}-\bar{Y}\right) - \mathrm{tr}\left(S_n\right)}{\sqrt{\frac{2(n+1)}{n}}\,B_n}. \qquad (2.6)$$
Under the null hypothesis (equality of the two mean vectors) the test statistic (2.6) follows the standard normal distribution. Chen et al. (2010) mention that Bai and Saranadasa (1996) established the asymptotic normality of the test statistic and showed that it has an attractive power property when p/n → c < ∞ and under some restriction on the maximum eigenvalue of the common population covariance matrix. However, the requirement of p and n being of the same order is too restrictive for the "large p small n" situation. For this reason Chen et al. (2010) proposed a modification of the test statistic we showed. Their test statistic is more general, allows for unequal covariance matrices and is applicable in the "large p small n" situation. This procedure, along with some others, can be found in the R package highD2pop created by Gregory (2014).
The code for the test proposed by Bai and Saranadasa (1996) is given below. Note that both x and y must be matrices.
trSn <- sum( z1^2 )/n + sum( z2^2 ) /n
trSn2 <- sum(Sn^2)/n^2
Bn <- sqrt( n^2/( (n + 2) * (n - 1) ) * (trSn2 - trSn^2/n) )
up <- n1 * n2/(n1 + n2) * sum( (m1 - m2)^2 ) - trSn
down <- sqrt( 2 * (n + 1)/n ) * Bn
Z <- up / down ## test statistic
pvalue <- pnorm(Z, lower.tail = FALSE)
res <- c(Z, pvalue)
names(res) <- c(’Z’, ’p-value’)
res
}
The contrast matrix C has k − 1 independent rows and, if there is no treatment effect, $C\boldsymbol{\mu} = \mathbf{0}$. For more information see Ranjan Maitra's teaching slides Paired Comparisons and Repeated Measures.
The test statistic is
$$T_r^2 = \frac{(n-k+1)}{(n-1)(k-1)}\,n\left(C\bar{x}\right)^T\left(C S C^T\right)^{-1}\left(C\bar{x}\right) \sim F_{k-1,\,n-k+1}.$$
## x is the data set
## a is the level of significance set by default to 0.05
m <- Rfast::colmeans(x)
s <- cov(x) ## sample mean vector and covariance matrix
n <- dim(x)[1] ## sample size
p <- dim(x)[2] ## dimensionality of the data
C <- - diag(p)
C[, 1] <- 1
A <- C %*% m
B <- solve(C %*% s %*% C, A)
T2 <- n * sum(A * B)
test <- (n - p + 1) / (n - 1) / (p - 1) * T2 ## test statistic
3 Hypothesis testing for covariance matrices
This section comprises tests regarding one or more covariance matrices.
The log-likelihood ratio test statistic for the hypothesis $H_0: \Sigma = \Sigma_0$ is
$$-2\log\lambda = n\,\mathrm{tr}\left(\Sigma_0^{-1}S\right) - n\log\left|\Sigma_0^{-1}S\right| - np, \qquad (3.1)$$
where n is the sample size, $\Sigma_0$ is the specified covariance matrix under the null hypothesis, S is the sample covariance matrix and p is the dimensionality of the data (or the number of variables). Let α and g denote the arithmetic mean and the geometric mean respectively of the eigenvalues of $\Sigma_0^{-1}S$, so that $\mathrm{tr}\left(\Sigma_0^{-1}S\right) = p\alpha$ and $\left|\Sigma_0^{-1}S\right| = g^p$; then (3.1) becomes
$$-2\log\lambda = np\left(\alpha - \log g - 1\right).$$
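A minimal sketch implementing (3.1); the function name cov.equal and its exact form are illustrative and not necessarily the document's own function.
cov.equal <- function(x, sigma0) {
  ## x: numeric data matrix, sigma0: hypothesised covariance matrix
  n <- dim(x)[1]   ;   p <- dim(x)[2]
  S <- cov(x)
  B <- solve(sigma0, S)                        ## Sigma_0^{-1} S
  stat <- n * sum( diag(B) ) - n * log( det(B) ) - n * p
  dof <- 0.5 * p * (p + 1)                     ## degrees of freedom
  pvalue <- pchisq(stat, dof, lower.tail = FALSE)
  c(stat = stat, df = dof, p.value = pvalue)
}
x <- matrix( rnorm(100 * 3), ncol = 3 )        ## simulated data
cov.equal( x, diag(3) )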
3.2 Multi-sample covariance matrices
We will show the two versions of Box’s test for the hypothesis test of the equality of at least
two covariance matrices: H0 : Σ 1 = . . . = Σ k . The algorithms are taken from Aitchison,
2003, pg. 155 and Mardia et al., 1979, pg. 140.
At first we will see the likelihood-ratio test. This is the multivariate generalization of Bartlett's test of homogeneity of variances. The test has the form
$$-2\log\lambda = n\log|S| - \sum_{i=1}^{k} n_i\log|S_i| = \sum_{i=1}^{k} n_i\log\left|S_i^{-1}S\right|, \qquad (3.2)$$
where $S_i$ is the i-th sample biased covariance matrix and $S = n^{-1}\sum_{i=1}^{k} n_i S_i$ is the m.l.e. of the common covariance matrix (under the null hypothesis), with $n = \sum_{i=1}^{k} n_i$. The degrees of freedom of the asymptotic chi-square distribution are $\frac{1}{2}p\left(p+1\right)\left(k-1\right)$.
crit <- qchisq(1 - a, dof) ## critical value of chi-square distribution
res <- c(test, pvalue, dof, crit)
names(res) <- c(’test’, ’p-value’, ’df’, ’critical’)
res
}
According to Mardia et al., 1979, pg. 140, it may be argued that if $n_i$ is small, then (3.2) gives too much weight to the contribution of S. This consideration led Box (1949) to propose another test statistic in place of that given in (3.2). Box's M is given by
$$M = \gamma\sum_{i=1}^{k}\left(n_i-1\right)\log\left|S_i^{-1}S_p\right|,$$
where
$$\gamma = 1 - \frac{2p^2+3p-1}{6\left(p+1\right)\left(k-1\right)}\left(\sum_{i=1}^{k}\frac{1}{n_i-1} - \frac{1}{n-k}\right)$$
and $S_i$ and $S_p$ are the i-th unbiased covariance estimator and the pooled covariance matrix respectively, with
$$S_p = \frac{\sum_{i=1}^{k}\left(n_i-1\right)S_i}{n-k}.$$
for (i in 1:k) mat[, , i] <- cov(x[ina == i, ])
mat1 <- ni * mat
pame <- apply(mat, 3, det) ## the determinant of each covariance matrix
## the next 2 lines calculate the pooled covariance matrix
Sp <- colSums( aperm(mat1) ) / ( n - k )
pamela <- det(Sp) ## determinant of the pooled covariance matrix
test1 <- sum( (nu - 1) * log(pamela/pame) )
gama1 <- ( 2 * (p^2) + 3 * p - 1 ) / ( 6 * (p + 1) * (k - 1) )
gama2 <- sum( 1/(nu - 1) ) - 1/(n - k)
gama <- 1 - gama1 * gama2
test <- gama * test1 ## this is the M (test statistic)
dof <- 0.5 * p * (p + 1) * (k - 1) ## degrees of freedom of
## the chi-square distribution
pvalue <- pchisq(test, dof, lower.tail = FALSE) ## p-value
crit <- qchisq(1 - a, dof) ## critical value of chi-square distribution
result <- c(test, pvalue, dof, crit)
names(result) <- c(’M.test’, ’p-value’, ’df’, ’critical’)
result
}
4 Correlation, regression and discriminant analysis
In this section we will present functions for correlation, multivariate regression and discrim-
inant analysis.
4.1 Correlation
4.1.1 Correlation coefficient confidence intervals and hypothesis testing using Fisher’s
transformation
Fisher's transformation of the correlation coefficient r is defined as
$$\hat{z} = \frac{1}{2}\log\frac{1+r}{1-r}, \qquad (4.1)$$
with inverse transformation
$$r = \frac{\exp\left(2\hat{z}\right)-1}{\exp\left(2\hat{z}\right)+1}.$$
The estimated standard error of (4.1) is $\frac{1}{\sqrt{n-3}}$ (Efron and Tibshirani, 1993, pg. 54). If, on the other hand, you choose to calculate Spearman's correlation coefficient, the estimated standard error is slightly different, $\simeq \frac{1.029563}{\sqrt{n-3}}$ (Fieller et al., 1957, Fieller and Pearson, 1957). R calculates confidence intervals in a different way and does hypothesis testing for zero values only. The following function calculates asymptotic confidence intervals based upon (4.1), assuming asymptotic normality of (4.1), and performs hypothesis testing for any (not only zero) hypothesised value of the true correlation. The sampling distribution used, though, is a $t_{n-3}$.
zh1 <- 0.5 * log( (1 + r) / (1 - r) ) ## Fisher transformation for H1
se <- 1.029563 / sqrt(n - 3) ## standard error for Fisher transformation of Ho
}
test <- (zh1 - zh0)/se ## test statistic
pvalue <- 2 * ( 1 - pt( abs(test), n - 3 ) ) ## p-value
zL <- zh1 - qt(1 - a/2, n - 3) * se
zH <- zh1 + qt(1 - a/2, n - 3) * se
fishL <- (exp(2 * zL) - 1)/(exp(2 * zL) + 1) ## lower confidence limit
fishH <- (exp(2 * zH) - 1)/(exp(2 * zH) + 1) ## upper confidence limit
ci <- c(fishL, fishH)
names(ci) <- paste(c( a/2 * 100, (1 - a/2) * 100 ), "%", sep = "")
r0 <- seq( max(-0.99, r - 0.2), min(0.99, r + 0.2), by=0.001 )
z0 <- 0.5 * log( (1 + r0) / (1 - r0) ) ## Fisher’s transformation
## for many Hos
stat <- abs(zh1 - z0)/se ## test statistics
pval <- 2 * pt( -abs(stat), n - 3 ) ## p-values
if ( plot ) {
par( mfrow = c(1,2) )
plot(r0, stat, type = "l", xlab = "Correlation values",
ylab = "Test statistic")
abline(h = qnorm(0.975), col = 2)
abline( v = min( r0[stat < qt(0.975, n - 3)] ), col = 3, lty = 3 )
abline( v = max( r0[stat < qt(0.975, n - 3)] ), col = 3, lty = 3 )
plot(r0, pval, type = "l", xlab = "Correlation values",
ylab = "P-values")
abline(h = a, col = 2)
abline(v = min(r0[pval > a]), col = 3, lty = 3)
abline(v = max(r0[pval > a]), col = 3, lty = 3)
}
4.1.2 Non-parametric (bootstrap and permutation) hypothesis testing for a zero correla-
tion coefficient
We show how to perform a non-parametric bootstrap hypothesis testing that the correlation
coefficient is zero. A good pivotal statistic is the Fisher’s transformation (4.1). Then the data
have to be transformed under the null hypothesis (ρ = 0). This is doable via the eigen-
analysis of the covariance matrix. We transform the bivariate data such that the covariance
(and thus the correlation) matrix equals the identity matrix (see the function of standardiza-
tion for more information about this). We remind that the correlation matrix is independent
of measurements and is location free. The next step is easy, we draw bootstrap samples
(from the transformed data) and every time we calculate the Fisher’s transformation. The
bootstrap p-value is calculated in the usual way (Davison and Hinkley, 1997).
}
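A minimal sketch of the procedure just described; the function name boot.cor0 is illustrative and this is not necessarily identical to the boot.correl function mentioned earlier.
boot.cor0 <- function(x, y, R = 999) {
  z <- cbind(x, y)
  n <- dim(z)[1]
  r <- cor(x, y)
  test <- 0.5 * log( (1 + r) / (1 - r) )          ## Fisher's transformation
  eig <- eigen( cov(z), symmetric = TRUE )
  A <- eig$vectors %*% ( t(eig$vectors) / sqrt(eig$values) )
  u <- z %*% A                                    ## transformed data: identity covariance
  tb <- numeric(R)
  for (i in 1:R) {
    b <- sample(n, n, replace = TRUE)             ## resample the rows (pairs)
    rb <- cor(u[b, 1], u[b, 2])
    tb[i] <- 0.5 * log( (1 + rb) / (1 - rb) )
  }
  ( sum( abs(tb) >= abs(test) ) + 1 ) / (R + 1)   ## bootstrap p-value
}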
If you want to perform a non-parametric bootstrap hypothesis test for a value of the correlation other than zero, the procedure is similar. The data have already been transformed such that their correlation is zero. Now, instead of the zeros in the off-diagonal values of the identity matrix, you will have the values of the correlation matrix you want to test. Eigen analysis of the matrix is performed and the square root of the matrix is used to multiply the transformed data. I could write a more general function to include all cases, but I will leave this task to you. If you do write it, please send it to me and I will put it here, with your name of course.
The next function is a vectorised version of the above function. Instead of using a for loop you can do things vectorised. This idea came to me when I found the vectorised bootstrap correlation by Neto (2015). I cannot say I understood fully what he did, so I decided to write my own code based on the direction he pointed.
Pearson's correlation coefficient of x and y for a sample of size n is given by
$$r = \frac{\sum_{i=1}^{n}x_iy_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum_{i=1}^{n}x_i^2 - n\bar{x}^2\right)\left(\sum_{i=1}^{n}y_i^2 - n\bar{y}^2\right)}}. \qquad (4.2)$$
So, we can see that we need 5 terms: $\sum_{i=1}^{n}x_iy_i$, $\bar{x}$, $\bar{y}$, $\sum_{i=1}^{n}x_i^2$ and $\sum_{i=1}^{n}y_i^2$. After transforming the data under the null hypothesis using the spectral decomposition, we proceed as follows, with B denoting the number of resamples.
1. Set a seed number in R, such as 123456. This is to make sure that the pairs of ( xi , yi )
are still the same.
2. Sample with replacement B × n values of x and put them in a matrix with n rows and B columns, named XB.
3. Sample with replacement B × n values of y and put them in a matrix with n rows and B columns, named YB.
4. Calculate the column-wise means of XB and YB. These are the means of the bootstrap samples of x and y respectively.
5. Calculate the column-wise sums of XB^2 and YB^2. These are the sums of the squares of the bootstrap samples of x and y respectively.
6. Finally, calculate the column-wise sums of XB ∗ YB. This gives the term $\sum_{i=1}^{n}x_iy_i$ for all resamples.
So we now have 5 vectors containing the 5 terms we want. We calculate the correlation coefficients (4.2) and then Fisher's transformation (4.1), and so we have B bootstrap test statistics. In order to see the time gain I tested both of these functions with B = 9999 resamples and 1000 repetitions. The function boot.correl required 538 seconds, whereas the function bootcor required 140. The time is reduced to 1/4 of its initial value. The gain is not super wow, I would like it to be 1/10, but even so, it is still good. Parallelised versions reduce the time to 1/3, so from this perspective I did better. If we now put parallelisation inside this vectorised version, computations will be even faster. I leave this to you.
But I noticed one thing, the same thing Neto (2015) mentions. For big sample sizes, for example 1000 pairs, the time difference is not that big and perhaps the for loop is faster. The big difference is in the small to moderate sample sizes, at least for this example. What I mean by this is that you should not be afraid and say: then why bother? If I have a big sample, I do not need vectorization. Maybe so, but even then I still recommend it. Maybe someone else will have a better alternative for vectorization which is better even in big samples, for the correlation of course. In the contour plots, though, vectorised versions are always faster no matter what.
hist(tb, xlab = "Bootstrapped test statistic", main = " ")
abline(v = test, lty = 2, lwd = 2)
## The dotted vertical line is the test statistic value
result <- c(r, pvalue)
names(result) <- c(’correlation’, ’p-value’)
result
}
Next is a permutation based p-value for the same test. The idea is this: instead of transforming the data under the null hypothesis and re-sampling with replacement, we can permute the observations. The basic difference is that the data are assumed to be under the null hypothesis already. Secondly, what we have to do is to destroy the pairs. For example, the pairs (a, b), (c, d) and (e, f) in one permutation can become (c, b), (a, f) and (e, d). And this will happen many times, say R = 999. Then we have R pseudo-samples again. Everything else is the same as in the bootstrap case. A trick is that we need not change the order of both variables; permuting just one of them is enough. This will speed up the process. And guess what, it is faster than the bootstrap. It does not require the data to be transformed under the null hypothesis and you only need to permute one variable, in contrast to the bootstrap case, where you must resample from both variables.
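As a rough sketch of the idea (mine, for illustration; not the author's permutation function), the permutation p-value for two numeric vectors x and y could be computed as follows.
perm.cor.test <- function(x, y, R = 999) {
  ## a sketch of the permutation based p-value for a single correlation
  ## (hypothetical function and object names)
  r <- cor(x, y)
  test <- 0.5 * log( (1 + r) / (1 - r) )  ## Fisher's transformation of the observed r
  tb <- numeric(R)
  for (i in 1:R) {
    y1 <- sample(y)  ## permute only one of the two variables, destroying the pairs
    rb <- cor(x, y1)
    tb[i] <- 0.5 * log( (1 + rb) / (1 - rb) )
  }
  ( sum( abs(tb) >= abs(test) ) + 1 ) / (R + 1)  ## permutation p-value
}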
I believe though that this version with a for loop is faster than the vectorised version. A possible reason for this is that R cannot handle big matrices easily.
Suppose you have a (dependent) variable Y and a matrix of many variables X and you want to get all the correlations between Y and X_i for all i. If you type cor(y, x) in R you will get a vector of the correlations. What I offer here is a confidence interval for each of the correlations, the test statistic and the p-values for the hypothesis that each of them is equal to some value ρ. The p-values and test statistics are useful, for example, for meta-analysis, for combining the p-values into one, or for examining the false discovery rate (see the package fdrtool by Korbinian Strimmer).
## x is a matrix containing at least two variables
## type supported is either "pearson" or "spearman"
## a is the significance level
## rho is the hypothesised correlation
n <- length(y)
if (type == "pearson") {
r <- as.vector( cor(y, x) ) ## the correlation value between y and all the xs
zh0 <- 0.5 * log( (1 + rho) / (1 - rho) ) ## Fisher’s transformation for Ho
zh1 <- 0.5 * log( (1 + r) / (1 - r) ) ## Fisher’s transformation for H1
se <- 1/sqrt(n - 3) ## standard error for Fisher’s transformation of Ho
} else if (type == "spearman") {
r <- as.vector( cor(y, x, method = "spearman") ) ## the correlation value
## between y and all the xs
zh0 <- 0.5 * log( (1 + rho) / (1 - rho) ) ## Fisher transformation for Ho
zh1 <- 0.5 * log( (1 + r) / (1 - r) ) ## Fisher transformation for H1
se <- 1.029563 / sqrt(n - 3) ## standard error under Ho
}
test <- as.vector( (zh1 - zh0) / se ) ## test statistic
pvalue <- 2 * ( pt( -abs(test), n - 3 ) ) ## p-value
b1 <- zh1 - qt(1 - a/2, n - 3) * se
b2 <- zh1 + qt(1 - a/2, n - 3) * se
ca <- cbind(b1 ,b2)
ela <- exp( 2 * ca )
ci <- ( ela - 1 ) / ( ela + 1 ) ## confidence intervals
res <- cbind(r, pvalue, test, ci)
colnames(res) <- c( 'correlation', 'p-value', 'z-stat',
paste( c( a/2 * 100, (1 - a/2) * 100 ), "%", sep = "") )
if ( is.null(colnames(x)) ) {
rownames(res) <- paste("X", 1:dim(x)[2], sep = "")
} else rownames(res) <- colnames(x)
res
}
Below is the same function as above, only this time the p-value is produced via permutations and no confidence intervals are produced.
## R is the number of permutations
p <- dim(x)[2]
n <- length(y)
r <- as.vector( cor(y, x) )
test <- 0.5 * log( (1 + r) / (1 - r) ) ## the test statistic
m1 <- sum(y) ; m12 <- sum(y^2)
m2 <- Rfast::colsums(x) ; m22 <- Rfast::colsums(x^2)
up <- m1 * m2 / n
down <- sqrt( (m12 - m1^2 / n) * (m22 - m2^2 / n) )
sxy <- matrix(0, p, R)
for (i in 1:R) {
y1 <- sample(y, n)
sxy[, i] <- Rfast::colsums(y1 * x)
}
rb <- (sxy - up) / down
tb <- 0.5 * log( (1 + rb)/(1 - rb) ) ## the test statistic
pvalue <- ( Rfast::rowsums( abs(tb) > abs(test) ) + 1 ) / (R + 1)
res <- cbind( r, pvalue )
colnames(res) <- c('correlation', 'p-value')
if ( is.null(colnames(x)) ) {
rownames(res) <- paste("X", 1:dim(x)[2], sep = "")
} else rownames(res) <- colnames(x)
res
}
Suppose you want to calculate the correlation coefficient between two variables controlling for the effect of (or conditioning on) one or more other variables. That is, you want to calculate ρ̂(X, Y|Z), where Z is a matrix, since it does not have to be just one variable. This idea was captured by Ronald Fisher some years ago. To calculate it, one can use linear regression as follows.
1. Regress X on Z and keep the residuals ê_x.
2. Regress Y on Z and keep the residuals ê_y.
3. Calculate the correlation between ê_x and ê_y. This is the partial correlation coefficient between X and Y controlling for Z.
The standard error of the Fisher's transformation of the sample partial correlation is (Anderson, 2003)
SE\left[\frac{1}{2}\log\frac{1+\hat{\rho}\left(X,Y|\mathbf{Z}\right)}{1-\hat{\rho}\left(X,Y|\mathbf{Z}\right)}\right] = \frac{1}{\sqrt{n-d-3}},
where n is the sample size and d is the number of variables upon which we control. The
standard error is very similar to the one of the classical correlation coefficient. In fact, the
latter one is a special case of the first when d = 0 and thus there is no variable whose effect
is to be controlled. The R code below calculates the partial correlation coefficient, performs
hypothesis testing and calculates confidence intervals.
partial.corr <- function(y, x, z, type = "pearson", rho = 0, a = 0.05, plot = F) {
## y and x are the two variables whose correlation is of interest
## z is a set of variable(s), one or more variables,
## over which the conditioning takes place
## type accepts two values only, either "pearson" or "spearman"
## rho is the hypothesised correlation
## a is the significance level, set to 0.05 by default
n <- length(y) ## sample size
d <- dim(z)[2] ## dimensionality of z
res <- resid( lm( cbind(y, x) ~ z ) ) ## residuals of y and x on z
r <- cor(res, method = type)[1, 2] ## partial correlation of y and
## x conditioning on z
zh0 <- 0.5 * log( (1 + rho) / (1 - rho) ) ## Fisher’s transform for Ho
zh1 <- 0.5 * log( (1 + r) / (1 - r) ) ## Fisher’s transform for H1
if (type == "pearson") {
se <- 1/sqrt(n - d - 3) ## standard error under Ho
} else if ( type == "spearman" ){
se <- 1.029563 / sqrt(n - d - 3) ## standard error under Ho
}
test <- (zh1 - zh0)/se ## test statistic
pvalue <- 2 * ( 1 - pt( abs(test), n - d - 3 ) ) ## p-value
zL <- zh1 - qt(1 - a/2, n - d - 3) * se
zH <- zh1 + qt(1 - a/2, n -d - 3) * se
fishL <- (exp(2 * zL) - 1)/(exp(2 * zL) + 1) ## lower confidence limit
fishH <- (exp(2 * zH) - 1)/(exp(2 * zH) + 1) ## upper confidence limit
ci <- c(fishL, fishH)
names(ci) <- paste( c( a/2 * 100, (1 - a/2) * 100 ), "%", sep = "" )
r0 <- seq( max(-0.99, r - 0.2), min(0.99, r + 0.2), by = 0.001 )
z0 <- 0.5 * log( (1 + r0)/(1 - r0) ) ## Fisher’s transformation
## for many Hos
stat <- abs(zh1 - z0)/se ## test statistics
pval <- 2 * ( 1 - pt( abs(stat), n - d - 3 ) ) ## p-values
if (plot) {
par(mfrow = c(1, 2))
plot(r0, stat, type = "l", xlab = "Correlation values",
ylab = "Test statistic")
abline( h = qt(0.975, n - d - 3), col = 2 )
abline( v = min( r0[stat < qt(0.975, n - d - 3)] ), col = 3, lty = 3 )
abline( v = max( r0[stat < qt(0.975, n - d - 3)] ), col = 3, lty = 3 )
plot(r0, pval, type = "l", xlab = "Correlation values",
ylab = "P-values")
abline(h = a, col = 2)
abline( v = min( r0[pval > a] ), col = 3, lty = 3 )
abline( v = max( r0[pval > a] ), col = 3, lty = 3 )
}
result <- c(r, pvalue)
names(result) <- c('correlation', 'p-value')
list(result = result, ci = ci)
}
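A quick usage example with simulated data (the numbers are arbitrary and serve only as an illustration):
## example usage of partial.corr with simulated data
set.seed(1)
z <- matrix( rnorm(100 * 2), ncol = 2 )  ## two conditioning variables
x <- rnorm(100) + z[, 1]
y <- rnorm(100) + z[, 1]  ## x and y are correlated only through z[, 1]
partial.corr(y, x, z, type = "pearson", rho = 0, a = 0.05)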
Suppose you want to calculate the partial correlation matrix, where each coefficient has been conditioned on all the other variables. One way would be to use a for loop or a similar function and fill the matrix. Opgen-Rhein and Strimmer (2006) use a much more convenient and faster way. This way is implemented in the corpcor package written by Schaefer et al. (2007). The option for a Spearman based partial correlation is now available. The steps of the calculation are described here.
1. Calculate the correlation matrix and then change the sign of all its elements.
2. Calculate the inverse of the previous matrix. (Schaefer et al. (2007) use the Moore-Penrose inverse for this purpose, but I don't. They have other things in mind, more general than these.)
3. Change the sign of the diagonal back and standardise the resulting matrix so that it has a unit diagonal (this is what cov2cor does in the code below).
## it can of course be "kendall" but I have not examined it
## in other functions
r <- cor(x, method = type) ## correlation matrix of x
r2 <- - chol2inv( chol(r) )
diag(r2) <- -diag(r2)
cov2cor(r2)
}
The test statistic for the hypothesis of equality of two correlation coefficients is the following:
Z = \frac{\hat{z}_1-\hat{z}_2}{\sqrt{1/\left(n_1-3\right)+1/\left(n_2-3\right)}},
where ẑ1 and ẑ2 denote the Fisher’s transformation (4.1) applied to the two correlation co-
efficients and n1 and n2 denote the sample sizes of the two correlation coefficients. The
denominator is the sum of the variances of the two coefficients and as you can see we used a
different variance estimator than the one we used before. This function performs hypothesis
testing for the equality of two correlation coefficients. The result is the calculated p-value
from the standard normal distribution.
if (type == "pearson") {
test <- (z1 - z2) / sqrt( 1/(n1 - 3) + 1 / (n2 - 3) ) ## test statistic
} else if (type == "spearman") {
test <- (z1 - z2) / sqrt( 1.029563/(n1 - 3) + 1.029563 / (n2 - 3) )
}
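For completeness, a sketch of the full calculation, assuming the two correlations r1 and r2 and the two sample sizes n1 and n2 are available; the final p-value comes from the standard normal distribution.
## a sketch of the test for equality of two correlation coefficients (Pearson case)
z1 <- 0.5 * log( (1 + r1) / (1 - r1) )  ## Fisher's transformation of r1
z2 <- 0.5 * log( (1 + r2) / (1 - r2) )  ## Fisher's transformation of r2
test <- (z1 - z2) / sqrt( 1/(n1 - 3) + 1/(n2 - 3) )  ## test statistic
pvalue <- 2 * pnorm( -abs(test) )  ## two-sided p-value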
4.1.7 Squared multivariate correlation between two sets of variables
Mardia et al., 1979, pg. 171 defined two squared multiple correlation coefficients between the dependent variable Y and the independent variables X. They mention that these are analogues of the coefficient of determination in univariate regression. Assume that the multivariate regression model (more in Section 4.2) is written as
\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{U},
where U is the matrix of residuals. Then, they write \mathbf{D} = \left(\mathbf{Y}^T\mathbf{Y}\right)^{-1}\hat{\mathbf{U}}^T\hat{\mathbf{U}}, with \hat{\mathbf{U}}^T\hat{\mathbf{U}} = \mathbf{Y}^T\mathbf{P}\mathbf{Y} and P defined in (4.3). The matrix D is a generalisation of 1 − R^2 in the univariate case. Mardia et al., 1979, pg. 171 mention that the dependent variable Y has to be centred.
The squared multivariate correlation should lie between 0 and 1 and this property is satisfied by the trace correlation r_T and the determinant correlation r_D, defined as
r_T^2 = d^{-1}\text{tr}\left(\mathbf{I}-\mathbf{D}\right) \quad \text{and} \quad r_D^2 = \det\left(\mathbf{I}-\mathbf{D}\right),
respectively, where d denotes the dimensionality of Y. So, high values indicate that a high proportion of the variance of the dependent variables is explained. Alternatively, one can calculate the trace and the determinant of the matrix \mathbf{E} = \left(\mathbf{Y}^T\mathbf{Y}\right)^{-1}\hat{\mathbf{Y}}^T\hat{\mathbf{Y}}. Try something else also: use the sq.correl function in a univariate regression example and then calculate the R^2 for the same dataset. Try this example again, but without centring the dependent variable in the sq.correl function. In addition, take two variables, calculate their squared correlation coefficient (cor and then square it) and compare it with the output of the function below.
result
}
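Since only the tail of the sq.correl function appears above, here is a rough sketch of the calculation it describes, under the stated assumptions (Y centred and E = (Y^T Y)^{-1} Ŷ^T Ŷ). It is my own reconstruction and not necessarily identical to the author's function.
sq.correl2 <- function(y, x) {
  ## a sketch of the squared trace and determinant correlations
  y <- scale( as.matrix(y), center = TRUE, scale = FALSE )  ## centre the dependent variables
  x <- cbind(1, x)  ## design matrix with the vector of 1s
  yhat <- x %*% solve( crossprod(x), crossprod(x, y) )  ## fitted values
  E <- solve( crossprod(y), crossprod(yhat) )  ## E = (Y'Y)^{-1} Yhat'Yhat
  d <- dim(y)[2]
  result <- c( sum( diag(E) ) / d, det(E) )  ## squared trace and determinant correlations
  names(result) <- c("trace (r_T^2)", "determinant (r_D^2)")
  result
}
As the text suggests, for a univariate y this should essentially reproduce the usual R^2, and the squared correlation coefficient when x is a single variable.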
4.2 Regression
4.2.1 Classical multivariate regression
In this function we assume that both the dependent and independent variables can either
be vectors or matrices. The parameters of the independent variables are estimated through
maximum likelihood estimation procedures and the final formula is the following
\hat{\mathbf{B}} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{Y},
where X is the set of independent variables, or the design matrix, with the first column being the vector of 1s, and Y is the multivariate (or univariate) dependent variable. The covariance matrix of the estimated parameters is given by
\hat{\mathbf{V}}\left(\hat{\mathbf{B}}\right) = \hat{\boldsymbol{\Sigma}}_e \otimes \left(\mathbf{X}^T\mathbf{X}\right)^{-1},
where \hat{\boldsymbol{\Sigma}}_e = \frac{1}{n-p-1}\mathbf{Y}^T\mathbf{P}\mathbf{Y}, with
\mathbf{P} = \mathbf{I}_n - \mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T, \quad (4.3)
is the error covariance matrix. The sample size is denoted by n, p indicates the number of independent variables and ⊗ is the Kronecker product of two matrices.
In order to see if an observation is an outlier or leverage (influential) point several tech-
niques have been suggested in the literature. We will use a simple graphical procedure. We
will calculate the Mahalanobis distances of the residuals and of the observations in the X
space
D_{E_i} = \sqrt{\hat{\mathbf{e}}_i^T\hat{\boldsymbol{\Sigma}}_e^{-1}\hat{\mathbf{e}}_i} \quad \text{and} \quad D_{X_i} = \sqrt{\left(\mathbf{X}_i-\hat{\boldsymbol{\mu}}_X\right)^T\hat{\boldsymbol{\Sigma}}_{XX}^{-1}\left(\mathbf{X}_i-\hat{\boldsymbol{\mu}}_X\right)}. \quad (4.4)
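For instance, the two distances in (4.4) could be computed along these lines. This is a sketch only: res is assumed to hold the matrix of residuals of the fitted model, x the matrix of the independent variables without the column of 1s, and the chi-square based cut-offs are one common choice, assumed here rather than taken from the author's function.
## a sketch of the two Mahalanobis distances in (4.4) and some possible cut-offs
dres <- sqrt( Rfast::mahala( res, numeric( dim(res)[2] ), cov(res) ) )  ## residual distances
dx   <- sqrt( Rfast::mahala( x, Rfast::colmeans(x), cov(x) ) )          ## distances in the X space
crit.res <- sqrt( qchisq(0.975, dim(res)[2]) )  ## an assumed cut-off for the residuals
crit.x   <- sqrt( qchisq(0.975, dim(x)[2]) )    ## an assumed cut-off for the X space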
multivreg <- function(y, x, plot = TRUE, xnew = NULL) {
## y is the dependent variable and must be a matrix
## with at least two columns
## x contains the independent variable(s) which have to be
## in a matrix format or a vector if you have just one
if ( plot ) {
plot(dx, dres, xlim = c(0, max(dx) + 0.5), ylim = c(0, max(dres) + 0.5),
xlab = "Mahalanobis distance of x", ylab = "Mahalanobis distance
of residuals")
abline(h = crit.res)
abline(v = crit.x)
}
if ( is.null(xnew) ) {
est <- fitted(mod)
} else {
xnew <- cbind(1, xnew)
est <- xnew %*% coef(mod)
}
for (i in 1:d) {
wa <- as.matrix( coef(moda[[i]]) )
wa <- cbind( wa, wa[, 1] - qt(0.975, n - p - 1) * mse[i] ,
wa[, 1] + qt(0.975, n - p - 1) * mse[i] )
colnames(wa)[5:6] <- paste(c(2.5, 97.5), "%", sep = "")
suma[, , i] <- wa
r.squared[i] <- as.numeric( moda[[i]]$r.squared )
}
if ( is.null(colnames(y)) ) {
dimnames(suma) <- list( rownames(wa), colnames(wa),
paste("Y", 1:d, sep = "") )
names(r.squared) <- paste("Y", 1:d, sep = "")
colnames(est) <- paste("Y", 1:d, sep = "")
} else {
dimnames(suma) <- list( rownames(wa), colnames(wa), colnames(y) )
names(r.squared) <- colnames(y)
colnames(est) <- colnames(y)
}
Unfortunately, the function accepts only a design matrix (without the first column of ones). So, if you have categorical and continuous, or categorical only, independent variables, then the following steps will be useful. Suppose x1 consists of one or more continuous variables, i.e. it is either a vector or a matrix, and x2 is a categorical variable. Then you have to do the following in R to obtain the design matrix mat.
ff <- y ~ x1+ x2
m <- model.frame(ff)
mat <- model.matrix(ff, m)[, -1]
Then, you go to the next function and put mat in the place of x.
multivreg(y, mat)
Alternatively, if you have many independent variables and you cannot use the previous step, you can use the next function. But, bear in mind that it does not calculate outliers in the space of the independent variables. The next function, being more general than multivreg, offers bagging (bootstrap aggregation). Suppose you want to predict the values (Y) of some new data X_new (which could be your observed data, in which case you obtain bagged fitted values). Note that in this case I do no standardisation, so if you wanted to do standardisation you would have to be careful (see for example k-NN regression, presented later).
The idea is simple, you bootstrap your observed data (X, Y) and fit the linear model on
the bootstrapped sample. Use this model to obtain estimates \hat{\mathbf{Y}}_b for the \mathbf{X}_{new}. Repeat this B = 100 or B = 200 times and take the average of the estimates, \hat{\mathbf{Y}} = \frac{1}{B}\sum_{b=1}^B \hat{\mathbf{Y}}_b. This idea is
attributed to Breiman (1996). Note also, that the univariate R2 values, the coefficients and
their standard errors are calculated from the classical regression. Bootstrap aggregation is
applied only for the predictions.
esa <- array( dim = c(n, d, B) )
for (i in 1:B) {
ina <- sample(1:n, n, replace = TRUE)
mod <- lm(y[ina, ] ~ ., data = x[ina, ]) ## linear regression
esa[, , i] <- predict(mod, xnew) ## predict the xnew
}
est <- apply(esa, 1:2, mean)
}
moda <- summary(mod)
p <- nrow(coef(moda)[[1]])
suma <- array(dim = c(p, 6, d))
r.squared <- numeric(d)
mse <- deviance(mod)/( n - p - 1 )
for (i in 1:d) {
wa <- as.matrix( coef(moda[[i]]) )
wa <- cbind( wa, wa[, 1] - qt(0.975, n - p - 1) * mse[i],
wa[, 1] + qt(0.975, n - p - 1) * mse[i] )
colnames(wa)[5:6] <- paste(c(2.5, 97.5), "%", sep = "")
suma[, , i] <- wa
r.squared[i] <- as.numeric( moda[[i]]$r.squared )
}
if ( is.null(colnames(y)) ) {
dimnames(suma) <- list( rownames(wa), colnames(wa),
paste("Y", 1:d, sep = "") )
names(r.squared) <- paste("Y", 1:d, sep = "")
colnames(est) <- paste("Y", 1:d, sep = "")
} else {
dimnames(suma) <- list( rownames(wa), colnames(wa), colnames(y) )
names(r.squared) <- colnames(y)
colnames(est) <- colnames(y)
}
list(suma = suma, r.squared = r.squared, est = est)
}
This is a non-parametric regression which depends only upon the distances among the independent variables. It involves a tuning parameter, a choice of a free parameter, whatever you want to call it: that is k, the number of nearest neighbours. Hence, k-NN stands for k nearest neighbours. The dependent variable can be either univariate or multivariate, but the independent variables must be numerical and continuous.
The next code performs k-NN multivariate, or univariate if you have a univariate de-
pendent variable, regression for a given value of k. At first it standardises the independent
variables and then uses the same mean and standard deviation to scale the new (indepen-
dent variables) observations as well. If the xnew argument in the function is the same as x,
the fitted values will be returned.
y <- as.matrix(y)
d <- dim(y)[2] ## dimensions of y
p <- dim(x)[2] ## dimensions of x
n <- dim(y)[1]
xnew <- as.matrix(xnew)
xnew <- matrix(xnew, ncol = p)
nu <- dim(xnew)[1]
m <- Rfast::colmeans(x)
s <- Rfast::colVars(x, std = TRUE)
x <- t( ( t(x) - m ) / s ) ## standardize the independent variables
ina <- 1:n
if (p == 1) {
xnew <- (xnew - m) / s
} else {
s <- diag(1/s)
xnew <- ( xnew - rep(m, rep(nu, p)) ) %*% s ## standardize the xnew values
}
if (p == 1) {
x <- as.matrix(x)
x <- matrix(x, ncol = p)
xnew <- as.matrix(xnew)
xnew <- matrix(xnew, ncol = p)
}
apostasi <- dist(rbind(xnew, x), method = type, diag = TRUE, upper = TRUE)
apostasi <- as.matrix(apostasi)
est <- matrix(nrow = nu, ncol = dim(y)[2])
dis <- apostasi[1:nu, -c(1:nu)]
nam <- 1:n
est <- matrix(nrow = nu, ncol = d)
if (estim == "arithmetic") {
for (i in 1:nu) {
xa <- cbind(ina, dis[i, ])
qan <- xa[order(xa[, 2]), ]
a <- qan[1:k, 1]
yb <- as.matrix( y[a, ] )
est[i, ] <- Rfast::colmeans( yb )
}
if ( is.null(colnames(y)) ) {
colnames(est) <- paste("yhat", 1:d, sep = "" )
} else colnames(est) <- colnames(y)
if (d == 1) est <- as.vector(est)
est
}
A cross validation algorithm to choose the value of k is described below and after that
the relevant code is given below.
Since I am interested in prediction analysis I will use a K-fold cross-validation to choose the value of k. I split the data into K sets (folds). Every time I leave one set out and fit the model on the remaining sample (choosing the best value of k and so on). Then, I scale the test set, using the mean and standard deviation of the training set, and calculate the MSPE in order to measure the performance. This is repeated for all K sets (folds) of data and the average MSPE is computed.
But, since many models are being tested at every time (each value of k gives a different
model) the resulting performance is a bit biased, a bit overestimated. To overcome this,
nested cross-validation (Aliferis et al., 2010, Statnikov et al., 2005, Tsamardinos et al., 2014)
could be used, but since this is a computationally heavier design we rely on the method
suggested by Tibshirani and Tibshirani (2009), termed hereafter as TT.
Calculate the overall performance as the minimum of the average (over all folds) performance and keep the corresponding value of k which attains this minimum. Call this k∗. For each fold, subtract the best performance within that fold from the performance obtained with k∗. The estimated bias is the average of these differences. Finally, add this bias to the overall performance. The chosen, best value of k does not change; only the estimated performance changes.
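In code, the TT correction amounts to something like the following sketch. Here per is assumed to be the folds-by-values-of-k matrix of MSPEs produced by the cross-validation; the object names inside the author's function may differ.
## a sketch of the Tibshirani and Tibshirani (2009) bias correction
mspe <- colMeans(per)                             ## average MSPE for every value of k
best <- which.min(mspe)                           ## position of k*, the best value of k
estb <- mean( per[, best] - apply(per, 1, min) )  ## estimated bias
performance <- min(mspe) + estb                   ## bias-corrected MSPE estimate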
The function knnreg.tune has the following two features. At first, for all the different values of k, the training and test samples are always the same. Secondly, there is the option
of seed. If it is true, then no matter how many times we repeat the analysis, the split of the
folds is always the same and thus the results will be the same.
## harmonic mean of the closest observations.
y <- as.matrix(y)
n <- dim(y)[1]
d <- dim(y)[2]
if ( is.null(mat) ) {
nu <- sample(1:n, min( n, round(n / M) * M ) )
## It may be the case that this new nu is not exactly the same
## as the one specified by the user; when nu is folded into
## a matrix a warning message could appear, so it is suppressed
mat <- suppressWarnings( matrix( nu, ncol = M ) )
} else mat <- mat
M <- dim(mat)[2]
rmat <- dim(mat)[1]
per <- matrix(nrow = M, ncol = A - 1)
if (ncores == 1) {
for (vim in 1:M) {
ytest <- as.matrix( y[mat[, vim], ] ) ## test set dependent vars
ytrain <- as.matrix( y[-mat[, vim], ] ) ## train set dependent vars
xtrain <- as.matrix( x[-mat[, vim], ] ) ## train set independent vars
xtest <- as.matrix( x[mat[, vim], ] ) ## test set independent vars
for ( l in 1:c(A - 1) ) {
knn <- l + 1
est <- knn.reg(xtest, ytrain, xtrain, knn, type = type, estim = estim)
per[vim, l] <- sum( (ytest - est)^2 ) / rmat
}
}
} else {
require(doParallel, quiet = TRUE, warn.conflicts = FALSE)
cl <- makePSOCKcluster(ncores)
registerDoParallel(cl)
pe <- numeric(A - 1)
per <- foreach(i = 1:M, .combine = rbind, .export = "knn.reg") %dopar% {
## will always be the same
ytest <- as.matrix( y[mat[, i], ] ) ## test set dependent vars
ytrain <- as.matrix( y[-mat[, i], ] ) ## train set dependent vars
xtrain <- as.matrix( x[-mat[, i], ] ) ## train set independent vars
xtest <- as.matrix( x[mat[, i], ] ) ## test set independent vars
for ( l in 1:c(A - 1) ) {
knn <- l + 1
est <- knn.reg(xtest, ytrain, xtrain, knn, type = type, estim = estim)
pe[l] <- sum( (ytest - est)^2 ) / rmat
}
return(pe)
}
stopCluster(cl)
}
If you take a look at the above code you will see that each time, for every training and
test set splits I calculate the distance matrix. In the classification task using the k-NN, for
compositional data (Section for compositional data) you will see that I calculate the distance
matrix once and then I simply remove rows and columns corresponding to the test set. The
reason for this here is the scaling. I scale (standardize each variable) the training set and
then use those means and standard deviations to scale the test set. I am pretending that the
test set contains new data for which I know nothing. If I scaled the whole dataset from the
beginning that would induce positive bias, it would overestimate the performance of the
regression. I do not want that, I want to test and train my regression algorithm as fairly and
unbiasedly as possible.
Kernel regression is another form of non-parametric regression. But let us first see what the kernel is. To begin with, a good book for kernel density estimation is the one by Wand and Jones (1995). The book might seem difficult as an introduction, but once you get the hang of it you will appreciate its value. Another very good book is written by Tsybakov (2009).
The kernel function estimating the (univariate) density at a value x has this form
\hat{f}\left(x; h\right) = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{X_i-x}{h}\right). \quad (4.5)
An example of a kernel function is the standard normal. Thus, (4.5) can be written as
\hat{f}\left(x; h\right) = \frac{1}{nh\sqrt{2\pi}}\sum_{i=1}^n e^{-\frac{\left(X_i-x\right)^2}{2h^2}}. \quad (4.6)
There are many kernel functions in the literature. For this reason we also use another one, which is based on the L1 metric, denoted as the Laplacian kernel by Kim and Scott (2012)
\hat{f}\left(x; h\right) = \frac{c}{nh}\sum_{i=1}^n e^{-\frac{\left|X_i-x\right|}{h}}, \quad (4.7)
where c is a normalising constant.
The estimate of the dependent variable at a point x is then given by the local polynomial estimator
\hat{m}\left(\mathbf{x}, p, h\right) = \mathbf{e}_1^T\left[\mathbf{X}\left(\mathbf{x},p\right)^T\mathbf{W}_{\mathbf{x}}\mathbf{X}\left(\mathbf{x},p\right)\right]^{-1}\mathbf{X}\left(\mathbf{x},p\right)^T\mathbf{W}_{\mathbf{x}}\mathbf{Y}, \quad (4.8)
where \mathbf{e}_1 = \left(1, 0, \ldots, 0\right)^T. Let us now see what all these matrices are. Y is the n × q matrix of the dependent variables, where q denotes the dimensionality of Y. W_x is an n × n diagonal matrix containing the kernel functions of all the observations,
\mathbf{W}_{\mathbf{x}} = \text{diag}\left\{ K\left(\frac{X_1-\mathbf{x}}{h}\right), \ldots, K\left(\frac{X_n-\mathbf{x}}{h}\right) \right\}.
\mathbf{X}\left(\mathbf{x}, p\right) is an n × (p + 1) matrix of the independent variables defined as
\mathbf{X}\left(\mathbf{x}, p\right) = \begin{pmatrix} 1 & X_1 - x & \left(X_1-x\right)^2 & \ldots & \left(X_1-x\right)^p \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_n - x & \left(X_n-x\right)^2 & \ldots & \left(X_n-x\right)^p \end{pmatrix}.
We subtract the value x from every independent variable and all the sample values. Then
we decide on the degree p of the local polynomial. For this reason kernel regression is also
called local polynomial regression. The polynomial is applied locally to each point whose
dependent variable we want to estimate. In my R function I allow only for p = 0 and p = 1,
because I think it gets too complicated afterwards, especially as the number of variables
increases.
If p = 0 then we end up with the Nadaraya-Watson estimator (Nadaraya, 1964, Watson, 1964) and in this case (4.8) can also be written as (Tsybakov, 2009)
\hat{m}\left(x, 0, h\right) = \frac{\sum_{i=1}^n K\left(\frac{X_i-x}{h}\right)Y_i}{\sum_{i=1}^n K\left(\frac{X_i-x}{h}\right)} \ \text{ if } \ \sum_{i=1}^n K\left(\frac{X_i-x}{h}\right) \neq 0, \ \text{ and } \ \hat{m}\left(x, 0, h\right) = 0 \ \text{ otherwise}.
Another key thing we have to note is the choice of the bandwidth h. Since we are in the multivariate case, the bandwidth is a q × q matrix H, which involves many smoothing parameters; even for q = 2 we would need 4 of them. To keep it simple I made it \mathbf{H} = h^2\mathbf{I}_q, where \mathbf{I}_q is the identity matrix. Thus the kernel functions (4.6) and (4.7) are written as
\hat{f}\left(\mathbf{x}; h\right) = \frac{1}{nh^d\left(2\pi\right)^{d/2}}\sum_{i=1}^n e^{-\frac{\left\|\mathbf{X}_i-\mathbf{x}\right\|^2}{2h^2}} \quad \text{and} \quad \hat{f}\left(\mathbf{x}; h\right) = \frac{c}{nh^d}\sum_{i=1}^n e^{-\frac{\left\|\mathbf{X}_i-\mathbf{x}\right\|_1}{h}},
respectively, where \left\|\cdot\right\| stands for the Euclidean metric and \left\|\mathbf{x}-\mathbf{y}\right\|_1 = \sum_{i=1}^d \left|x_i-y_i\right|. Since
we are doing regression, note that the part which is outside the two sums cancels out.
Standardization of the independent variables is a must, I would say, and so I did it here. This means that this function allows only for continuous variables. The next code performs local polynomial regression for a given degree of the polynomial, which I restrict to be at most 1.
It estimates the value of the dependent variable (univariate or multivariate) based on mea-
surements from the continuous (only) independent variable(s). At first it standardises the
independent variables and then uses the same mean and standard deviation to scale the new
(independent variables) observations as well. This was also done by Lagani et al. (2013).
## X contains the independent variable(s)
## x are the new independent variable(s) values
## h is the bandwidth
## r is the degree of the local polynomial.
## r is set by default to 0. This corresponds to Nadaraya-Watson estimator
## type denotes the type of kernel to be used, ’gauss’ or ’laplace’
Y <- as.matrix(Y)
X <- as.matrix(X)
x <- as.matrix(x)
d <- dim(Y)[2]
p <- dim(X)[2]
n <- dim(Y)[1]
nu <- dim(x)[1]
m <- Rfast::colmeans(X)
s <- Rfast::colVars(X, std = TRUE)
tX <- ( t(X) - m ) / s ## standardize the independent variables
X <- t(tX)
tx <- ( t(x) - m ) / s ## standardize the x values
x <- t(tx)
XX <- tX
xx <- x
if (type == "gauss") {
h <- sqrt(h)
a1 <- 0.5 * Rfast::dista(x, X)^2 / h^2
} else a1 <- Rfast::dista(x, X, type = "manhattan" ) / h
z <- exp(- a1)
ta <- Rfast::rowsums(z)
mhx <- ( z %*% Y) / ta
mhx[ is.na(mhx) ] <- 0
if ( is.null(colnames(Y)) ) {
colnames(mhx) <- paste("yhat", 1:d, sep = "" )
} else colnames(mhx) <- colnames(Y)
if (d == 1) mhx <- as.vector(mhx)
}
mhx
}
My way to choose h is rather simple but it works. I use 1-fold cross validation in almost
the same manner that was described in the k-NN multivariate regression before. Instead of
choosing a value of k I choose a value of h and the algorithm contains more repetitions. But
apart from this, all the other steps are the same. The next code chooses the value of h for a
given local polynomial. This means, that one can change the order of the polynomial and
see if the MSPE is reduced.
If the option seed is true, then no matter how many times we repeat the analysis, the split between training and test samples is always the same and thus the results will be the same.
The same seed number is used in the functions knn.tune and pcr.tune. Thus, the MSPE for all
three methods is directly comparable.
Y <- as.matrix(Y)
n <- dim(Y)[1]
## the commands outside this function
if (ncores == 1) {
for (vim in 1:M) {
ytest <- as.matrix( Y[mat[, vim], ] ) ## test set dependent vars
ytrain <- as.matrix( Y[-mat[, vim], ] ) ## train set dependent vars
xtrain <- as.matrix( X[-mat[, vim], ] ) ## train set independent vars
xtest <- as.matrix( X[mat[, vim], ] ) ## test set independent vars
for ( j in 1:length(h) ) {
est <- kern.reg(xtest, ytrain, xtrain, h[j], type = type)
msp[vim, j] <- sum( (ytest - est)^2 ) / rmat
}
}
} else {
require(doParallel, quiet = TRUE, warn.conflicts = FALSE)
cl <- makePSOCKcluster(ncores)
registerDoParallel(cl)
pe <- numeric( length(h) )
msp <- foreach(i = 1:M, .combine = rbind, .export = "kern.reg") %dopar% {
ytest <- as.matrix( Y[mat[, i], ] ) ## test set dependent vars
ytrain <- as.matrix( Y[-mat[, i], ] ) ## train set dependent vars
xtrain <- as.matrix( X[-mat[, i], ] ) ## train set independent vars
xtest <- as.matrix( X[mat[, i], ] ) ## test set independent vars
for ( j in 1:length(h) ) {
est <- kern.reg(xtest, ytrain, xtrain, h[j], type = type)
pe[j] <- sum( (ytest - est)^2 ) / rmat
}
return(pe)
}
stopCluster(cl)
}
performance <- c( min(mspe) + estb, estb)
names(performance) <- c("MSPE", "Estimated bias")
list(mspe = mspe, h = which.min(mspe), performance = performance)
}
I decided to put this technique here (and not in a subsequent Section), in the regression context, since principal component analysis is used as a tool for regression. In short, the idea is that one can use principal component analysis on the independent variables in a unidimensional (the dependent variable is univariate) regression setting. A good reason to do so is either because there is a high number of independent variables and/or because there are collinearity problems: one or more of the continuous independent variables are highly correlated with other variables. This method has however some limitations (see for example Hadi and Ling, 1998).
The algorithm to perform principal components regression can be described as follows
1. At first standardize the independent variables. This way, \mathbf{X}^T\mathbf{X}, where X is the n × p design matrix which includes the p independent variables but not the intercept term, is proportional to the correlation matrix of the predictor variables. This is what ? does. The n stands for the sample size.
2. Perform eigen analysis on X T X and calculate the matrix of the eigenvectors V and the
scores Z = XV.
where σ2 is the conditional variance of the dependent variable calculated from the
classical multiple regression analysis based upon the given number of principal com-
ponents. It is the error variance, whose estimate is the (unbiased) mean squared error.
The key point is that we can have p different sets of estimated regression coefficients,
since we can use the first eigenvector (or principal component), the first two eigenvectors or
all of them. If we use all of them, then we end up with the same regression coefficients as if
we performed a classical multiple regression analysis. Below we provide a code to perform
principal component regression using from one to all the principal components and each
time the following objects are calculated: estimated regression coefficients, their correspond-
ing standard errors, mean squared error (also plotted), adjusted R^2 (also plotted). Note that the fitted values are calculated in the usual way, by multiplying the independent variables (and not the principal component scores) by their corresponding coefficients and adding the mean of the dependent variable.
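To illustrate the mechanics in a few lines, here is a sketch using the eigen decomposition directly, with the first k principal components; the author's pcr function below does the same thing more generally and more efficiently, so this is only an illustration with hypothetical names.
pcr.sketch <- function(y, x, k) {
  ## principal component regression with the first k components (illustrative sketch)
  my <- mean(y)
  yc <- y - my                               ## centre the dependent variable
  x <- scale(x)                              ## standardise the independent variables
  eig <- eigen( crossprod(x), symmetric = TRUE )
  vec <- eig$vectors[, 1:k, drop = FALSE]    ## first k eigenvectors
  z <- x %*% vec                             ## principal component scores
  gama <- solve( crossprod(z), crossprod(z, yc) )  ## regression on the scores
  be <- vec %*% gama                         ## coefficients on the standardised x scale
  est <- my + as.vector( x %*% be )          ## fitted values on the original y scale
  list(be = be, est = est)
}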
In addition, I have included an option for prediction at new X values. If the new X values are the same as the observed ones, then the classical fitted values will be returned. Note that the new X values are scaled using the mean and standard deviation of the observed X values, just like I did in the kernel regression (function kern.reg).
com <- ( zzk * crossprod(z, y) )
a <- t(vec) * as.vector(com)
be <- t( Rfast::colCumSums(a) ) ## PCA based coefficients
est <- NULL
if ( length(k) == 1 ) be <- be[, k, drop = FALSE]
if (!is.null(xnew)) {
xnew <- matrix(xnew, ncol = p)
xnew <- t( ( t(xnew) - m ) / s )
est <- xnew %*% be ## predicted values for PCA model
}
nam <- colnames(x)
if ( is.null(nam) ) nam <- paste("X", 1:p, sep = "")
rownames(be) <- nam
colnames(be) <- paste("PC", k1, sep = "")
if ( !is.null(est) ) colnames(est) <- paste("PC", k1, sep = "")
list(be = be, per = per[k1], vec = vec, est = est)
}
The next function is more for a visualization and exploration, rather than inference. It
shows the adjusted R2 values and the cumulative proportion of the eigenvalues as a function
of the number of principal components.
y <- as.vector(y)
m <- mean(y)
y <- y - m ## standardize the dependent variable
n <- dim(x)[1]
p <- dim(x)[2]
for (i in 1:p){
zzk <- crossprod( z[, 1:i] )
b <- vec[, 1:i] %*% solve( zzk, crossprod( z[, 1:i], y ) )
yhat[, i] <- as.vector( m + x %*% b )
r2[i] <- 1 - (n - 1)/(n - i - 1) * (1 - cor(y, yhat[, i])^2)
}
plot( 1:p, r2, ylim = c(0, 1), type = 'b', xlab = 'Number of principal components',
ylab = 'Adjusted R-squared and proportion of eigenvalues' )
points( 1:p, per, col = 2 )
lines( 1:p, per, col = 2, lty = 2 )
legend( p - 2, 0.4, c( expression( paste("Adjusted", R^2, sep = " ") ),
expression( lambda[i]/sum(lambda[i]) ) ), col = c(1, 2),
lty = c(1, 2), pch = c(1, 1) )
We saw how to perform principal component regression. We can choose the number of
principal components based on the maximum adjusted R2 value or the minimized mean
squared error. If no maximum or minimum is met, we can keep the number of components
after which these quantities do not change significantly. Alternatively we can use an m-fold
cross validation with the TT estimate of bias.
If the option seed is true, then no matter how many times we repeat the analysis, the split between training and test samples is always the same and thus the results will be the same.
The same seed number is used in the functions knn.tune and kern.tune. Thus, the MSPE for
all three methods is directly comparable.
## ncores specifies how many cores to use
n <- length(y) ## sample size
p <- dim(x)[2] ## number of independent variables
if ( maxk > p ) maxk <- p ## just a check
if ( is.null(folds) )
folds <- Compositional::makefolds(y, nfolds = nfolds,
stratified = FALSE, seed = seed)
nfolds <- length(folds)
if (ncores <= 1) {
} else {
cl <- parallel::makePSOCKcluster(ncores)
doParallel::registerDoParallel(cl)
er <- numeric(maxk)
if ( is.null(folds) )
folds <- Compositional::makefolds(y, nfolds = nfolds,
stratified = FALSE, seed = seed)
msp <- foreach::foreach(vim = 1:nfolds, .combine = rbind,
.packages = c("Rfast", "Compositional") ) %dopar% {
ytest <- y[ folds[[ vim ]] ] ## test set dependent vars
ytrain <- y[ -folds[[ vim ]] ] ## train set dependent vars
xtrain <- x[ -folds[[ vim ]], , drop = FALSE] ## train set indep vars
xtest <- x[ folds[[ vim ]], , drop = FALSE] ## test set indep vars
est <- pcr(ytrain, xtrain, k = 1:maxk, xnew = xtest)$est
er <- Rfast::colmeans( (est - ytest)^2 )
return(er)
}
parallel::stopCluster(cl)
I decided to include the binary logistic and Poisson regression in the principal components regression. The function below does the same as pcr, but with binary or count data as the dependent variable. In the case of count data, there is the option of the Poisson distribution. In any case the independent variables must be continuous. For the case of logistic regression the reader is referred to Aguilera et al. (2006).
eig <- eigen( crossprod(x), symmetric = TRUE ) ## eigen analysis of the design matrix
values <- eig$values ## eigenvalues
per <- cumsum( values / sum(values) ) ## cumulative proportion of eigenvalues
vec <- eig$vectors ## eigenvectors, or principal components
z <- x %*% vec ## PCA scores
if ( length(unique(y)) == 2 ) {
oiko <- "binomial"
} else oiko <- "poisson"
if ( !is.null(xnew) ) {
xnew <- matrix(xnew, ncol = p)
s <- Rfast::colVars(x, std = TRUE)
xnew <- t( ( t(xnew) - m ) / s ) ## standardize the xnew values
es <- as.vector( xnew %*% be ) + b[1]
} else es <- as.vector( x %*% be ) + b[1]
if (oiko == "binomial") {
est <- as.vector( exp(es) / (1 + exp(es)) )
} else est <- as.vector( exp(es) ) ## fitted values for PCA model
Again, a plot to visualize the principal components regression. It shows the deviance, the percentage of the drop in the deviance and the cumulative proportion of the eigenvalues as a function of the number of principal components.
p <- dim(x)[2]
x <- Rfast::standardise(x) ## standardize the independent variables
eig <- eigen( crossprod(x), symmetric = TRUE ) ## eigen analysis of the design matrix
values <- eig$values ## eigenvalues
per <- cumsum( values / sum(values) ) ## cumulative proportion of eigenvalues
vec <- eig$vectors ## eigenvectors, or principal components
z <- x %*% vec ## PCA scores
devi <- numeric(p)
if ( length(unique(y)) == 2 ) {
oiko <- "binomial"
} else oiko <- "poisson"
for (i in 1:p){
mod <- glm(y ~ z[, 1:i], family = oiko )
b <- coef(mod)
be <- vec[, 1:i] %*% as.matrix( b[-1] )
es <- as.vector( x %*% be ) + b[1]
if (oiko == "binomial") {
est <- as.vector( exp(es) / (1 + exp(es)) )
devi[i] <- -2 * sum( y * log(est) + (1 - y) * log(1 - est) )
} else {
est <- as.vector( exp(es) ) ## fitted values for PCA model
devi[i] <- 2 * sum( y * log( y / est ), na.rm = TRUE )
## fitted values for the PCA model
}
}
dev <- (mod$null.deviance - devi)/mod$null.deviance
result <- rbind(devi, dev, per)
rownames(result) <- c('Deviance', '% of deviance drop', 'Cumul prop')
colnames(result) <- paste('PC', 1:p, sep = ' ')
result
}
In order to tune the number of principal components we want, we will use cross validation. It is important now to define the error function we use. For example, in the univariate case we saw before that it is the mean squared error of prediction (MSPE), that is, the sum of squares of the residuals divided by the test sample size. We will do the same thing here but, instead of the classical residuals, we will calculate the deviance residuals, whose form depends upon the distribution used. For the binomial case, for instance, the deviance residual is r_i = s_i \sqrt{ -2\left[ y_i\log\hat{y}_i + \left(1-y_i\right)\log\left(1-\hat{y}_i\right) \right] }, where s_i = 1 if y_i = 1 and s_i = −1 if y_i = 0.
glmpcr.tune <- function(y, x, nfolds = 10, maxk = 10, folds = NULL, ncores = 1,
seed = FALSE, graph = TRUE) {
## y is the UNIVARIATE dependent variable
## y is either a binary variable (binary logistic regression)
## or a discrete variable (Poisson regression)
## x contains the independent variables
## nfolds is the number of folds to be used
## maxk is the maximum number of principal components to try
## folds can be a user supplied list with the folds;
## otherwise the folds are created internally
## if ncores==1, then 1 processor is used, otherwise more are
## used (parallel computing)
n <- dim(x)[1]
p <- dim(x)[2]
if ( maxk > p ) maxk <- p ## just a check
if ( is.null(folds) ) {
folds <- Compositional::makefolds(y, nfolds = nfolds,
stratified = FALSE, seed = seed)
}
nfolds <- length(folds)
msp <- matrix( nrow = nfolds, ncol = maxk )
## deigma will contain the positions of the test set
## this is stored but not showed in the end
## the user can access it though by running
## the commands outside this function
if ( length( Rfast::sort_unique(y) ) == 2 ) {
oiko <- "binomial"
} else oiko <- "poisson"
if (ncores <= 1) {
runtime <- proc.time()
for (vim in 1:nfolds) {
ytest <- y[ folds[[ vim ]] ] ## test set dependent vars
ytrain <- y[ -folds[[ vim ]] ] ## train set dependent vars
xtrain <- x[ -folds[[ vim ]], , drop = FALSE] ## train set independent vars
xtest <- x[ folds[[ vim ]], , drop = FALSE ] ## test set independent vars
vec <- prcomp(xtrain, center = FALSE)$rotation
z <- xtrain %*% vec ## PCA scores
for ( j in 1:maxk) {
if (oiko == "binomial") {
be <- Rfast::glm_logistic(z[, 1:j], ytrain)$be
} else {
be <- Rfast::glm_poisson(z[, 1:j], ytrain)$be
}
ztest <- xtest %*% vec[, 1:j, drop = FALSE] ## PCA scores
es <- as.vector( ztest %*% be[-1] ) + be[1]
if (oiko == "binomial") {
est <- as.vector( exp(es) / ( 1 + exp(es) ) )
ri <- -2 *( ytest * log(est) + (1 - ytest) * log(1 - est) )
} else {
est <- as.vector( exp(es) )
ri <- 2 * ytest * log(ytest / est)
}
msp[vim, j] <- sum( ri, na.rm = TRUE )
}
}
runtime <- proc.time() - runtime
} else {
runtime <- proc.time()
cl <- parallel::makePSOCKcluster(ncores)
doParallel::registerDoParallel(cl)
er <- numeric(maxk)
if ( is.null(folds) ) {
folds <- Compositional::makefolds(y, nfolds = nfolds,
stratified = FALSE, seed = seed)
}
msp <- foreach::foreach(vim = 1:nfolds, .combine = rbind,
.packages = "Rfast", .export = c("glm_logistic", "glm_poisson") ) %dopar% {
ytest <- y[ folds[[ vim ]] ] ## test set dependent vars
ytrain <- y[ -folds[[ vim ]] ] ## train set dependent vars
xtrain <- x[ -folds[[ vim ]], , drop = FALSE] ## train set independent vars
xtest <- x[ folds[[ vim ]], , drop = FALSE] ## test set independent vars
vec <- prcomp(xtrain, center = FALSE)$rotation
z <- xtrain %*% vec ## PCA scores
for ( j in 1:maxk) {
if (oiko == "binomial") {
be <- Rfast::glm_logistic(z[, 1:j], ytrain)$be
} else {
be <- Rfast::glm_poisson(z[, 1:j], ytrain)$be
}
ztest <- xtest %*% vec[, 1:j, drop = FALSE] ## PCA scores
es <- as.vector( ztest %*% be[-1] ) + be[1]
if (oiko == "binomial") {
est <- exp(es) / ( 1 + exp(es) )
ri <- -2 *( ytest * log(est) + (1 - ytest) * log(1 - est) )
} else {
est <- exp(es)
ri <- 2 * ytest * log(ytest / est)
}
er[j] <- sum( ri, na.rm = TRUE )
}
return(er)
}
stopCluster(cl)
runtime <- proc.time() - runtime
}
Ridge regression in the univariate case can be described as follows: minimise the sum of the squared residuals subject to the sum of the squared beta coefficients being less than a constant,
\min_{\alpha, \boldsymbol{\beta}} \sum_{i=1}^n\left(y_i - \alpha - \sum_{j=1}^p \beta_j x_{ij}\right)^2 \quad \text{subject to} \quad \sum_{j=1}^p \beta_j^2 \leq s,
where n and p denote the sample size and the number of independent variables respectively.
If we do the derivatives by hand, the formula for the beta coefficients is
\hat{\boldsymbol{\beta}}^{ridge} = \left(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}_p\right)^{-1}\mathbf{X}^T\mathbf{y},
where X contains the independent variables only, the first column is not the column of 1s.
It becomes clear that if λ = 0 we end up with the ordinary least squares (OLS) estimates.
The reason for ridge regression is multicollinearity. When there is multicollinearity among the covariates (X), the matrix \mathbf{X}^T\mathbf{X} will be (nearly) singular and thus the OLS betas cannot be estimated reliably.
Ridge regression is a regularised regression method because it regularises this matrix so that
it becomes invertible. Alternatively, one can use principal component regression we saw
before. The estimated betas will be biased, but at least we obtain an answer. If there is
no multicollinearity, ridge regression can still be used because ridge regression can lead to
better predicted values than the classical regression. In any case, the choice of the value of λ
is the key question.
In multivariate regression, the parameter λ becomes a matrix, but I saw that Brown and
Zidek (1980) use a scalar, so I will use a scalar also. The corresponding formula is the same,
but instead of the vector \boldsymbol{\beta} and the vector \mathbf{y} we have the matrices \mathbf{B} and \mathbf{Y}:
\hat{\mathbf{B}}^{ridge} = \left(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}_p\right)^{-1}\mathbf{X}^T\mathbf{Y}.
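The formula translates directly into a couple of lines of R, for instance as in this sketch (my illustration, not the author's function; x is assumed to be already standardised and y column-centred):
ridge.be <- function(y, x, lambda) {
  ## a sketch of the (multivariate) ridge estimator for a given lambda
  p <- dim(x)[2]
  solve( crossprod(x) + lambda * diag(p), crossprod(x, y) )
}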
y <- as.vector(y)
if ( all( y > 0 & y < 1 ) ){
y <- log(y / ( 1 - y) ) ## logistic normal
}
yy <- y - my ## center the dependent variables
mx <- Rfast::colmeans(x)
s <- Rfast::colVars(x, std = TRUE)
xx <- ( x - rep(mx, rep(n, p) ) ) %*% diag(1 / s) ## standardize the independent variables
if ( !is.null(xnew) ) {
xnew <- matrix(xnew, ncol = p)
## scale the xnew values
xnew <- ( xnew - rep(mx, rep(dim(xnew)[1], p) ) ) %*% diag( 1 / s )
est <- as.vector( xnew %*% beta + my )
} else est <- est
An alternative formula is given via Singular Value Decomposition (SVD) and this is what
I use here. We can write X, the matrix of the standardised independent variables as
\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T,
where D is a diagonal matrix containing the singular values (the square roots of the eigenvalues of \mathbf{X}^T\mathbf{X}). For more information on SVD see Sections 6.5 and 6.10. The beta coefficients can be written as
\hat{\boldsymbol{\beta}}_{\lambda} = \mathbf{V}\,\text{diag}\left(\frac{d_j}{d_j^2+\lambda}\right)\mathbf{U}^T\mathbf{y}.
The next function calculates the ridge coefficients for a range of values of λ and plots
them. This will work only when the dependent variable is univariate.
y <- as.vector(y)
plot(lambda, be[1,], type = "l", col = 1, lty = 1,
ylim = c( min(be), max(be) ), xlab = expression(paste(lambda, " values") ),
ylab = "Beta coefficients")
for (i in 2:p) lines(lambda, be[i, ], col = i, lty = i)
The next R function uses cross validation to choose the value of λ that minimizes the
mean squared error of prediction, in the same way we did for the principal component, the
k-NN and the kernel regression implemented before. My suggestion for saving time is to
search for λ in big steps, for example λ = 0, 0.5, 1, 1.5, 2, . . .. Then see on the plot where the minimum is obtained (for example between 0.5 and 1) and redo the search in that region with a smaller step, for example λ = 0.5, 0.51, 0.52, . . . , 1.
M <- dim(mat)[2]
rmat <- dim(mat)[1]
msp <- matrix( nrow = M, ncol = k)
if (ncores == 1) {
runtime <- proc.time()
for (vim in 1:M) {
ytest <- as.matrix( y[mat[, vim], ] ) ## test set dependent vars
ytrain <- as.matrix( y[-mat[, vim], ] ) ## train set dependent vars
my <- Rfast::colmeans(ytrain)
yy <- t( t(ytrain) - my ) ## center the dependent variables
xtrain <- as.matrix( x[ -mat[, vim], ] ) ## train set independent vars
mx <- Rfast::colmeans(xtrain)
xtest <- as.matrix( x[ mat[, vim], ] ) ## test set independent vars
s <- Rfast::colVars(xtrain, std = TRUE)
xtest <- t( ( t(xtest) - mx ) / s ) ## standardize the xtest
xx <- t( ( t(xtrain) - mx ) / s ) ## standardize the independent variables
sa <- svd(xx)
d <- sa$d ; v <- t(sa$v) ; tu <- t(sa$u)
d2 <- d^2 ; A <- d * tu %*% yy
for ( i in 1:k ) {
## beta <- ( v %*% (tu * d / ( d^2 + lambda[i] ) ) ) %*% yy
beta <- crossprod( v / ( d2 + lambda[i] ), A )
est <- xtest %*% beta + my
msp[vim, i] <- sum( (ytest - est)^2 ) / rmat
}
}
runtime <- proc.time() - runtime
} else {
runtime <- proc.time()
cl <- makePSOCKcluster(ncores)
registerDoParallel(cl)
pe <- numeric(k)
sa <- svd(xx)
d <- sa$d ; v <- t(sa$v) ; tu <- t(sa$u)
d2 <- d^2 ; A <- d * tu %*% yy
for ( i in 1:k ) {
## beta <- ( v %*% (tu * d / ( d^2 + lambda[i] ) ) ) %*% yy
beta <- crossprod( v / ( d2 + lambda[i] ), A )
est <- xtest %*% beta + my
pe[i] <- sum( (ytest - est)^2 ) / rmat
}
return(pe)
}
if ( graph ) {
plot(lambda, mspe, type = 'b', ylim = c(min(mspe), max(mspe)),
ylab = "Mean squared error of prediction",
xlab = expression(paste(lambda, " values")) )
}
Fisher's discriminant rule is a non-parametric linear rule. We need to find the first unit eigenvector (usually called λ, i.e. the eigenvector corresponding to the largest eigenvalue) of the matrix \mathbf{W}^{-1}\mathbf{B}, where W and B are the within and between sum of squares matrices respectively (Mardia et al., 1979, pg. 318-320). Then we use the mean of each group and the λ to allocate a new observation using the decision rule below.
Allocate an observation z to group i if and only if
\left|\boldsymbol{\lambda}^T\mathbf{z} - \boldsymbol{\lambda}^T\bar{\mathbf{x}}_i\right| = \min_{1\leq j\leq g}\left|\boldsymbol{\lambda}^T\mathbf{z} - \boldsymbol{\lambda}^T\bar{\mathbf{x}}_j\right|.
d <- dim(z)[2] ## dimensionality
znew <- as.matrix(znew)
znew <- matrix(znew, ncol = d)
nu <- dim(znew)[1]
ni <- tabulate(ina)
k <- length(ni) ## how many groups are there
zbar <- Rfast::colmeans(z)
mi <- rowsum(z, ina) / ni
B <- ni[1] * tcrossprod( mi[1, ] - zbar )
for (i in 2:k) B <- B + ni[i] * tcrossprod( mi[i, ] - zbar )
B <- B / (k - 1) ## the between sum of squares
m <- sqrt(n) * zbar
W <- crossprod(z) - tcrossprod(m) - B
M <- solve(W, B)
lambda <- as.vector( eigen(M)$vectors[, 1] ) ## Fisher’s discriminant
A <- as.vector( tcrossprod( lambda, znew ) )
A <- matrix(rep(A, each = k), nrow = nu, byrow = TRUE)
ma <- tcrossprod( lambda, mi)
crit <- abs( eachrow(A, ma, oper = "-") )
pred <- Rfast::rowMins(crit) ## the predicted group
list(lambda = lambda, pred = pred)
}
We have to note that in all cases robust estimation of the covariance and/or of the location is available within the MASS library. For the linear and quadratic discriminant analysis that can happen automatically, by choosing the robust option. In the regularised case, you will have to modify the estimates yourself so that the robust estimates are used.
Another option is to use the estimates obtained from the t distribution. Bear in mind that
even though the univariate t distribution has some robustness properties, the multivariate t
distribution is not as robust as many people think. We show how to estimate the parameters
under this model later on. In all the other cases, we leave these changes to the interested
reader.
The function below estimates the performance of the Fisher classifier using k-leave-out
cross validation with either stratified or simple random sampling for the test set.
## be used as the test sample
## R is the number of cross validations
ina <- as.numeric(ina)
g <- max(ina) ## how many groups are there
n <- dim(z)[1] ## sample size
k <- round(fraction * n) ## test set sample size
ni <- tabulate(ina)
num <- 1:n
p <- ni/n
esa <- as.vector( round(p * k) )
k <- sum(esa) ## test set sample size
p <- numeric(R)
deigma <- matrix(nrow = R, ncol = k)
## if seed==TRUE then the results will always be the same
if ( seed ) set.seed(123456)
for (i in 1:R) {
for (j in 1:g) {
ta <- sample( num[ina == j], esa[j] )
deigmata[j, ] <- c( ta, numeric( max(esa) - length(ta) ) )
}
deigma[i, ] <- as.vector( t(deigmata) )
}
} else {
for (i in 1:R) deigma[i, ] <- sample(1:n, k)
}
for ( i in 1:R ) {
xtest <- z[ deigma[i, ], ]
xtrain <- z[ -deigma[i, ], ]
gtrain <- ina[ -deigma[i, ] ]
gtest <- ina[ deigma[i, ] ]
est <- fisher.da(xtest, xtrain, gtrain)$pred
p[i] <- sum(est == gtest) / k
}
4.3.2 Repeated cross validation for linear and quadratic discriminant analysis
The built in functions in R for linear and quadratic discriminant analysis offer 1-fold (leave-one-out) cross validation. This function uses these built in functions to extend to repeated cross validation. The user specifies the value of the percentage for the splits and then the function removes this percentage (test sample) at random. It performs discriminant analysis on the remaining values (training sample) and then classifies the test sample. This is performed R = 1000 times by default and in the end an estimate of the distribution of the error is available. Thus, we can construct 3 types of confidence intervals. The first one uses the standard approach, where the standard deviation is calculated from the R = 1000 repetitions, and the second uses the binomial distribution. The third one is an empirical (or percentile) one, since it uses the 2.5% upper and lower quantiles of the distribution of the error. This function is more about training the two methods (linear and quadratic discriminant analysis) and seeing how well each of them performs. The bottom line is to select one over the other.
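The three confidence intervals described above can be sketched as follows. Here p is assumed to hold the R estimated rates of correct classification and k the test sample size; the exact form used inside the author's function may differ slightly.
## a sketch of the three 95% confidence intervals described above
per <- mean(p)  ## estimated rate of correct classification
ci1 <- per + c(-1, 1) * qnorm(0.975) * sd(p)                        ## standard approach
ci2 <- per + c(-1, 1) * qnorm(0.975) * sqrt( per * (1 - per) / k )  ## binomial approach
ci3 <- quantile(p, probs = c(0.025, 0.975))                         ## empirical (percentile)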
## fraction denotes the percentage of the sample to
## be used as the test sample
## R is the number of cross validations
## method denotes whether "lda" or "qda" is to be used
p <- numeric(R)
n <- dim(x)[1]
ina <- as.factor(ina)
k <- round(fraction * n) ## test sample size
## if seed==TRUE then the results will always be the same
if ( seed ) set.seed(1234567)
colnames(ci) <- c("2.5%", "97.5%")
rownames(ci) <- c("standard", "binomial", "empirical")
list(percentage = per, ci = ci)
We will show a simple procedure for model selection in quadratic discriminant analysis. The R code given below is made for quadratic discriminant analysis, but with a simple modification it can be applied to linear discriminant analysis as well.
It utilises the function kfold.da, where the split is 80% and 20% for the training and the test set respectively. The number of cross validations is set to 500 and the splits are always the same. But, as I mentioned before, these input parameters can easily be changed within the function.
The idea is simple and similar to the stepwise variable selection in multiple regression
analysis. Below is the algorithm explained.
1. Perform discriminant analysis bases on one variable only. The first chosen variable is
the one with the highest estimated rate of correct classification.
2. Next, we look for the second best variable. We try all of them (now we have two
variables included) and keep the variable, which combined with the first one, leads to
the highest estimated rate of correct classification.
3. We continue in the same way, at each step adding the variable which, combined with the already selected ones, leads to the highest estimated rate of correct classification.
4. We stop when the difference between two successive rates is less than or equal to a tolerance level (taken to be 0.001 or 0.1%).
There can be two cases, a) the rate keeps increasing by adding more variables. The toler-
ance level will prevent from adding more variables than necessary. And b) the rate at some
point will decrease. The tolerance level will see the change and will terminate the process.
For this reason I use a while function.
This is a simple model selection procedure and a faster one would be via the BIC. I am
just giving a method here and my purpose is to motivate the interested reader in learning
more about it. Also to make the reader aware of the model selection process in discriminant
analysis.
select.da <- function(x, ina, tol = 0.001) {
## x contains the data
## ina is the group indicator variable
## tol is the stopping difference between two successive rates
p <- dim(x)[2]
per <- numeric(p)
## STEP 1
est <- numeric(p)
z <- NULL
for (j in 1:length(est)) {
z1 <- x[, j]
est[j] <- kfold.da(z1, ina, fraction = 0.2, R = 100,
method = "qda", seed = TRUE)$percentage
}
per[1] <- max(est)
id <- which.max(est)
z <- cbind(z, x[, id])
z1 <- x[, -id]
## STEP 2
est <- numeric(p - 1)
for (j in 1:length(est)) {
z2 <- z1[, j]
est[j] <- kfold.da(cbind(z, z2), ina, fraction = 0.2, R = 100,
method = "qda", seed = TRUE)$percentage
}
per[2] <- max(est)
id <- which.max(est)
z <- cbind(z, z1[, id])
z1 <- z1[, -id]
## STEP 3 AND BEYOND
i <- 2
while (per[i] - per[i - 1] > tol) {
i <- i + 1
est <- numeric(p - i + 1)
for (j in 1:length(est)) {
z2 <- as.matrix(z1[, j])
est[j] <- kfold.da(cbind(z, z2), ina, fraction = 0.2, R = 100,
method = "qda", seed = TRUE)$percentage
}
per[i] <- max(est)
id <- which.max(est)
z <- cbind(z, z1[, id])
z1 <- as.matrix(z1[, -id])
}
per <- per[per > 0]
plot(per, type = "b", xlab = "Number of variables",
ylab = "Estimated correct rate")
list(percentage = per, vars = z)
}
We will use the Box-Cox transformation as an additional feature which can lead to better
classification results. This power transformation is defined as
y\left(\lambda\right) = \begin{cases} \frac{x^{\lambda}-1}{\lambda} & \text{if } \lambda \neq 0 \\ \log x & \text{if } \lambda = 0 \end{cases}
Note that x has to have strictly positive values if one uses the logarithm. When λ ≠ 0 this is not an issue, but if there are zero values, then λ has to be strictly positive.
presented below is a simple one. The first step is to apply the Box-Cox transformation for a
value of λ and then use the function kfold.da we saw before. This is repeated for a range of
values of λ and every time the estimated percentage of correct classification is saved. A plot
is also created for graphical visualization of the estimated percentage of correct classification
as a function of λ.
## Next is the Box-Cox transformation depending on the value of lambda
if (lambda[i] != 0) y <- (x ^ lambda[i] - 1) / lambda[i]
if (lambda[i] == 0) y <- log(x)
mod <- kfold.da(x = y, ina = ina, fraction = fraction, R = R,
method = method, seed = TRUE)
percent[i] <- mod$percentage
conf1[i, ] <- mod$ci[1, ]
conf2[i, ] <- mod$ci[2, ]
conf3[i, ] <- mod$ci[3, ]
names(percent) <- lambda
plot(lambda, percent, ylim = c( min(conf3[, 1]), max(conf3[, 2]) ),
type = 'b', col = 3, xlab = expression( paste(lambda, " values") ),
ylab = 'Estimated percentage of correct classification')
lines(lambda, conf3[, 1], lty = 2, lwd = 2, col = 2)
lines(lambda, conf3[, 2], lty = 2, lwd = 2, col = 2)
## the plot contains the 3rd type confidence limits also
rownames(ci) <- lambda
colnames(ci) <- c('2.5%', '97.5%')
## I show only the third type of confidence intervals
list(percentage = percent, ci = conf3)
}
Linear and quadratic discriminant analyses can be thought of as special cases of what is
called regularised discriminant analysis denoted by RDA(δ, γ) (Hastie et al., 2001). The
discriminant analysis in general has a rule. Every vector z is allocated to the group for
which the density of the vector calculated using the multivariate normal is the highest. The
algorithm is as follows:
• For every group i calculate the quantity π_i f_i(z).
• Allocate z to the group for which the above quantity takes the highest value.
The f i ( x ) is assumed a multivariate normal and πi = ni /n, where ni is the sample size of
the i-th group and n = n1 + ... + n g is the total sample size. The πi plays the role of the prior,
thus making the rule a naive Bayes classifier. Alternatively the first step of the algorithm
can be substituted by the logarithm of the density
\xi_i\left(\mathbf{z}\right) = -\frac{1}{2}\log\left|\mathbf{S}_i\right| - \frac{1}{2}\left(\mathbf{z}-\hat{\boldsymbol{\mu}}_i\right)^T\mathbf{S}_i^{-1}\left(\mathbf{z}-\hat{\boldsymbol{\mu}}_i\right) + \log\pi_i .
The vector z is allocated to the group with the highest value ξ i (z). The idea of RDA(δ, γ) is
to substitute the covariance matrix of each group (S_i) by a weighted average
\mathbf{S}_i\left(\delta,\gamma\right) = \delta\mathbf{S}_i + \left(1-\delta\right)\mathbf{S}\left(\gamma\right),
where \mathbf{S}\left(\gamma\right) = \gamma\mathbf{S}_p + \left(1-\gamma\right)s^2\mathbf{I}_d and
\mathbf{S}_p = \frac{\sum_{i=1}^g\left(n_i-1\right)\mathbf{S}_i}{n-g}
is the pooled covariance matrix.
The regularization of the pooled covariance matrix (S_p) is the one mentioned in Hastie et al. (2001). They used s^2\mathbf{I}, where s^2 = \frac{\text{tr}\left(\mathbf{S}_p\right)}{d} and d is the number of dimensions. Thus we end up with a general family of covariance matrices which is regularised by two parameters, δ and γ, each of which takes values between 0 and 1. When δ = 1 we end up with QDA, and if δ = 0 and γ = 1 we end up with LDA. The posterior probabilities of group allocation are calculated as follows
P\left(\mathbf{z}_i \in \text{group } j\right) = \frac{\pi_j f_j\left(\mathbf{z}_i\right)}{\sum_{l=1}^g \pi_l f_l\left(\mathbf{z}_i\right)}.
The code presented below accepts new observations and predicts their groups, for given values of γ and δ.
xnew <- as.matrix(xnew)
xnew <- matrix(xnew, ncol = D)
nu <- dim(xnew)[1] ## number of the new observations
ina <- as.numeric(ina)
ng <- tabulate(ina)
nc <- length(ng)
ta <- matrix(nrow = nu, ncol = nc)
ci <- 2 * log(ng / n)
sk <- vector("list", nc)
mesos <- rowsum(x, ina) / ng
sa <- 0
for (i in 1:nc) {
xi <- x[ina == i, ]
m <- sqrt(ng[i]) * mesos[i, ]
sk[[ i ]] <- ( crossprod(xi) - tcrossprod(m) )
sa <- sa + sk[[ i ]]
sk[[ i ]] <- sk[[ i ]] / (ng[i] - 1)
}
Sp <- sa/(n - nc)
sp <- diag( sum( diag( Sp ) ) / D, D ) ## spherical covariance matrix
Sa <- gam * Sp + (1 - gam) * sp ## regularised covariance matrix
for (j in 1:nc) {
Ska <- del * sk[[ j ]] + (1 - del) * Sa
ta[, j] <- ci[j] - log( det( Ska ) ) - Rfast::mahala( xnew, mesos[j, ], Ska )
## the scores are doubled for efficiency, I did not multiply by 0.5
}
We now show how to tune the parameters of the regularised discriminant analysis. The
idea is similar to that of all the techniques we have seen in this Section. The bias correction estimate
of Tibshirani and Tibshirani (2009) is again applied.
stratified = TRUE, seed = FALSE) {
## x contains the data
## gam is between pooled covariance and diagonal
## gam*Spooled+(1-gam)*diagonal
## del is between QDA and LDA
## del*QDA+(1-del)*LDA
## if ncores==1, then 1 processor is used, otherwise more are
## used (parallel computing)
## if a list with folds is supplied in the folds argument the results
## will always be the same. Leave it NULL otherwise
ina <- as.numeric(ina)
n <- dim(x)[1] ## total sample size
nc <- max(ina) ## number of groups
D <- dim(x)[2] ## number of variables
sk <- array( dim = c(D, D, nc) )
lg <- length(gam) ; ld <- length(del)
if ( is.null(folds) ) folds <- Compositional::makefolds(ina,
nfolds = nfolds, stratified = stratified, seed = seed)
nfolds <- length(folds)
if (ncores > 1) {
runtime <- proc.time()
group <- matrix(nrow = length(gam), ncol = length(del) )
cl <- parallel::makePSOCKcluster(ncores)
doParallel::registerDoParallel(cl)
if ( is.null(folds) ) folds <- Compositional::makefolds(ina,
nfolds = nfolds, stratified = stratified, seed = seed)
ww <- foreach(vim = 1:nfolds, .combine = cbind,
.export = c("mahala", "rowMaxs"), .packages = "Rfast") %dopar% {
test <- x[ folds[[ vim ]], , drop = FALSE] ## test sample
id <- ina[ folds[[ vim ]] ] ## groups of test sample
train <- x[ -folds[[ vim ]], ] ## training sample
ida <- ina[ -folds[[ vim ]] ] ## groups of training sample
na <- tabulate(ida)
ci <- 2 * log(na / sum(na) )
mesi <- rowsum(train, ida) / na
na <- rep(na - 1, each = D^2)
## the covariance matrix of each group is now calculated
for (m in 1:nc) sk[ , , m] <- Rfast::cova( train[ida == m, ] )
s <- na * sk
Sp <- colSums( aperm(s) ) / (sum(na) - nc) ## pooled covariance
sp <- diag( sum( diag( Sp ) ) / D, D )
gr <- matrix(nrow = length( folds[[ vim ]] ), ncol = nc)
for ( k1 in 1:length(gam) ) {
Sa <- gam[k1] * Sp + (1 - gam[k1]) * sp ## regularised covariance
for ( k2 in 1:length(del) ) {
for (j in 1:nc) {
Ska <- del[k2] * sk[, , j] + (1 - del[k2]) * Sa
gr[, j] <- ci[j] - log( det( Ska ) ) -
Rfast::mahala( test, mesi[j, ], Ska )
## I did not multiply with 0.5 for efficiency
}
g <- Rfast::rowMaxs(gr)
group[k1, k2] <- mean( g == id )
}
}
return( as.vector( group ) )
}
stopCluster(cl)
} else {
runtime <- proc.time()
per <- array( dim = c( lg, ld, nfolds ) )
for (vim in 1:nfolds) {
test <- x[ folds[[ vim ]], , drop = FALSE] ## test sample
id <- ina[ folds[[ vim ]] ] ## groups of test sample
train <- x[ -folds[[ vim ]], ] ## training sample
ida <- ina[ -folds[[ vim ]] ] ## groups of training sample
na <- tabulate(ida)
ci <- 2 * log(na / sum(na) )
mesi <- rowsum(train, ida) / na
na <- rep(na - 1, each = D^2)
## the covariance matrix of each group is now calculated
for (m in 1:nc) sk[ , , m] <- Rfast::cova( train[ida == m, ] )
s <- na * sk
Sp <- colSums( aperm(s) ) / (sum(na) - nc) ## pooled covariance
sp <- diag( sum( diag( Sp ) ) / D, D )
gr <- matrix(nrow = length( folds[[ vim ]] ), ncol = nc)
for ( k1 in 1:length(gam) ) {
Sa <- gam[k1] * Sp + (1 - gam[k1]) * sp ## regularised covariance
for ( k2 in 1:length(del) ) {
for (j in 1:nc) {
Ska <- del[k2] * sk[, , j] + (1 - del[k2]) * Sa
gr[, j] <- ci[j] - log( det( Ska ) ) -
Rfast::mahala( test, mesi[j, ], Ska )
## I did not multiply with 0.5 for efficiency
}
g <- Rfast::rowMaxs(gr)
per[k1, k2, vim] <- mean( g == id )
}
}
}
runtime <- proc.time() - runtime
}
4.3.6 Discriminant analysis with mixed data
In all the previous cases we saw how to discriminate between groups containing continuous
data, but we did not mention the case of mixed (categorical and continuous) or only
categorical data. This problem is solved by employing multinomial regression
(Agresti, 2002). Multinomial regression handles multinomial responses, i.e. responses
with more than two outcomes rather than just 0 or 1. It is based on the generalization of the
binomial distribution: from bin(omial) we go to multin(omial). The mass function of this
discrete distribution is
P(Y_1 = y_1, . . . , Y_k = y_k) = n!/(y_1! · · · y_k!) p_1^{y_1} · · · p_k^{y_k},
where ∑_{i=1}^k y_i = n and ∑_{i=1}^k p_i = 1. So, in our case, each Y_i can take one value at a time,
i.e. one outcome. Just like in the binomial distribution we can link the probabilities p_i to some
independent variables
p_i = 1 / (1 + ∑_{j=2}^k e^{Xβ_j})   if i = 1,   and   p_i = e^{Xβ_i} / (1 + ∑_{j=2}^k e^{Xβ_j})   if i = 2, . . . , k.
Again, just like binary regression, multinomial regression is a regression model, and, as it often
happens in statistics, it is exactly what we need here as well. For every observation,
the fitted value is a vector of k probabilities. We assign that observation to an
outcome (or group) according to the highest probability. If a fitted value has, for example,
the form ŷ_i = (0.1, 0.4, 0.5), then we assign to ŷ_i the value of 3 (the third outcome).
Bear in mind one important thing: the outcomes are nominal, i.e. the ordering
is not important. We do not say, for example, good, better, best; we say red, yellow, orange.
The ordering of the outcomes is totally arbitrary and has no effect on the data. One more
thing is that this method can be used when you have continuous data only, as well; it
is the most flexible of all with respect to the data used. In order to implement this type of
regression (for discrimination purposes) you could use the VGAM package. Alternatively,
you can use (and this is what I use) the nnet package.
## data it will still be turned into a data frame
y <- as.factor(y)
p <- dim(x)[2]
if ( is.null(colnames(x)) ) colnames(x) <- paste("X", 1:p, sep = "")
mod <- nnet::multinom(y ~ . , data = as.data.frame(x), trace = FALSE)
if ( !is.null(xnew) ) {
xnew <- as.data.frame(xnew)
colnames(xnew) <- colnames(x)
est <- predict(mod, xnew)
} else {
probs <- fitted(mod)
est <- Rfast::rowMaxs(probs)
}
list(suma = summary(mod), est = est)
}
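A small usage sketch follows. It assumes the function above is called mrda with arguments in the order (y, x, xnew), which is how it is called in the cross validation code further down, and it uses the iris data purely for illustration.
x <- as.matrix( iris[, 1:4] )
ina <- iris[, 5]
mod <- mrda(ina, x, xnew = x[1:10, ])
mod$est ## predicted groups for the 10 new observations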
In order to estimate its performance I have written another code which uses parallel
computing to speed up the cross validation procedure.
if (ncores > 1) {
runtime <- proc.time()
if ( seed ) set.seed( 12345 )
require(doParallel, quiet = TRUE, warn.conflicts = FALSE)
cl <- makePSOCKcluster(ncores)
registerDoParallel(cl)
ba <- rep(R/ncores, ncores)
p <- numeric(ba[1])
ww <- foreach(j = 1:ncores, .combine = cbind, .packages = "nnet",
.export= "mrda") %dopar% {
for (vim in 1:ba[j]) {
if ( strata ) {
deigma <- matrix(nrow = g, ncol = max(esa))
for (i in 1:g) {
ta <- sample(num[ina == i], esa[i])
deigma[i,] <- c( ta, numeric( max(esa) - length(ta) ) )
}
deigma <- as.vector(deigma)
} else deigma <- sample(1:n, k)
xtest <- x[deigma, ] ## test sample
id <- ina[deigma] ## groups of test sample
xtrain <- x[-deigma, ] ## training sample
ytrain <- ina[-deigma] ## groups of training sample
est <- mrda(ytrain, xtrain, xtest)$est
p[vim] <- sum(est == id)/k
}
return(p)
}
stopCluster(cl)
p <- as.vector(ww)
runtime <- proc.time() - runtime
} else {
runtime <- proc.time()
if ( seed ) set.seed( 12345 )
for ( vim in 1:R ) {
if ( strata ) {
deigma <- matrix(nrow = g, ncol = max(esa))
for (i in 1:g) {
ta <- sample(num[ina == i], esa[i])
deigma[i,] <- c( ta, numeric( max(esa) - length(ta) ) )
}
deigma <- as.vector(deigma)
} else deigma <- sample(1:n, k)
xtest <- x[deigma, ] ## test sample
id <- ina[deigma] ## groups of test sample
xtrain <- x[-deigma, ] ## training sample
ytrain <- ina[-deigma] ## groups of training sample
est <- mrda(ytrain, xtrain, xtest)$est
p[vim] <- sum(est == id)/k
}
runtime <- proc.time() - runtime
}
Suppose we have multinomial data and we know the different populations from which
they come. An example is text analysis. You have some documents by some
authors. For each document you know how many sentences the author has written and
consequently you know the number of words with 1, 2, 3, 4, 5 and 6 or more syllables. So,
for example, a document of author X has 100, 50, 150, 300, 100, 250 words of 1, 2, 3, 4, 5
and 6 or more syllables, with a total of 950 words. Imagine now that you have this kind of
information for all of his documents and for all the documents of the other authors. The target
is to discriminate between the authors. The difference now is that instead of continuous
data, we have discrete data.
Another way to think of this distribution is via the binomial. In the binomial distribution
you have two possible outcomes, in the multinomial you have three or more. In
football, for example, you may lose, win or draw. Another example is voting: 20% of the
voters would vote for party A, 30% for party B, 25% for party C and 25% for
party D. We pick 80 people at random. What is the probability that 30 will vote for party A,
20 for party B, 15 for party C and 15 for party D?
I will use the multinomial distribution for this purpose
f(x_1, . . . , x_K; p_1, . . . , p_K) = n!/(x_1! · · · x_K!) p_1^{x_1} · · · p_K^{x_K},
where ∑_{i=1}^K x_i = n, ∑_{i=1}^K p_i = 1 and K denotes the number of categories. If K = 2
we obtain the binomial distribution. Suppose we have an n × K matrix X of data. The
estimated probabilities p_i are given by

p̂_i = ∑_{j=1}^n x_{ij} / n,   for i = 1, . . . , K.
Suppose now that we have a new vector x = (x_1, . . . , x_K) and we want to know where
to allocate it: did it come from a multinomial population with parameters (probabilities)
(α_1, . . . , α_K) or from a population with parameters (β_1, . . . , β_K)?
After estimating the parameters of each of the two populations, we calculate the score
function

δ = ∑_{i=1}^K x_i log(α̂_i / β̂_i).

If δ > 0, we allocate x to the first population and to the second otherwise. When we have
more than two populations, we calculate λ_j = ∑_{i=1}^K x_i log α̂_{ij} for each of the g groups
and allocate x to the group with the highest score λ_j.
Alternatively, you may assume that each variable (or category) is independent of the
others and fit a product of Poisson distributions, that is, a separate Poisson distribution
for each category. Hence, assume that x ∼ PP(λ_1, . . . , λ_K) with

f(x_1, . . . , x_K; λ_1, . . . , λ_K) = ∏_{i=1}^K e^{−λ_i} λ_i^{x_i} / x_i!.
The PP and the multinomial will not differ much if the data are not overdispersed, i.e.
if they do come from a multinomial distribution. If there is overdispersion though, then a
better alternative is the Dirichlet-multinomial, which is presented in §5.1.9, but included as
an option for discrimination. The function multinom.da requires the group samples as they
are, the grouping information, the new data to be classified and the type of the distribution
used.
p <- dim(x)[2]
xnew <- matrix( xnew, ncol = p )
x <- as.matrix(x) ## makes sure x is a matrix
ina <- as.numeric(ina)
nu <- tabulate(ina)
g <- length( nu )
if (type == "multinom") {
## normalize the data, so that each observation sums to 1
y <- x / Rfast::rowsums(x)
m <- rowsum(y, ina) / nu
score <- tcrossprod( xnew, log(m) )
} else {
m <- matrix(nrow = g, ncol = p)
nun <- dim(xnew)[1] ## number of new observations
score <- matrix(nrow = nun, ncol = g)
rsnew <- Rfast::rowsums(xnew)
for (i in 1:g) {
A <- x[ina == i, ]
m[i, ] <- dirimultinom(A)$para
## Dirichlet-multinomial log-density of each new vector, up to a constant
score[, i] <- lgamma( sum(m[i, ]) ) - lgamma( rsnew + sum(m[i, ]) ) +
Rfast::rowsums( lgamma( xnew + rep( m[i, ], rep(nun, p) ) ) ) - sum( lgamma( m[i, ] ) )
}
}
Rfast::rowMaxs(score)
The next function estimates the performance of multinom.da. It again requires the group
samples as they are and the grouping information; the other arguments are relevant to the
replications. It will give an estimate of the rate of correct classification. Of course, different
test set sizes will give different percentages.
## the remaining data belong to the training set
## R is the number of cross validations to be performed
## type is either "multinom", "poisson" or "dirmult"
for (i in 1:R) {
test <- x[deigma[i, ], ]
id <- ina[deigma[i, ] ]
train <- x[-deigma[i, ], ]
ina2 <- ina[-deigma[i, ] ]
est <- multinom.da(test, train, ina2, type = type)
p[i] <- sum(est == id)/nu
}
per <- sum(p) / R
su <- sd(p)
conf <- quantile(p, probs = c(0.025, 0.975))
runtime <- proc.time() - runtime
list(per = per, su = su, conf = conf, runtime = runtime)
5 Distributions
5.1 Maximum likelihood estimation
5.1.1 Kullback-Leibler divergence between two multivariate normal populations
The Kullback-Leibler divergence (Kullback, 1997) between two multivariate normal populations
in R^d is equal to

KL(MN_1 || MN_2) = (1/2) [ tr(Σ_2^{−1} Σ_1) + (μ_2 − μ_1)^T Σ_2^{−1} (μ_2 − μ_1) − log(|Σ_1|/|Σ_2|) − d ].
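A minimal sketch of this formula in R follows; the function name kl.norm is a placeholder and the two populations are specified by their mean vectors and covariance matrices.
kl.norm <- function(m1, s1, m2, s2) {
  ## m1, s1: mean vector and covariance matrix of the first normal
  ## m2, s2: mean vector and covariance matrix of the second normal
  d <- length(m1)
  s2inv <- chol2inv( chol(s2) )  ## inverse of s2
  0.5 * ( sum( diag( s2inv %*% s1 ) ) +
          as.numeric( t(m2 - m1) %*% s2inv %*% (m2 - m1) ) -
          log( det(s1) / det(s2) ) - d )
}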
The density of the multivariate log-normal distribution is

f(y) = |2πΓ|^{−1/2} e^{−(1/2)(log y − ν)^T Γ^{−1} (log y − ν)} ∏_{i=1}^d y_i^{−1},

where the final product is the Jacobian of the transformation and ν and Γ are the mean vector and
covariance matrix of the logarithmically transformed data, i.e. of the underlying multivariate
normal distribution. The logarithm of the vector is taken component-wise, log y = (log y_1, . . . , log y_d),
and 0 < y_i < ∞ for i = 1, . . . , d.
The maximum likelihood estimators of the mean and the covariance of a multivariate
log-normal distribution are given by

E(Y) = μ = (μ_1, . . . , μ_d) = (e^{ν_1 + 0.5Γ_11}, . . . , e^{ν_d + 0.5Γ_dd})   and   Var(Y) = Σ,

where Σ_ij = e^{ν_i + ν_j + 0.5(Γ_ii + Γ_jj)} (e^{Γ_ij} − 1).
if ( is.null(colnames(x)) ) {
names(m) = colnames(s) = rownames(s) = paste("X", 1:d, sep = "")
} else names(m) = colnames(s) = rownames(s) = colnames(x)
list(m = m, s = s)
}
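A self-contained sketch of these closed-form estimators is given below; the output names m and s match the fragment above, while the function name mvlnorm.mle is a placeholder.
mvlnorm.mle <- function(x) {
  ## x contains the strictly positive data
  n <- dim(x)[1]  ;  d <- dim(x)[2]
  y <- log(x)
  v <- Rfast::colmeans(y)                ## nu, the mean vector of the logged data
  z <- Rfast::eachrow(y, v, oper = "-")
  G <- crossprod(z) / n                  ## Gamma, MLE of the covariance of the logged data
  m <- exp( v + 0.5 * diag(G) )          ## estimated mean vector of Y
  s <- outer(m, m) * ( exp(G) - 1 )      ## estimated covariance matrix of Y
  list(m = m, s = s)
}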
where the parameter ν is called the degrees of freedom; the mean vector is μ (defined for ν > 1)
and the variance matrix is ν/(ν − 2) Σ (defined for ν > 2).
Numerical optimization is again required to estimate the parameters and we have to say
that in the special case of ν = 1, the distribution is called multivariate Cauchy. The MASS
library in R offers estimation of the mean vector and covariance matrix of this distribution
for specific degrees of freedom. We have extended the cov.trob command to incorporate the
degrees of freedom and end up with the maximum likelihood estimates for all the parame-
ters.
The function will return the location and scatter matrix of the multivariate t distribution
along with the degrees of freedom (ν) and also the classical mean vector and covariance
matrix, which essentially are calculated assuming a multivariate normal. There is an option
to construct asymptotic 95% confidence intervals for the degrees of freedom. If the argument
plot is TRUE, the confidence intervals are presented graphically.
lik <- deg <- seq(max(1, df - 20), df + 20, by = 0.1)
for (i in 1:length(deg)) lik[i] <- mvt(y, deg[i])
plot(deg, lik, type = "l", xlab = "Degrees of freedom",
ylab = "Log likelihood")
b <- max(lik) - 1.92
abline(h = b, col = 2)
a1 <- min(deg[lik >= b])
a2 <- max(deg[lik >= b])
abline(v = a1, col = 3, lty = 2)
abline(v = a2, col = 3, lty = 2)
conf <- c(a1, a2)
names(conf) <- c("2.5%", "97.5%")
apotelesma <- list(center = result$center, scatter = result$cov,
df = dof, conf = conf, loglik = loglik, mesos = Rfast::colmeans(y),
covariance = cov(y))
}
apotelesma
}
Nadarajah and Kotz (2008) stated a few methods for estimating the scatter and location
parameters of the multivariate t distribution. One of them is the maximum likelihood esti-
mation via the EM algorithm. The log-likelihood, ignoring constant terms is written as
ℓ(μ, Σ, ν) = −(n/2) log|Σ| − ((ν + p)/2) ∑_{i=1}^n log(ν + s_i),

where n and p denote the sample size and number of dimensions respectively, ν are the
degrees of freedom and s_i = (y_i − μ)^T Σ^{−1} (y_i − μ). Differentiating with respect to μ and Σ
leads to the following estimating equations

μ = ∑_{i=1}^n w_i y_i / ∑_{i=1}^n w_i   and   Σ = ∑_{i=1}^n w_i (y_i − μ)(y_i − μ)^T / ∑_{i=1}^n w_i,

where w_i = (ν + p) / (ν + (y_i − μ)^T Σ^{−1} (y_i − μ)).
Repeating the three equations above iteratively until the full log-likelihood does not
change, is what the function below does.
## x contains the data
## v is the degrees of freedom, set to 5 by default
dm <- dim(x)
p <- dm[2] ; n <- dm[1] ## dimensions
if ( v == 1 ) {
R <- cov(x)
} else R <- abs( v - 1 ) / v * cova(x)
m <- Rfast::colmeans(x) ## initial parameters
con <- n * lgamma( (v + p)/2 ) - n * lgamma(v/2) - 0.5 * n * p * log(pi * v)
### step 1
wi <- (v + p) / ( v + Rfast::mahala(x, m, R) ) ## weights
y <- sqrt(wi) * ( Rfast::eachrow(x, m, oper = "-" ) )
sumwi <- sum(wi)
R <- crossprod(y) / sumwi ## scatter estimate
m <- Rfast::colsums(wi * x) / sumwi ## location estimate
dis <- Rfast::mahala(x, m, R)
el1 <- - n * log( det(R) ) - (v + p) * sum( log1p(dis/v) )
### step 2
wi <- (v + p) / ( v + dis ) ## weights
y <- sqrt(wi) * Rfast::eachrow(x, m, oper = "-" )
sumwi <- sum(wi)
R <- crossprod(y) / sumwi ## scatter estimate
m <- Rfast::colsums(wi * x) / sumwi ## location estimate
dis <- Rfast::mahala(x, m, R)
el2 <- - n * log( det(R) ) - (v + p) * sum( log1p(dis/v) )
## Step 3 and above
i <- 2
while ( el2 - el1 > tol ) { ## 1e-06 is the tolerance level
## between two successive values of the log-likelihood
i <- i + 1
el1 <- el2
wi <- (v + p) / ( v + dis) ## updated weights
y <- sqrt(wi) * Rfast::eachrow(x, m, oper = "-" )
sumwi <- sum(wi)
R <- crossprod(y) / sumwi ## updated scatter estimate
m <- Rfast::colsums(wi * x) / sumwi ## updated location estimate
dis <- Rfast::mahala(x, m, R)
el2 <- - n * log( det(R) )- (v + p) * sum( log1p(dis/v) )
} ## updated log-likelihood
list(iters = i, loglik = 0.5 * el2 + con, location = m, scatter = R)
}
I wrote "a", instead of "the", multivariate Laplace distribution because I am a bit confused
with its many representations. The one I will use here can be found in Eltoft et al. (2006)

f_p(y) = [2 / (λ (2π)^{p/2})] K_{p/2−1}(√((2/λ) q(y))) / (√((2/λ) q(y)))^{p/2−1},   (5.2)
where p stands for the number of variables and q(y) = (y − μ)^T Γ^{−1} (y − μ). The matrix Γ
is a covariance structure matrix with the constraint that its determinant is 1, det(Γ) = 1.
K_m(x) denotes the modified Bessel function of the second kind and order m, evaluated at x.
The support of this distribution is the whole of R^p.
I will only mention the first way of obtaining (moment) estimates for the parameters μ,
Γ and λ of (5.2). The sample mean is the estimate of the location parameter, μ̂ = ȳ. Let
R̂ = (1/(n − 1)) ∑_{i=1}^n (y_i − μ̂)(y_i − μ̂)^T be the unbiased sample covariance matrix of the n observations
y_i. Eltoft et al. (2006) divide by n and not n − 1. I did some simulations and saw that the
estimate of λ changes slightly, but the MSE and bias are slightly better with the unbiased
covariance matrix. Since det(Γ) = 1, λ̂ = [det(R̂)]^{1/p} and thus Γ̂ = (1/λ̂) R̂.
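A minimal sketch of these moment estimators follows; the function name mvlaplace.mom is a placeholder.
mvlaplace.mom <- function(y) {
  ## y contains the data
  p <- dim(y)[2]
  mu <- Rfast::colmeans(y)     ## location estimate
  R <- cov(y)                  ## unbiased sample covariance matrix
  lambda <- det(R)^(1/p)       ## since det(Gamma) = 1
  G <- R / lambda              ## estimate of Gamma
  list(mu = mu, lambda = lambda, G = G)
}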
5.1.5 Estimation of the parameters of an inverted Dirichlet distribution
A vector x ∈ R_+^p follows the inverted Dirichlet distribution if its density is

f(x; α) = (1/B(α)) ∏_{i=1}^p x_i^{α_i − 1} / (1 + ∑_{i=1}^p x_i)^{∑_{j=1}^{p+1} α_j},

where

B(α) = ∏_{i=1}^{p+1} Γ(α_i) / Γ(∑_{i=1}^{p+1} α_i)   and   α = (α_1, . . . , α_p, α_{p+1}).
All we do in the next R function is write down the log-likelihood of the inverted
Dirichlet density and, given a sample, maximise it with respect to the vector α (using nlm
and then optim).
da <- nlm(loglik, c( log( Rfast::colmeans(x) ), log( max(min(x), 3) ) ),
zx = zx, rsx = rsx, n = n, d = d, iterlim = 2000)
da <- nlm( loglik, da$estimate, zx = zx, rsx = rsx, n = n, d = d,
iterlim = 2000 )
da <- optim( da$estimate, loglik, zx = zx, rsx = rsx, n = n, d = d,
control = list(maxit = 20000), hessian = TRUE )
lik = -da$value ; a <- exp(da$par)
mu <- NULL
s <- NULL
if (a[d + 1] > 1) {
mu <- a[1:d]/(a[d + 1] - 1)
} else {
mu <- paste("Mean vector does not exist")
s <- paste("Covariance matrix does not exist")
}
if (a[d + 1] > 2) {
mu <- a[1:d]/(a[d + 1] - 1)
s <- matrix(nrow = d, ncol = d)
for (i in 1:d) {
for (j in 1:d) {
down <- (a[d + 1] - 1)^2 * (a[d + 1] - 2)
if (i == j)
s[i, j] <- a[i] * (a[i] + a[d + 1] - 1)/down
if (i != j)
s[i, j] <- a[i] * a[j]/down
}
}
5.1.6 Multivariate kernel density estimation
We saw kernel regression before but now we will see multivariate kernel density estimation.
The kernel to be used is again that of a multivariate standard normal distribution
K(x) = e^{−(1/2) x^T x} / (2π)^{p/2},
where p is the dimensionality of the data. This leads to the following kernel
estimator (Wand and Jones, 1995)

f̂(x; H) = [1 / (n ∏_{j=1}^p H_{jj}^{1/2})] ∑_{i=1}^n K(H^{−1/2}(x − X_i))
        = [1 / (n ∏_{j=1}^p H_{jj}^{1/2})] ∑_{i=1}^n e^{−(1/2)(x − X_i)^T H^{−1}(x − X_i)} / (2π)^{p/2},   (5.3)
where n is the sample size and h is the bandwidth parameter to be chosen. I have assumed
that H is a p × p diagonal matrix, H = h^2 I_p, where I_p is the p × p identity matrix. In the
next R function the user can either specify his own diagonal bandwidth matrix H, give only
a scalar (a matrix with the same element in the diagonal), or use a rule of thumb, either
Silverman's or Scott's. The Silverman and the Scott rules are

H_ii^{1/2} = (4/(p + 2))^{1/(p+4)} n^{−1/(p+4)} σ_i   and   H_ii^{1/2} = n^{−1/(p+4)} σ_i   respectively,
where σi is the standard deviation of the i-th variable. The idea is that for every x one must
calculate (5.3) n times, for all the observations in the sample. The following function does
that
if ( thumb == "silverman" ) {
s <- Rfast::colVars(x, std = TRUE)
h <- ( 4/(d + 2) )^( 1/(d + 4) ) * s * n^( -1/(d + 4) )
} else if ( thumb == "scott" ) {
s <- Rfast::colVars(x, std = TRUE)
h <- s * n^( -1/(d + 4) )
} else if ( thumb == "estim" ) {
h <- mkde.tune(x)$hopt
} else h <- h
if ( length(h) == 1 ) {
h <- diag( 1 / h, d )
} else h <- diag( 1 / h)
How does one choose h? What I have done here is maximum likelihood cross validation
(Guidoum, 2015). Actually this way was proposed by Habbema et al. (1974) and Robert
(1976). The cross validation idea is to replace fˆ ( x; h) in (5.3) by the leave-one-out estimator
f̂_{−i}(x_i; h) = [1 / ((n − 1) h^p)] ∑_{j ≠ i} K((x_i − X_j)/h).   (5.4)
The bandwidth h is then chosen as the value which maximises the log-likelihood cross validation criterion

MLCV(h) = (1/n) ∑_{i=1}^n log f̂_{−i}(x_i; h).   (5.5)
Wand and Jones (1995) mention that, since we have one value of h for all observations,
and not a different one for every variable (in fact it could be a bandwidth matrix H), we
should first transform the data so that they have unit covariance matrix and then try to
choose the value of h. The same is noted in Silverman (1986), who also mentions that, instead
of transforming the data, choosing h and then transforming them back to calculate the kernel
density, this is equivalent to using a kernel density estimate where the covariance matrix appears explicitly

f̂(x; h) = [1 / (n h^p |S|^{1/2})] ∑_{i=1}^n e^{−(1/(2h^2))(x − X_i)^T S^{−1}(x − X_i)} / (2π)^{p/2},   (5.6)
where S is the sample covariance matrix. So we use (5.6) in the calculations of (5.4) and (5.5) in
order to choose h and then use (5.3) in order to calculate the kernel density. Silverman (1986)
advises us to use a robust version of S. We have implemented the next code using the classical
covariance matrix, but there is the option to change it.
n <- dim(x)[1]
d <- dim(x)[2] ## sample and dimensionality of x
s <- cov(x) ## can put a robust covariance matrix here if you want
cv <- numeric( length(h) )
if (ncores > 1) {
require(doParallel, quiet = TRUE, warn.conflicts = FALSE)
oop <- options(warn = -1)
on.exit(options(oop))
val <- matrix(h, ncol = ncores) ## if the length of h is not equal to the
## dimensions of the matrix val a warning message should appear
## but you will not see it
cl <- makePSOCKcluster(ncores)
registerDoParallel(cl) ## make the cluster
ww <- foreach(j = 1:ncores, .combine = cbind) %dopar% {
ba <- val[, j]
for ( l in 1:length( val[, j] ) ) {
a <- a2a^( 1 / val[l, j]^2 )
f <- ds / (2 * pi) ^ (d/2) * ( 1 / val[l, j]^d ) *
( rowSums( a ) - 1 ) / (n - 1)
ba[l] <- mean( log(f) )
}
return(ba)
}
stopCluster(cl) ## stop the cluster
cv <- as.vector(ww)[ 1:length(h) ]
} else{
for ( j in 1:length( h ) ) {
a <- a2a^( 1 / h[j]^2 )
f <- ds / (2 * pi) ^ (d/2) * ( 1 / h[j]^d ) *
( rowSums( a ) - 1 ) / (n - 1)
cv[j] <- mean( log(f) )
}
}
if ( plot ) {
plot(h, cv, type = "l")
}
hopt <- h[ which.max(cv) ]
list(hopt = hopt, cv = cv)
}
I could have incorporated the mkde.tune function inside mkde, but I wanted to leave them as
two separate functions so that the user has better control of what is happening. An alternative
and faster function is given below. The idea is simple: there is one parameter, h, over
which we want to maximise an objective function. So, it is a one-dimensional optimisation
and the command optimize in R will do the rest.
z <- x %*% B
a2a <- Rfast::Dist( z, square = TRUE )
a2a <- exp(-0.5 * a2a)
ds <- 1 / prod(lam)^0.5
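The lines above prepare the standardised data and the pairwise distances. A minimal, self-contained sketch of how such quantities can feed optimize is given below; the name mkde.tune2 and the search interval are placeholders.
mkde.tune2 <- function(x, low = 0.1, up = 3) {
  n <- dim(x)[1]  ;  d <- dim(x)[2]
  s <- cov(x)
  eig <- eigen(s)
  lam <- eig$values
  B <- eig$vectors %*% ( t(eig$vectors) / sqrt(lam) )  ## S^(-1/2)
  z <- x %*% B                           ## data with unit covariance matrix
  a2a <- Rfast::Dist(z, square = TRUE)
  a2a <- exp(-0.5 * a2a)
  ds <- 1 / prod(lam)^0.5                ## 1 / |S|^(1/2)
  mlcv <- function(h) {
    a <- a2a^( 1 / h^2 )
    f <- ds / (2 * pi)^(d/2) * ( 1 / h^d ) * ( Rfast::rowsums(a) - 1 ) / (n - 1)
    mean( log(f) )
  }
  optimize(mlcv, c(low, up), maximum = TRUE)
}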
This is a discrete distribution; nevertheless, it falls within the multivariate context, even in
two dimensions. In order to generate values from this distribution one needs three
independent Poisson variables, X_1 ∼ Po(λ_1), X_2 ∼ Po(λ_2) and X_3 ∼ Po(λ_3). Then,
(X, Y) = (Y_1, Y_2) = (X_1 + X_3, X_2 + X_3) ∼ BP(λ_1, λ_2, λ_3). This is a way to simulate
random values from the bivariate Poisson, whose representation is given by
P(X = x, Y = y) = e^{−(λ_1 + λ_2 + λ_3)} (λ_1^x / x!) (λ_2^y / y!) ∑_{k=0}^{min(x,y)} (x choose k)(y choose k) k! (λ_3/(λ_1 λ_2))^k.   (5.7)
The above form is found in Karlis and Ntzoufras (2003). This bivariate distribution al-
lows for dependence between the two random variables. Marginally each random variable
follows a Poisson distribution with E ( X ) = λ1 + λ3 and E (Y ) = λ2 + λ3 . In addition,
Cov ( X, Y ) = λ3 . If λ3 = 0, (5.7) becomes a product of two Poisson distributions. Hence,
λ3 is a measure of dependence between the two random variables. For more information
about this distribution one can see Kocherlakota and Kocherlakota (1998) and for multivari-
ate generalisations see Karlis (2003) and Karlis and Meligkotsidou (2005).
The next R code relies on the following: the MLE of λ_1 + λ_3 and λ_2 + λ_3 are given by
∑_{i=1}^n x_i / n and ∑_{i=1}^n y_i / n respectively, where n denotes the sample size. Therefore,
we can rewrite (5.7) as a function of λ_3 only and thus maximise the log-likelihood with
respect to λ_3 only. Then λ̂_1 = ∑_{i=1}^n x_i / n − λ̂_3 and λ̂_2 = ∑_{i=1}^n y_i / n − λ̂_3.
Correlation coefficient
This form of the bivariate Poisson (5.7) allows for non negative correlation. The correlation
coefficient is given by
ρ(X, Y) = λ_3 / √((λ_1 + λ_3)(λ_2 + λ_3)).   (5.8)
The sample correlation coefficient comes from substituting the parameters by their sample
estimates. We had done some simulations (Tsagris et al., 2012) comparing the Pearson
correlation coefficient with the MLE based one (5.8) and we found that the Pearson behaves
very well in terms of coverage of the confidence interval. The thing with the Pearson correlation
coefficient is that it can take negative values as well, whereas (5.8) cannot. But it can
happen, even under this model, that a negative sample correlation is observed.
where

δ² = −(λ_1 + λ_2) + [(λ_1 + λ_3)(λ_2 + λ_3) − λ_3²] (τ − 1)   and
τ = ∑_{r=1}^∞ ∑_{s=1}^∞ P(X = r − 1, Y = s − 1)² / P(X = r, Y = s).
Hypothesis testing for H_0: λ_3 = 0 versus H_1: λ_3 > 0 is performed via the Wald test
statistic (Kocherlakota and Kocherlakota, 1998)

W = λ̂_3 / √(Var(λ̂_3)).

As for the variance of λ̂_3, we can take it from the covariance matrix we saw before or use
the observed information matrix (available from the optim function). Under the null hypothesis,
the asymptotic distribution of W is N(0, 1). Alternatively we can use the log-likelihood ratio test with a χ²_1 distribution.
In order for the log-likelihood ratio test not to reject H_0 at α = 5%, the following must hold
true: 2(ℓ_1 − ℓ_0) ≤ χ²_{1,0.95}, where ℓ_1 and ℓ_0 are the maximised log-likelihood values under
H_1 and H_0 respectively. Thus, an asymptotic 95% confidence interval for λ_3 consists of the
values of λ_3 for which the log-likelihood is more than ℓ_1 − χ²_{1,0.95}/2. Alternatively, asymptotic
standard normal confidence intervals can be used, since we know the variance of λ̂_3.
bp <- function(x1, x2, plot = FALSE) {
## x1 and x2 are the two variables
n <- length(x1) ## sample size
m1 <- sum(x1) / n ; m2 <- sum(x2) / n
## m1 and m2 estimates of lambda1* and lambda2* respectively
## funa is the function to be maximised over lambda3
ind <- Rfast::rowMins( cbind(x1, x2), value = TRUE )
max1 <- max(x1)
max2 <- max(x2)
mm <- max( max1, max2 )
mn <- min(max1, max2)
omn <- 0:mn
fac <- factorial( omn )
ch <- matrix(numeric( (mm + 1)^2 ), nrow = mm + 1, ncol = mm + 1 )
rownames(ch) <- colnames(ch) <- 0:mm
for ( i in 1:c(mm + 1) ) {
for ( j in c(i - 1):c(mm + 1) ) {
ch[i, j] <- choose(j, i - 1)
}
}
ly1 <- lgamma(x1 + 1)
ly2 <- lgamma(x2 + 1)
for (j in 1:n) {
f1[j] <- x1[j] * l1 - ly1[j] + x2[j] * l2 - ly2[j]
f2[j] <- log( sum( ch[ 1:c(ind[j] + 1), x1[j] ] *
ch[ 1:c(ind[j] + 1), x2[j] ] * fac[1:c(ind[j] + 1)] *
expo[ 1:c(ind[j] + 1) ] ) )
}
n * con + sum(f1) + sum( f2[abs(f2) < Inf] )
}
for (r in 1:ma) {
for (s in 1:ma) {
i <- 0:min(r, s)
comon <- factorial(i) * ( l3/( (m1 - l3) * (m2 - l3) ) )^i
f1[r, s] <- con + (r - 1) * log(m1 - l3) - lgamma(r) +
(s - 1) * log(m2 - l3) - lgamma(s) + log(sum( choose(r - 1, i) *
choose(s - 1, i) * comon ) )
if ( plot ) {
ci <- rbind( c(a1, a2), ci )
rownames(ci)[1] <- ’Log-likelihood’
}
list(lambda = lambda, rho = rho, ci = ci, loglik = loglik, pvalue = pvalue)
}
Kocherlakota and Kocherlakota (1998) mention the following goodness of fit test for the
bivariate Poisson distribution, the index of dispersion test. It is mentioned in Kocherlakota
and Kocherlakota (1998) that Loukas and Kemp (1986) developed this test as an extension of
the univariate dispersion test. They test for departures from the bivariate Poisson against
alternatives which involve an increase in the generalised variance, i.e. the determinant of the
covariance matrix of the two variables.
Rayner et al. (2009) mention a revised version of this test whose test statistic is now given
by
I_B^* = [n / (1 − r²)] [ S_1²/x̄_1 − 2 r² √((S_1²/x̄_1)(S_2²/x̄_2)) + S_2²/x̄_2 ],
where n is the sample size, r is the sample Pearson correlation coefficient, S_1² and S_2² are the
two sample variances and x̄_1 and x̄_2 are the two sample means. Under the null hypothesis
I_B^* follows asymptotically a χ² with 2n − 3 degrees of freedom. However, I did some
simulations and saw that it does not perform very well in terms of the type I error; the
simulations in Rayner et al. (2009, pg. 132) show this as well. For this reason,
the next R function calculates the p-value of I_B^* using Monte Carlo.
for (i in 1:R) {
z3 <- rpois(n, lambda[3])
z1 <- rpois(n, lambda[1]) + z3
z2 <- rpois(n, lambda[2]) + z3
r <- cor(z1, z2)
m1 <- sum(z1)/n ; m2 <- sum(z2)/n
s1 <- ( sum(z1^2) - n * m1^2 ) / (n - 1)
s2 <- ( sum(z2^2) - n * m2^2 ) / (n - 1)
tb[i] <- n/(1 - r^2) * ( s1 / m1 + s2 / m2 -
2 * r^2 * sqrt( s1 / m1 * s2 / m2 ) )
}
The function biv.gof given below is the vectorised version of bp.gof. It works in a similar
fashion to the one described for the bootstrap correlation coefficient. There are 5 terms required for the
correlation and then 4 of them are used in the test statistic. Note that, to calculate the correlation
coefficient, we have used an alternative form of the correlation formula, which suited
better here. The results are the same as before. biv.gof is much faster for small to moderate
sample sizes, but for bigger samples the time differences with bp.gof become smaller. In any
case, it is good to have it here, so that you can see this one as well and think how to vectorise (if
possible) your own functions.
v1 <- Rfast::colVars(z1) ; v2 <- Rfast::colVars(z2)
sxy <- Rfast::colsums(z1 * z2)
rb <- (sxy - n * m1 * m2) / ( (n - 1) * sqrt( v1 * v2 ) )
tb <- n/(1 - rb^2) * ( v1 / m1 + v2 / m2 - 2 * rb^2 *
sqrt( v1 / m1 * v2 / m2 ) )
Minka (2000) suggested a nice and fast way to estimate the parameters of the Dirichlet-multinomial
distribution. But, at first, let us see what this distribution is. Assume you have
multinomial data, as shown in Table 5.1, where each row is a vector. The multinomial is the
generalisation of the binomial to more than two categories. Note that for the multinomial
distribution the row sums are assumed to be the same; for the Dirichlet-multinomial
this is not the case.
In sum, the vector of parameters a can be estimated via the fixed-point iteration

a_i^{(k+1)} = a_i^{(k)} [ ∑_{j=1}^n ψ(x_{ij} + a_i) − n ψ(a_i) ] / [ ∑_{j=1}^n ψ(∑_{i=1}^D x_{ij} + ∑_{i=1}^D a_i) − n ψ(∑_{i=1}^D a_i) ],
where ψ(y) = (log Γ(y))′ = Γ′(y)/Γ(y) is the digamma function.
dirimultinom2 <- function(x, tol = 1e-07) {
## x is the data, integers
## tol is the tolerance level, set to 10^(-7) by default
runtime <- proc.time()
p <- dim(x)[2] ## dimensionality
n <- dim(x)[1] ## sample size
lik <- NULL
rs <- Rfast::rowsums(x)
a1 <- Rfast::colmeans(x)
sa <- sum(a1)
x <- t(x)
y <- x + a1
lik[1] <- n * lgamma( sa ) - sum( lgamma( rs + sa ) ) +
sum( lgamma( y ) ) - n * sum( lgamma( a1 ) )
## first fixed-point update so that the while loop below can start
up <- Rfast::rowsums( digamma( x + a1 ) ) - n * digamma(a1)
down <- sum( digamma( rs + sa ) ) - n * digamma( sa )
a2 <- a1 * up / down
sa <- sum(a2)
lik[2] <- n * lgamma( sa ) - sum( lgamma( rs + sa ) ) +
sum( lgamma( x + a2 ) ) - n * sum( lgamma( a2 ) )
i <- 2
while ( lik[i] - lik[i - 1] > tol ) {
i <- i + 1
a1 <- a2
up <- Rfast::rowsums( digamma( x + a1 ) ) - n * digamma(a1)
down <- sum( digamma( rs + sa ) ) - n * digamma( sa )
a2 <- a1 * up / down
sa <- sum(a2)
lik[i] <- n * lgamma( sa ) - sum( lgamma( rs + sa ) ) +
sum( lgamma( x + a2 ) ) - n * sum( lgamma( a2 ) )
}
runtime <- proc.time() - runtime
list(runtime = runtime, iter = i, loglik = lik[i], param = a2)
}
rs <- Rfast::rowsums(x)
a1 <- Rfast::colmeans(x)
x <- t(x)
y <- x + a1
sa <- sum(a1)
lik1 <- n * lgamma( sa ) - sum( lgamma( rs + sa ) ) - n * sum( lgamma( a1 ) ) +
sum( lgamma( y ) )
f <- n * digamma(sa) - sum( digamma(rs + sa) ) - n * digamma(a1) +
rowsums( digamma(y) )
f2 <- matrix(n * trigamma(sa) - sum( trigamma(rs + sa) ), p, p)
diag(f2) <- diag(f2) - n * trigamma(a1) + rowsums( trigamma(y) )
a2 <- a1 - solve(f2, f)
sa <- sum(a2)
lik2 <- n * lgamma( sa ) - sum( lgamma( rs + sa ) ) - n * sum( lgamma( a2 ) ) +
sum( lgamma( x + a2 ) )
i <- 2
while ( sum( lik2 - lik1 ) > tol ) {
i <- i + 1
lik1 <- lik2
a1 <- a2
y <- x + a1
f <- n * digamma(sa) - sum( digamma(rs + sa) ) - n * digamma(a1) +
rowsums( digamma(y) )
f2 <- matrix(n * trigamma(sa) - sum( trigamma(rs + sa) ), p, p)
diag(f2) <- diag(f2) - n * trigamma(a1) + rowsums( trigamma(y) )
a2 <- a1 - solve(f2, f)
sa <- sum(a2)
lik2 <- n * lgamma( sa ) - sum( lgamma( rs + sa ) ) - n * sum( lgamma( a2 ) ) +
sum( lgamma( x + a2 ) )
}
The previous function gives rise to a way to simulate from a multivariate normal with some
specific parameters. The idea is simple. Suppose we want to generate n values from a p-variate
normal with parameters μ and Σ using the rnorm function only. The algorithm is
described below
1. Perform the spectral decomposition of the covariance matrix Σ = Vdiag(λ_1, . . . , λ_p)V^T.
2. Take the square root of the covariance matrix Σ^{1/2} = Vdiag(λ_1^{1/2}, . . . , λ_p^{1/2})V^T.
3. Generate n × p independent standard normal values using rnorm.
4. Put the generated values in a matrix with n rows and p columns randomly. We will
call this matrix X.
5. Construct Y = XΣ^{1/2} + μ.
The rows of the Y matrix follow the multivariate normal with the specified parameters.
Bear in mind that the covariance matrix needs to be of full rank. The function is faster than
the mvrnorm function in the MASS library.
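A minimal sketch of the algorithm above follows; the name rmvnorm2 is a placeholder.
rmvnorm2 <- function(n, mu, sigma) {
  p <- length(mu)
  eig <- eigen(sigma)
  lam <- eig$values  ;  V <- eig$vectors
  A <- V %*% ( t(V) * sqrt(lam) )        ## sigma^(1/2)
  X <- matrix( rnorm(n * p), ncol = p )  ## steps 3 and 4
  X %*% A + rep( mu, rep(n, p) )         ## step 5, Y = X sigma^(1/2) + mu
}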
5.2.2 Random values generation of covariance matrices (Wishart distribution)
I have written a simple code to generate covariance matrices based on the Wishart distri-
bution. If X_i ∼ N_p(0, Σ), then A = ∑_{i=1}^ν X_i X_i^T follows a p-variate Wishart distribution with
parameters ν and Σ, A ∼ W_p(Σ, ν) (Anderson, 2003). Its density is given by

f(A) = |A|^{(ν−p−1)/2} e^{−(1/2) tr(Σ^{−1}A)} / [ 2^{νp/2} π^{p(p−1)/4} |Σ|^{ν/2} ∏_{i=1}^p Γ((ν + 1 − i)/2) ].
The algorithm to generate covariance matrices from a Wishart distribution with expected
value equal to νΣ is
1. Generate ν random vectors X_i from a N_p(0, Σ). Note that ν must be greater than p, so
if you have more dimensions than ν, change this number.
2. The matrix A = ∑_{i=1}^ν X_i X_i^T is a random matrix from a Wishart distribution with ν
degrees of freedom.
The ν parameter is the degrees of freedom of the Wishart distribution. Put differently, if you
stack a sample of ν multivariate normal vectors in a matrix X and calculate X^T X, the degrees of
freedom of the resulting Wishart matrix is ν. There is a similar built-in function
in R, rWishart(n, df, Sigma). There is also a function in the package bayesm written by
Rossi (2015) for those who are more interested.
}
sim
}
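A minimal sketch of this generation scheme is given below; the name rwishart2 is a placeholder.
rwishart2 <- function(n, nu, sigma) {
  ## n random Wishart matrices with nu degrees of freedom and scale sigma
  p <- dim(sigma)[2]
  A <- chol(sigma)
  sim <- array( dim = c(p, p, n) )
  for (i in 1:n) {
    x <- matrix( rnorm(nu * p), ncol = p ) %*% A  ## nu draws from N_p(0, sigma)
    sim[, , i] <- crossprod(x)                    ## sum of the X_i X_i^T
  }
  sim
}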
There is a command available through the mvtnorm package for generating from a multi-
variate t distribution with some given parameters. We also provide a function for doing
that.
The basic relationship one needs to generate values from a multivariate t distribution
with parameters µ , Σ and ν is the following
Y = μ + √(ν/χ²_ν) Σ^{1/2} Z,
where Z ∼ Np 0, Ip . So, basically, the algorithm is the same as in the multivariate normal
distribution. The difference is the extra parameter ν.
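A minimal sketch based on this relationship follows; the name rmvt2 is a placeholder.
rmvt2 <- function(n, mu, sigma, v) {
  p <- length(mu)
  z <- matrix( rnorm(n * p), ncol = p ) %*% chol(sigma)  ## N_p(0, sigma) values
  w <- sqrt( v / rchisq(n, v) )                          ## sqrt( nu / chi^2_nu )
  w * z + rep( mu, rep(n, p) )
}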
We will now provide the code to generate random values from the multivariate Laplace
distribution whose density is given in (5.2). The basic equation is (Eltoft et al., 2006)

Y = μ + √W Γ^{1/2} X,

where μ is the mean vector, Γ is a covariance type matrix whose determinant is 1, X follows
a multivariate normal distribution with zero mean vector and identity covariance matrix,
and W follows a univariate exponential distribution. So basically we need a multivariate
normal with a mean vector and a covariance matrix, plus a univariate exponential
distribution, in order to generate values from the multivariate Laplace distribution (5.2).
rmvlaplace <- function(n, lam, mu, G) {
## n is the sample size
## lam is the parameter of the exponential distribution
## mu is the mean vector
## G is a d x d covariance matrix with determinant 1
if ( abs( det(G) - 1 ) > 1e-8 ) {  ## the determinant of G must be (numerically) 1
y <- paste("The determinant of the covariance matrix is not 1.")
} else {
d <- length(mu) ## dimensionality of the data
z <- rexp(n, lam)
x <- matrix( RcppZiggurat::zrnorm(n * d), ncol = d )
y <- sqrt(z) * x %*% chol(G) + rep(mu, rep(n, d) )## the simulated sample
}
y
}
We will provide a function to obtain the parameters of the fitted distribution, plot the bivariate
data and then add contour lines on the same plot. For the t distribution we require the
MASS library and the function we presented before to calculate its associated parameters.
The idea is to take a grid of points along the two axes and, for each point, to calculate the
value of the fitted density. Then, use the built-in R function contour and that's it.
An alternative distribution which can also be used to model compositional data is the multi-
variate skew-normal distribution (Azzalini and Valle, 1996). The density of the skew-normal
distribution is (Azzalini, 2005)
f_d(y) = [2 / |2πΩ|^{1/2}] e^{−(1/2)(y − ξ)^T Ω^{−1} (y − ξ)} Φ(α^T ω^{−1} (y − ξ)),   (5.9)
where Φ(·) is the cumulative distribution of the standard normal distribution, ω is the diagonal
matrix containing the square roots of diag(Ω) and α is the shape parameter (α ∈ R^d).
If α = 0, then we end up with the multivariate normal distribution. The parameter δ_i
is related to the i-th skewness coefficient as well. The skew normal can only model low
skewness, since the skewness coefficient cannot exceed the value 0.99527 in absolute value.
Thus, for the numerical maximization of the log-likelihood of (5.9), good initial values for
the vector δ are the skewness coefficients. If any of the coefficients exceeds the cut-off value
0.99527, in either direction, the initial starting value is set equal to this value.
The expected value and variance matrix of the skew-normal distribution are
E(Y) = ξ + √(2/π) ωδ and Var(Y) = Ω − (2/π) ωδδ^T ω.
## the user must make sure he/she has bivariate data. If the data are
## not bivariate the function will not work
## the default distribution is normal, but there are other options, such as
## t and skew normal
m <- Rfast::colmeans(x) ## mean vector
s <- cov(x) ## covariance matrix
n1 <- 100
n2 <- 100
x1 <- seq( min(x[, 1]) - 0.5, max(x[, 1]) + 0.5, length = n1 )
x2 <- seq( min(x[, 2]) - 0.5, max(x[, 2]) + 0.5, length = n1 )
## if for example the y axis is longer than the x axis, then you might
## want to change n2
if (type == ’normal’) {
r <- cor(x[, 1], x[, 2])
con <- - log(2 * pi) - 0.5 * log( det(s) ) ## constant part
z1 <- ( x1 - m[1] ) / sqrt( s[1, 1] )
z2 <- ( x2 - m[2] ) / sqrt( s[2, 2] )
mat1 <- outer(z1^2, rep(1, n1), "*")
mat2 <- outer(rep(1, n2), z2^2, "*")
mat3 <- tcrossprod(z1, z2)
mat <- con - 0.5 /(1 - r^2) * (mat1 + mat2 - 2 * r * mat3)
mat <- exp(mat)
ind <- ( mat < Inf )
ind[ ind == FALSE ] <- NA
mat <- mat * ind
## we did this to avoid any issues with high numbers
contour(x1, x2, mat, nlevels = 10, col = 2, xlab = colnames(x)[1],
ylab = colnames(x)[2])
points(x[, 1], x[, 2])
points(m[1], m[2], pch = 10, col = 2, cex = 1.5)
param = list(mesos = m, covariance = s)
v <- f$df
con <- lgamma( (v + 2) / 2 ) - lgamma(v / 2) - 0.5 * log( det(pi * v * s) )
z1 <- ( x1 - m[1] ) / sqrt(s[1, 1])
z2 <- ( x2 - m[2] ) / sqrt(s[2, 2])
mat1 <- outer(z1^2, rep(1, n1), "*")
mat2 <- outer(rep(1, n2), z2^2, "*")
mat3 <- tcrossprod(z1, z2)
mat <- con - 0.5 * (v + 2) * log1p( 1 /(1 - r^2) *
(mat1 + mat2 - 2 * r * mat3) / v )
mat <- exp(mat)
ind <- ( mat < Inf )
ind[ind == FALSE] <- NA
mat <- mat * ind
## we did this to avoid any issues with high numbers
contour(x1, x2, mat, nlevels = 10, col = 2, xlab = colnames(x)[1],
ylab = colnames(x)[2])
points(x[, 1], x[, 2])
points(m[1], m[2], pch = 10, col = 2, cex = 1.5)
param = list(center = m, scatter = s, df = v)
gamma = para$gamma1)
}
param
The task is the same as before, the difference now is that the fitted distribution is the bivariate
log-normal.
5.3.3 Contour plot of a bivariate inverted Dirichlet distribution
The next function shows what the contour plots of an inverted Dirichlet distribution look
like.
n1 <- 100
n2 <- 100 ## n1 and n2 specify the number of points taken at each axis
da <- invdir.est(x)
a <- da$alpha
x1 <- seq(max(min(x[, 1]) - 1, 0.01), max(x[, 1]) + 1, length = n1)
x2 <- seq(max(min(x[, 2]) - 1, 0.01), max(x[, 2]) + 1, length = n2)
mat <- matrix(NA, nrow = n1, ncol = n2)
con <- lgamma( sum(a) ) - sum( lgamma(a) )
suma <- sum(a)
ra <- a[1:2] - 1
for (i in 1:n1) {
for (j in 1:n2) {
z <- c(x1[i], x2[j])
f <- con + log(z) %*% ra - suma * log(1 + sum(z))
if (exp(f) < Inf) mat[i, j] <- exp(f)
}
}
contour(x1, x2, mat, nlevels = 10, col = 2, xlab = "x1", ylab = "x2")
points(x[, 1], x[, 2])
da
}
The idea is the same as before, take a grid of points and for each point calculate its kernel
density estimate.
## thumb can be either FALSE so the specified h is used or TRUE, so
## that the Scott (or Silverman) rule is used.
n <- dim(x)[1] ## sample size
## the user must make sure he/she has bivariate data.
## If the data are not bivariate the function will not work
if ( thumb ) {
s <- Rfast::colVars(x, std = TRUE)
h <- diag(s * n^(-1/6))
} else if ( !is.matrix(h) ) {
h <- diag(rep(h, 2))
} else h <- h
ha <- solve(h^2)
con <- prod( diag(h) )
n1 <- 100 ## n1 and n2 specify the number of points taken at each axis
x1 <- seq( min(x[, 1]) - 1, max(x[, 1]) + 1, length = n1 )
x2 <- seq( min(x[, 2]) - 1, max(x[, 2]) + 1, length = n1 )
mat <- matrix(NA, nrow = n1, ncol = n1)
for (i in 1:n1) {
for (j in 1:n1) {
a <- as.vector( mahala(x, c( x1[i], x2[j] ), ha, inverted = TRUE ) )
can <- 1/(2 * pi) * ( 1/con ) * sum( exp(-0.5 * a) )/n
if ( abs(can) < Inf ) mat[i, j] <- can
}
}
The task is the same as before. It is not very usual to do contour plots for discrete distributions,
but I did it so that anyone can see what they look like. The function needs the bivariate
data and the estimates of the three parameters. The R function is given below.
## lambda contains the three values of the three parameters
lam1 <- lambda[1]
lam2 <- lambda[2]
lam3 <- lambda[3]
z1 <- seq(max(min(x1) - 3, 0), max(x1) + 3)
n1 <- length(z1)
z2 <- seq(max(min(x2) - 3, 0), max(x2) + 3)
n2 <- length(z2)
mat <- matrix(nrow = n1, ncol = n2)
l1 <- log(lam1)
l2 <- log(lam2)
ls <- -(lam1 + lam2 + lam3)
rho <- lam3/(lam1 * lam2)
for (i in 1:n1) {
for (j in 1:n2) {
f1 <- ls + z1[i] * l1 - lgamma(z1[i] + 1) + z2[j] * l2 - lgamma(z2[j] + 1)
k <- 0:min(z1[i], z2[j])
f2 <- log( sum( choose(z1[i], k) * choose(z2[j], k) * factorial(k) * rho^k ) )
f <- f1 + f2
mat[i, j] <- exp(f)
}
}
contour(z1, z2, mat, nlevels = 10, col = 2, xlab = "x1", ylab = "x2")
points(x1, x2)
}
6 Covariance, principal component analysis and singular value
decomposition
6.1 Fast covariance and correlation matrices
I found this information on Stack Overflow, written by Shabalin (Shabalin, 2012), and I advertise
it here. There are two functions, one for the covariance and one for the correlation matrix.
Use big sample sizes and/or many variables to see differences of up to 50% when compared
with R's standard cov and cor functions, when you have more than 1,000 dimensions; try
it with 5,000 variables to see. If you have small matrices, these functions might take a bit
longer, but the difference is almost negligible.
The next function is the one we have in Rfast (Papadakis et al., 2019).
mahala <- function (x, m, s, inverted = FALSE) {
## x must be a matrix
## m is the mean vector and
## s is the covariance matrix
## if s is the inverse of the covariance matrix
## put inverted = TRUE
y <- t(x) - m
if ( !inverted ) {
di <- Rfast::colsums( y * crossprod( chol2inv( chol(s) ), y ) )
} else di <- Rfast::colsums(y * crossprod(s, y))
di
}
not pay too much attention. At first we have to subtract the mean vector from the data and
then multiply by the square root of the inverse of the covariance matrix
Z = (X − µ ) Σ −1/2 .
The key thing is to decompose the covariance matrix, using Cholesky or eigen decom-
position. We prefer the latter for simplicity and convenience. The spectral decomposition of
the covariance matrix (or any positive definite square matrix in general) is
Σ = VΛV^T = Vdiag(λ_1, . . . , λ_p)V^T,
where V is the matrix containing the eigenvectors, an orthogonal matrix and λ1 , . . . , λ p are
the p eigenvalues (the number of dimensions), where λ1 ≥ λ2 ≥ . . . ≥ λ p > 0. The square
root, the inverse of Σ and its inverse square root can be written as
Σ^{1/2} = Vdiag(λ_1^{1/2}, . . . , λ_p^{1/2})V^T,
Σ^{−1} = Vdiag(λ_1^{−1}, . . . , λ_p^{−1})V^T   and
Σ^{−1/2} = Vdiag(λ_1^{−1/2}, . . . , λ_p^{−1/2})V^T
respectively.
Actually any power of a positive definite matrix can be calculated this way. If the covariance
matrix is not of full rank (equal to p), that is if there is at least one eigenvalue equal to zero, it
becomes clear why the inverse does not exist. Another thing to highlight is that the number
of non zero eigenvalues is equal to the rank of the matrix (or vice versa). The following
function performs this transformation using eigen decomposition of the covariance matrix.
Alternatively, another standardization is simply to center the variables (subtract from
each variable its mean) and then divide by its standard deviation, z_i = (x_i − m_i)/s_i, for i = 1, . . . , p.
A similar, but robust, way is to use the median and the median absolute deviation instead.
A similar, but robust, way is to use the median and the median absolute deviation instead.
Note that the built-in R command scale is used to center the data and make their
standard deviations equal to 1. See its help page for the options it offers.
## x is the data
## type is either ’matrix’, ’mean’ or ’median’
if (type == "matrix") {
s <- cov(x) ## covariance matrix
B <- solve( chol(s) )
m <- Rfast::colmeans(x)
y <- t(x) - m
z <- crossprod(y, B)
} ## multivariate standardization
if (type == "mean") {
m <- Rfast::colmeans(x)
s <- Rfast::colVars(x, std = TRUE)
z <- t( ( t(x) - m ) / s )
}
if (type == "median") {
m <- Rfast::colMedians(x)
y <- t( t(x) - m )
s <- Rfast::colMedians( abs(y) ) / qnorm(3/4)
z <- t( t(y) / s )
}
z
}
X = UDV T , (6.1)
The rank of an n × p matrix is equal to min{n, p}. Using (6.1), each column of X can be written as
x_j = ∑_{k=1}^r u_k d_k v_{jk}.
This means that we can reconstruct the matrix X using less columns (if n > p) than it has.
x̃_j^m = ∑_{k=1}^m u_k d_k v_{jk},   where m < r.
k =1
The reconstructed matrix will have some discrepancy of course, but it is the level of
discrepancy we are interested in. If we center the matrix X, subtract the column means
from every column, and perform the SVD again, we will see that the orthonormal matrix V
contains the eigenvectors of the covariance matrix of the original, the un-centred, matrix X.
Coming back to the a matrix of n observations and p variables, the question was how
many principal components to retain. We will give an answer to this using SVD to recon-
struct the matrix. We describe the steps of this algorithm below.
1. Center the matrix X by subtracting the column means from every column.
2. Perform SVD on the centred matrix.
3. Choose a number m from 1 to r (the rank of the matrix) and reconstruct the matrix using
(6.1). Let us denote by Ỹ_m the reconstructed matrix.
4. Calculate the sum of squared differences between the reconstructed and the original
values
PRESS(m) = ∑_{i=1}^n ∑_{j=1}^p (ỹ_{ij}^m − y_{ij})²,   m = 1, . . . , r.
5. Plot PRESS (m) for all the values of m and choose graphically the number of principal
components.
The graphical way of choosing the number of principal components is not the best and there
are alternative ways of making a decision (see for example ?). The code in R is given below.
A <- svd(x) ## SVD of the centred matrix
u <- A$u
d <- A$d
v <- t( A$v )
p <- length(d)
press <- numeric(p)
y <- 0
for (i in 1:p) {
y <- y + u[, i, drop = FALSE] %*% ( d[i] * v[i, , drop = FALSE] )
press[i] <- sqrt( sum( (y - x)^2 ) ) ## calculation of the PRESS
}
runtime <- proc.time() - runtime
dev.new()
plot(val, type = "b", pch = 9, xlab = "Number of components",
ylab = "Eigenvalues")
dev.new()
6.6 Choosing the number of principal components using probabilistic
PCA
Probabilistic PCA assumes that the principal components come from a multivariate normal
distribution. In fact one can decompose the covariance matrix as Σ = WW T + σ2 I p , where
I p is the identity matrix in p dimensions. The maximum likelihood solution is given by
Tipping and Bishop (1999)
W_ML = U_q (Λ_q − σ²_ML I_q)^{1/2}   and   σ²_ML = [1/(p − q)] ∑_{j=q+1}^p λ_j,
where λ j are the eigenvalues of the covariance matrix, Λ q is a diagonal matrix with these
eigenvalues and Uq is the matrix containing the corresponding eigenvectors.
At first one performs eigen analysis on the covariance matrix and extracts the eigenval-
ues λ j and the matrix with the eigenvectors U. Then, estimate the variance ”lost” over the
projection σML2 and finally the matrix W
ML .
This is the closed form solution, but also an EM based solution can be found for the case
where there are missing data (Tipping and Bishop, 1999) and expanded to the multivariate
t distribution by Zhao and Jiang (2006). The matrix W ML does not contain unit vectors, but
if normalised, they are the same as the eigenvectors. Note also that the covariance matrix
using eigen analysis (or spectral decomposition) is written as Σ = WΛW^T. What is offered
though by this approach is an automated method for selection of principal components
(Minka, 2001).
The first log-likelihood is the BIC approximation given by
BIC = −(n/2) ∑_{j=1}^q log λ_j − [n(p − q)/2] log σ²_ML − [(m + q)/2] log n,
where n is the sample size and m = p( p − 1)/2 − ( p − q)( p − q − 1)/2. The second log-
likelihood is a modification of the log-likelihood (Minka, 2001) as proposed by Rajan and
Rayner (1997)
RR = −(np/2) log(2π) − (nq/2) log( (∑_{j=1}^q λ_j)/q ) − [n(p − q)/2] log σ²_ML − np/2.
eig <- prcomp(x)
lam <- eig$sd^2
vec <- eig$rotation
sigma <- cumsum( sort(lam) ) / (1:p)
sigma <- sort(sigma, decreasing = TRUE)[-1]
lsigma <- log(sigma)
#for (i in 1:p) {
# H <- vec[, 1:i] %*% ( lam[1:i] * diag(i) - sigma[i] * diag(i) )
#}
m <- p * (p - 1) / 2 - ( p - 1:c(p - 1) ) * ( p - 1:c(p - 1) - 1 ) / 2
bic <- - n / 2 * cumsum( log(lam)[ 1:c(p - 1) ] ) -
n * ( p - 1:c(p - 1) ) / 2 * lsigma - ( m + 1:c(p - 1) ) / 2 * log(n)
rr <- -n * p / 2 * log(2 * pi) - n * p / 2 - n * ( 1:c(p - 1) ) / 2 *
log( mean( lam[ 1:c(p - 1) ] ) ) - n * ( p - 1:c(p - 1) ) / 2 * lsigma
runtime <- proc.time() - runtime
list(sigma = sigma, wml = t(wml) )
}
6.7 Confidence interval for the percentage of variance retained by the first
κ components
The algorithm is taken from Mardia et al., 1979, pg. 233-234. The percentage retained by the
first κ principal components, denoted by ψ̂, is equal to

ψ̂ = ∑_{i=1}^κ λ̂_i / ∑_{j=1}^p λ̂_j.

Its asymptotic variance is

τ² = 2 [ (1 − ψ)² (λ_1² + ... + λ_κ²) + ψ² (λ_{κ+1}² + ... + λ_p²) ] / [ (n − 1)(tr Σ)² ]
   = 2 tr Σ² (ψ² − 2αψ + α) / [ (n − 1)(tr Σ)² ],

where

α = (λ_1² + ... + λ_κ²) / (λ_1² + ... + λ_p²)   and   tr Σ² = λ_1² + ... + λ_p².
The bootstrap version provides an estimate of the bias, defined as ψ̂_boot − ψ̂, and confidence
intervals calculated via the percentile method and via the standard (or normal)
method (Efron and Tibshirani, 1993). The code below gives the option to perform bootstrap
or not, by making the number of bootstrap resamples (B) equal to or greater than 1.
trasu2 <- sum(lam^2)
alpha <- sum( (lam^2)[1:k] ) / trasu2
t2 <- ( 2 * trasu2 * (psi^2 - 2 * alpha * psi + alpha) )/
( (n - 1) * (trasu^2) )
ci <- c(psi - qnorm(1 - a/2) * sqrt(t2), psi +
qnorm(1 - a/2) * sqrt(t2))
result <- c(psi, ci = ci)
names(result) <- c( ’psi’, paste(c( a/2 * 100, (1 - a/2) * 100 ),
"%", sep = "") )
}
if (B > 1) {
## bootstrap version
tb <- numeric(B)
for (i in 1:B) {
b <- sample(1:n, n, replace = TRUE)
lam <- prcomp(x[b, ])$sd^2
tb[i] <- sum( lam[1:k] ) / sum(lam)
}
conf1 <- c( psi - qnorm(1 - a/2) * sd(tb), psi + qnorm(1 - a/2) * sd(tb) )
conf2 <- quantile(tb, probs = c(a/2, 1 - a/2))
hist(tb, xlab = "Bootstrap percentages", main = "")
abline(v = psi, lty = 2, lwd = 2)
abline(v = mean(tb), lty = 1, lwd = 3)
ci <- rbind(conf1, conf2)
legend( conf2[1], B/10, cex = 0.8, c("psi", "bootstrap psi"),
lty = c(2, 1), lwd = c(2, 3) )
colnames(ci) <- paste(c( a/2 * 100, (1 - a/2) * 100 ), "%", sep ="")
rownames(ci) <- c("standard", "empirical")
res <- c(psi, mean(tb), mean(tb) - psi )
names(res) <- c(’psi’, ’psi.boot’, ’est.bias’)
result <- list(res = res, ci = ci)
}
result
}
6.8 A metric for covariance matrices
A metric for covariance matrices is the title of a paper by Förstner and Moonen (2003) and
this is what we will show here now. The suggested metric between two p × p non-singular
covariance matrices is
d²(A, B) = ∑_{i=1}^p [ log λ_i(AB^{−1}) ]²,
where λi stands for the i-th eigenvalue and note that the order of the multiplication of the
matrices is not important.
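A minimal sketch of this metric in R follows; cov.dist is a placeholder name, and the calculation uses the Cholesky factor of B so that only a symmetric eigen problem is solved (the eigenvalues of U^{-T}AU^{-1} equal those of AB^{-1}).
cov.dist <- function(A, B) {
  Ui <- solve( chol(B) )            ## B = U^T U, so Ui = U^(-1)
  M <- crossprod(Ui, A) %*% Ui      ## U^(-T) A U^(-1)
  lam <- eigen(M, symmetric = TRUE, only.values = TRUE)$values
  sqrt( sum( log(lam)^2 ) )
}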
H is the d × D Helmert sub-matrix whose i-th row has its first i entries equal to 1/√(i(i+1)),
its (i+1)-th entry equal to −i/√(i(i+1)) and zeros elsewhere; for example, the first row is
(1/√2, −1/√2, 0, . . . , 0), the second row is (1/√6, 1/√6, −2/√6, 0, . . . , 0) and the last row is
(1/√(dD), . . . , 1/√(dD), −d/√(dD)).   (6.2)
A function constructing the Helmert sub-matrix is given below.
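This is a minimal sketch based directly on (6.2); the name helm is a placeholder.
helm <- function(D) {
  ## D is the number of columns; the sub-matrix has d = D - 1 rows
  d <- D - 1
  H <- matrix(0, d, D)
  for (i in 1:d) {
    H[i, 1:i] <- 1 / sqrt( i * (i + 1) )
    H[i, i + 1] <- - i / sqrt( i * (i + 1) )
  }
  H
}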
• AA⁻A = A
• A⁻AA⁻ = A⁻
• (AA⁻)* = AA⁻
• (A⁻A)* = A⁻A,
where A* stands for the Hermitian transpose (or conjugate transpose) of A. Schaefer et al.
(2007) have written the corpcor package which finds the Moore-Penrose pseudo-inverse matrix.
There is also the command ginv inside MASS and I am doing the same stuff. Let us now see
how we calculate the pseudo-inverse matrix using singular value decomposition (SVD).
We will repeat the same stuff as in Section 6.5. Suppose we have a p × p matrix X, which
is of rank r ≤ p. Using SVD we can write the matrix as
A = UDV T ,
This is exactly the key point we need, the r positive singular values and thus, the r columns
and rows of the U and V respectively. The pseudo-inverse of A is then given by
A⁻ = U D_r^{−1} V^T,
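A minimal sketch via svd is given below; pseudoinv is a placeholder name, and for a symmetric matrix (such as a covariance matrix) the result coincides with the expression above.
pseudoinv <- function(A, tol = 1e-12) {
  dec <- svd(A)
  r <- sum( dec$d > tol )   ## the rank, i.e. the number of positive singular values
  dec$v[, 1:r, drop = FALSE] %*% ( t( dec$u[, 1:r, drop = FALSE] ) / dec$d[1:r] )
}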
F = [ I d : − j d ]. (6.3)
This is simply the identity matrix with one extra column to the right with all elements equal
to −1. Then the pseudo-inverse Γ⁻ is equal to
Γ⁻ = F^T (FΓF^T)^{−1} F.
You can also use the Helmert matrix of A in ginv function as the F matrix.
For a symmetric matrix A the matrix exponential is e^A = Vdiag(e^{λ_1}, . . . , e^{λ_p})V^T,
where V is the matrix containing the eigenvectors of the matrix A, λ_1, . . . , λ_p are the eigenvalues
of A and p is the rank of A, assuming it is of full rank. A nice explanation of this can
be found at Joachim Dahl's course webpage (slide No 10).
If, on the other hand, the matrix is not symmetric but still square, Chang (1986) gives
the formula via a Taylor series of the exponential function presented below. However, this
holds true only if the matrix has real eigenvectors and eigenvalues; if you have complex (or
imaginary) numbers I do not know how you deal with them.

e^A = I_p + A + A²/2! + A³/3! + . . .

As for the power of a matrix, Mardia et al., 1979, pg. 470 provide a simple formula
A^n = Vdiag(λ_1^n, . . . , λ_p^n)V^T.
The reader is addressed to Moler and Van Loan (2003) for a review and thorough discussion
of ways to calculate the matrix exponential.
i <- i + 1
a1 <- a2
a2 <- a1 + ( vec %*% lam^i %*% tvec ) / factorial(i)
}
res <- list(iter = i, mat = a2)
} else if ( all(t(A) - A == 0) ) { ## symmetric matrix
explam <- exp( eig$values )
mat <- vec %*% ( t(vec) * explam )
res <- list( mat = mat )
}
res
}
7 Robust statistics
7.1 Approximate likelihood trimmed mean
Fraiman and Meloche (1999) defined a multivariate trimmed mean which they call the Approximate
Likelihood trimmed mean or APL-trimmed mean. In sum, the APL-trimmed mean for
a dataset X consisting of n observations is defined as

μ_γ = ∑_{i=1}^n x_i 1{f̂(x_i) ≥ γ} / ∑_{i=1}^n 1{f̂(x_i) ≥ γ},
where 1{.} is the indicator function and \hat{f}(x_i) is the kernel density estimate using (5.3). I have to note that Fraiman and Meloche (1999) used a slightly different way of calculating the kernel density: the estimate at each point is computed with that point left out, so that it does not contribute to its own density estimate.
The tuning parameter, γ, which determines the amount of trimming one requires, can be
adjusted by solving the equation
\frac{1}{n} \sum_{i=1}^{n} 1\left\{ \hat{f}(x_i) \geq \gamma \right\} = 1 - \alpha,
where α is the percentage of observations to be used in the calculation of the APL-trimmed mean. Based on this mean, or rather on the selected observations, one can also calculate the relevant covariance matrix. The next R function offers this possibility.
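A minimal sketch of the idea is given below, assuming a spherical Gaussian kernel with a fixed bandwidth h instead of the estimator of (5.3); the function name and the defaults are illustrative.
apl.mean <- function(x, a = 0.1, h = 1) {
  x <- as.matrix(x)
  n <- dim(x)[1]  ;  p <- dim(x)[2]
  d2 <- as.matrix( dist(x) )^2                       ## pairwise squared Euclidean distances
  k <- exp( -0.5 * d2 / h^2 ) / ( (2 * pi)^(p/2) * h^p )
  diag(k) <- 0                                       ## leave each point out of its own estimate
  fhat <- rowSums(k) / (n - 1)                       ## kernel density estimate at every observation
  gam <- quantile(fhat, a)                           ## gamma so that a fraction 1 - a is retained
  keep <- fhat >= gam
  list( mean = colMeans( x[keep, , drop = FALSE] ),
        cov = cov( x[keep, , drop = FALSE] ) )
}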
The function uses the kernel density estimate (5.3), or a leave-one-out version of it, when choosing the bandwidth parameter h.
The above iterative procedure for the spatial median is much faster than using nlm or optim in R. The stopping criterion I have used is that the sum of the absolute differences between two successive iterations is less than 10^{-9}. If at some step k the spatial median coincides with one of the data vectors, the final result is the spatial median at the k − 1 step.
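A minimal sketch of such an iterative (Weiszfeld type) procedure is given below; the function name and the starting value (the arithmetic mean) are illustrative choices and do not necessarily match those of spat.med.
spatmed.sketch <- function(x, tol = 1e-09) {
  x <- as.matrix(x)
  n <- dim(x)[1]  ;  p <- dim(x)[2]
  m1 <- Rfast::colmeans(x)                      ## starting value
  repeat {
    d <- sqrt( Rfast::rowsums( ( x - rep(m1, rep(n, p)) )^2 ) )
    if ( any(d == 0) )  return(m1)              ## the estimate coincides with a data point
    w <- 1 / d
    m2 <- Rfast::colsums(x * w) / sum(w)        ## weighted mean update
    if ( sum( abs(m2 - m1) ) < tol )  return(m2)
    m1 <- m2
  }
}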
An older version of this function also exists; I have kept it so that you can measure the time with both, but of course the newer one above is to be preferred as it is much faster.
The spatial sign covariance matrix (SSCM) is defined as
SSC = \frac{1}{n} \sum_{i=1}^{n} s( x_i - \gamma ) \, s( x_i - \gamma )^T,
where γ is the spatial median we saw before and s(x) = x / \|x\|. So at first we subtract the spatial median from all observations, then we normalise each vector (make it a unit vector) and finally calculate the classical covariance matrix of these unit vectors.
sscov <- function(x, me = NULL, tol = 1e-09) {
  ## the first lines are a reconstruction: x is the data matrix
  n <- dim(x)[1]  ;  p <- dim(x)[2]
  if ( is.null(me) )  me <- spat.med(x, tol)  ## spatial median of x
  y <- x - rep( me, rep(n, p) )
  rs <- sqrt( Rfast::rowsums(y^2) )
  crossprod( y / rs ) / n  ## SSCM
}
In spatial median regression the matrix of coefficients B is chosen to minimise the sum of the Euclidean norms of the residual vectors,
\sum_{i=1}^{n} \| y_i - B x_i \|.
If the dependent variable is a vector, or a matrix with one column, then univariate median regression is implemented. For more information on univariate median regression see the package quantreg by Koenker (2015). I use this package to obtain the coefficients of the univariate regression of every Y_i on the Xs; these serve as initial values in the optimisation procedure. Previously I was using the OLS coefficients for this purpose and, in the examples I tried, the choice did not seem to matter, but with a large sample, many variables or many outliers it might. The function spatmed.reg is the most efficient of the three functions below.
be <- matrix(beta, nrow = p)
est <- x %*% be
sum( sqrt( rowSums((y - est)^2) ) )
}
## we use nlm and optim to obtain the beta coefficients
ini <- matrix(nrow = p, ncol = d)
for (i in 1:d) ini[, i] <- coef( quantreg::rq(y[, i] ~ x[, -1]) )
ini <- as.vector(ini)
qa <- nlm(medi, ini, z = z, iterlim = 10000)
qa <- optim(qa$estimate, medi, z = z, control = list(maxit = 20000))
qa <- optim(qa$par, medi, z = z, control = list(maxit = 20000),
hessian = TRUE)
beta <- matrix( qa$par, ncol = dim(y)[2] )
if ( is.null(xnew) ) {
est = x %*% beta
} else {
xnew <- cbind(1, xnew)
xnew <- as.matrix(xnew)
est <- xnew %*% beta
}
if ( is.null(colnames(y)) ) {
colnames(seb) <- colnames(beta) <- paste("Y", 1:d, sep = "")
} else colnames(seb) <- colnames(beta) <- colnames(y)
if ( is.null(colnames(x)) ) {
p <- dim(x)[2] - 1
rownames(beta) <- c("constant", paste("X", 1:p, sep = "") )
if ( !is.null(seb) ) {
rownames(seb) <- c("constant", paste("X", 1:p, sep = "") )
}
} else {
rownames(beta) <- c("constant", colnames(x)[-1] )
if ( !is.null(seb) ) rownames(seb) <- c("constant", colnames(x)[-1] )
}
if (d == 1) est <- as.vector(est)
runtime <- proc.time() - runtime
tic = proc.time()
B1 = .lm.fit(x, y)$coefficients
est = x %*% B1
res = y - est
di2 = sqrt( rowSums( res^2 ) )
z = x /di2
der = crossprod(z, res)
der2 = - crossprod(x, z) + tcrossprod(der)
B2 = B1 - solve(der2, der)
i = 2
while ( sum( abs( B2- B1 ) ) > tol ) {
B1 = B2
est = x %*% B1
res = y - est
di2 = sqrt( rowSums( res^2 ) )
z = x / di2
der = crossprod(z, res)
der2 = - crossprod(x, z) + tcrossprod(der)
i = i + 1
B2 = B1 - solve(der2, der)
}
be <- B2
## we use nlm and optim to obtain the standard errors
z <- list(y = y, x = x)
qa <- nlm(medi, as.vector(be), z = z, iterlim = 1000, hessian = TRUE)
seb <- sqrt( diag( solve(qa$hessian) ) )
seb <- matrix(seb, ncol = d)
if ( is.null(xnew) ) {
est <- x %*% be
} else {
xnew <- cbind(1, xnew)
xnew <- as.matrix(xnew)
est <- xnew %*% be
}
if ( is.null(colnames(y)) ) {
colnames(seb) <- colnames(be) <- paste("Y", 1:d, sep = "")
} else colnames(seb) <- colnames(be) <- colnames(y)
if ( is.null(colnames(x)) ) {
p <- dim(x)[2] - 1
rownames(be) <- c("constant", paste("X", 1:p, sep = "") )
rownames(seb) <- c("constant", paste("X", 1:p, sep = "") )
} else {
rownames(be) <- c("constant", colnames(x)[-1] )
rownames(seb) <- c("constant", colnames(x)[-1] )
}
runtime = proc.time() - tic
list(iter = i, runtime = runtime, beta = be, seb = seb, est = est)
}
p <- dim(x)[2]
d <- dim(y)[2]
medi <- function(be, y, x, p) {
be <- matrix(be, nrow = p)
est <- x %*% be
sum( sqrt( rowSums( (y - est)^2 ) ) )
}
tic <- proc.time()
be <- B2
seb <- NULL
if ( ses ) {
## we use nlm and optim to obtain the standard errors
qa <- nlm(medi, as.vector(be), y = y, x = x, p = p, iterlim = 5000,
hessian = TRUE)
seb <- sqrt( diag( solve(qa$hessian) ) )
seb <- matrix(seb, ncol = d)
if ( is.null(colnames(y)) ) {
colnames(seb) <- colnames(be) <- paste("Y", 1:d, sep = "")
} else colnames(seb) <- colnames(be) <- colnames(y)
if ( is.null(colnames(x)) ) {
p <- dim(x)[2] - 1
rownames(be) <- c("constant", paste("X", 1:p, sep = "") )
rownames(seb) <- c("constant", paste("X", 1:p, sep = "") )
} else {
rownames(be) <- c("constant", colnames(x)[-1] )
rownames(seb) <- c("constant", colnames(x)[-1] )
}
if ( is.null(xnew) ) {
est <- x %*% be
} else {
xnew <- model.matrix(y ~ ., as.data.frame(xnew) )
est <- xnew %*% be
}
if ( is.null(colnames(y)) ) {
colnames(be) <- paste("Y", 1:d, sep = "")
} else colnames(be) <- colnames(y)
if ( is.null(colnames(x)) ) {
p <- dim(x)[2] - 1
rownames(be) <- c("constant", paste("X", 1:p, sep = "") )
} else {
rownames(be) <- c("constant", colnames(x)[-1] )
}
runtime <- proc.time() - tic
list(iter = i, runtime = runtime, be = be, seb = seb, est = est)
}
7.6 Detecting multivariate outliers graphically with the forward search
The forward search (FS) is a way to identify multivariate outliers graphically. A possible multivariate outlier is an observation whose squared Mahalanobis distance is greater than χ²_{0.975, p}, where p denotes the number of dimensions. If, however, the covariance matrix is not estimated robustly, this can lead to the masking effect: the influence of some outliers is masked, so that they appear not to be outliers. For this reason robust estimation of the covariance matrix is necessary. The squared Mahalanobis distance of a multivariate observation x is given by
MD^2(\mathbf{x}) = ( \mathbf{x} - \boldsymbol{\mu} )^T \boldsymbol{\Sigma}^{-1} ( \mathbf{x} - \boldsymbol{\mu} ),
where µ and Σ denote the location and scatter parameters.
In the first step of the search a good subset must be chosen; that is, an outlier-free subset must be found in order to provide robust estimators of some parameters. After the subset size is determined, a large number (e.g. 1000) of subsets of that size are drawn. Let n denote the number of multivariate observations and n_g the initial subset size, so that there are \binom{n}{n_g} possible subsets. Once a good subset is determined, the search consists of n − n_g steps, the number of observations that will enter the initial subset.
Many ways have been suggested in the literature for finding the best subset with which to start the search. The MCD is used here and the fraction it requires is equal to [(n + p + 1)/2], where n and p indicate the sample size and the number of variables or dimensions, respectively, and [x] denotes the largest integer not greater than x. So the idea is to obtain robust estimates of scatter and location initially and then use these to calculate the Mahalanobis distances of the selected observations (based on which the robust estimates are calculated). Then keep the n_g observations with the smallest
Mahalanobis distances.
The initial subset size is another issue. Atkinson et al. (2004) proposed a size of 3p; however, the exact size is not crucial as long as the subset is outlier-free. I believe that the initial
subset size should be determined taking into account the dimensions of the data matrix
(both the number of variables and the sample size). However, in the function presented
here, the default value is 20% of the sample size.
Finally, the mean and the variance of the observations in the subset are estimated. If
there are no outliers in the data, the estimates are very robust.
Given a subset of n_g observations, one must find a way to progress in the search, that is, a way to include the m = n − n_g remaining multivariate observations. The subset is also called the basic set (at each step its size increases) and the set with all the other data is called the non-basic set (at each step its size decreases). One good way is to calculate the Mahalanobis distances of the observations not in the initial subset from the robust estimates of scatter and location provided by the basic set and order them from smallest to largest. The observation with the smallest Mahalanobis distance leaves the non-basic set and enters the basic set, and the estimates of scatter and location are re-estimated.
The size of basic set is now n g + 1 and there are m − 1 remaining steps of the FS and
hence m − 1 observations in the non-basic set. The Mahalanobis distances of the observa-
tions in the non-basic set are calculated and ordered again in an ascending order and the
observation with the smallest distance enters the basic set. This procedure is repeated until
all observations from the non-basic set enter the basic set.
One observation is added at each step, but the inclusion of an outlier can cause the ordering of the Mahalanobis distances of the points not in the basic set to change. This change of the data ordering during the FS is a feature of multivariate data and not of univariate data, as mentioned by Atkinson et al. (2004).
At this point we must say that this is the non-standard FS. In the standard FS a point can be included in the set at one step and removed at a later step.
The last step of the FS involves monitoring some statistics of interest during the search
which are helpful in the identification of outliers or observations that have a larger effect
than expected. One statistic of interest could be the minimum Mahalanobis distance of the
observations not in the basic set. If the distance is large, this is an indication that an out-
lier is about to enter the basic set. If however a cluster of outliers join the set successively,
these minimum distances will decrease. Another way is to monitor the change between two successive minimum Mahalanobis distances, or the Mahalanobis distances scaled by the determinant of the covariance matrix (Atkinson et al., 2004).
If one’s concern lies in estimating the influence of an observation in a model (multiple
regression or factor analysis for instance) then the parameter estimates, the residuals and
other goodness of fit tests are likely to be of more interest. It is true, that even a single
outlier can cause a factor analysis model to go wrong or a test of multivariate normality to
fail.
The output of the forward.ns function has two components: a) the order of entrance of all the observations and b) the minimum Mahalanobis distances of the initial step and the minimum Mahalanobis distances as described in step 2.
for ( j in 1:c(vim - 1) ) {
d <- Rfast::mahala( z3[, -1], mn, sn/(nt - 1) )
a <- which.min(d)
dis[j] <- min(d)
y <- z3[a, -1]
nt <- nt + 1
sn <- (sn + tcrossprod(y - mn) * (nt - 1)/ nt )
mn <- ( mn * (nt - 1) + y )/nt
ini <- rbind(ini, z3[a, ])
z3 <- z3[-a, ]
}
Step 1 Take m = 100 (you can increase if you like) small samples (subsets) of size 2 (this is
suggested by the authors and is used in their codes).
Step 2 Calculate the covariance matrix of these m small samples and take the product of their
diagonal elements (variances).
Step 3 Select the subset with the minimum diagonal product value.
Step 4 Calculate the Mahalanobis distances of the all the points using the mean vector and the
diagonal covariance matrix calculated from the selected subset. When I say diagonal
I mean that you only calculate the variances of the variables. No covariance terms are
calculated.
Step 1 Order the Mahalanobis distances and keep the h = [n/2] + 1 observations corresponding to the h smallest distances, where [x] is the integer part of x.
Step 2 Use this subset and calculate the mean vector and the diagonal covariance matrix.
Step 3 Calculate the squared Mahalanobis distances MD_i^2 of all observations and scale them using
\widetilde{MD}_i^2 = \frac{ p \, MD_i^2 }{ \mathrm{median}\left( MD_i^2 \right) }.
Step 4 Calculate tr\left( R_h^2 \right), where R_h is the correlation matrix of the h observations and tr denotes the trace of a matrix, and then compute
\widehat{ tr\left( R^2 \right) } = tr\left( R_h^2 \right) - \frac{p^2}{h} \quad \text{and} \quad \hat{c}_{pn} = 1 + \frac{ tr\left( R_h^2 \right) }{ p^{3/2} }.
Step 5 Give observation i a positive weight if
\frac{ \widetilde{MD}_i^2 - p }{ \sqrt{ 2 \, \widehat{ tr\left( R^2 \right) } \, \hat{c}_{pn} } } < z_{1 - \delta},
where z_{1-\delta} is the 1 − δ quantile of the standard normal distribution; this is the weighting implemented in the code below.
Step 6 For the observations with positive weight repeat Steps 2-5. That is, for this selected
subset of observations, calculate the updated mean vector and covariance matrix, then
calculate the scaled squared Mahalanobis distances and calculate the weights once
more. These will be the final weights. If an observation has a zero weight it means it
is a possible outlier.
I was given the function below by Changliang Zou, who is one of the authors of the Biometrika 2015 paper I have just described (Kwangil et al., 2015), so any questions requiring a deeper understanding should be addressed to him (or any of his co-authors).
ty <- t(y)
for ( A in 1:itertime ) {
id <- sample(n, init_h, replace = FALSE)
ny <- y[id, ]
mu_t <- Rfast::colmeans(ny)
var_t <- Rfast::colVars(ny)
sama <- ( ty - mu_t )^2 / var_t
disa <- Rfast::colsums(sama)
crit <- 10
l <- 0
jvec <- numeric(n)  ## previous membership indicator; added so the first comparison below is defined
while (crit != 0 & l <= 15) {
l <- l + 1
ivec <- numeric(n)
dist_perm <- order(disa)
ivec[ dist_perm[1:h] ] <- 1
crit <- sum( abs(ivec - jvec) )
jvec <- ivec
newy <- y[dist_perm[1:h], ]
mu_t <- Rfast::colmeans(newy)
var_t <- Rfast::colVars(newy)
sama <- ( ty - mu_t )^2 / var_t
disa <- Rfast::colsums(sama)
}
tempdet <- prod(var_t)
if(bestdet == 0 | tempdet < bestdet) {
bestdet <- tempdet
final_vec <- jvec
}
}
mu_t <- Rfast::colmeans( y[submcd, ] )
var_t <- Rfast::colVars( y[submcd, ] )
sama <- ( ty - mu_t )^2 / var_t
disa <- Rfast::colsums(sama)
disa <- disa * p / Rfast::med(disa)
ER <- cora( y[submcd, ] )
tr2_h <- sum( ER^2 ) ## trace of ER %*% ER
tr2 <- tr2_h - p^2 / h
cpn_0 <- 1 + (tr2_h) / p^1.5
w0 <- (disa - p) / sqrt( 2 * tr2 * cpn_0 ) < qnorm(1 - delta)
nw <- sum(w0)
sub <- (1:n)[w0]
ysub <- y[sub, ]
mu_t <- Rfast::colmeans( ysub )
var_t <- Rfast::colVars( ysub )
ER <- cora( ysub )
tr2_h <- sum( ER^2 ) ## trace of ER %*% ER
tr2 <- tr2_h - p^2 / nw
sama <- ( ty - mu_t )^2 / var_t
disa <- Rfast::colsums(sama)
scal <- 1 + 1/sqrt( 2 * pi) * exp( - qnorm(1 - delta)^2 / 2 ) /
(1 - delta) * sqrt( 2 * tr2) / p
disa <- disa / scal
cpn_1 <- 1 + (tr2_h) / p^1.5
dis <- (disa - p) / sqrt(2 * tr2 * cpn_1)
wei <- dis < qnorm(1 - alpha)
runtime <- proc.time() - runtime
list(runtime = runtime, dis = dis, wei = wei)
}
The above version is the original one with some modifications to make it faster. However, if you have 10,000 or even 40,000 variables and not enough memory, it is unlikely to work. The reason is that after the for ( A in 1:itertime ) loop the correlation matrix must be calculated, twice. For this reason I created a memory-efficient version of the above. The function below is taken directly from the package Rfast (Papadakis et al., 2019). The for loop was converted into C++ and the part after it became memory efficient using the following mathematical trick,
tr\left( \mathbf{R}\mathbf{R} \right) = \sum_{i=1}^{p} \lambda_i^2,
where R is the correlation matrix of the n × p matrix X and λ_i is its i-th eigenvalue; these eigenvalues can be obtained from the singular values of the standardised data, so the p × p matrix R never needs to be formed.
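A minimal sketch of this trick is given below; the function name is illustrative and it assumes the n × p data matrix itself fits in memory.
tr.R2 <- function(y) {
  n <- dim(y)[1]
  z <- scale(y)                                  ## centred and scaled data, so R = crossprod(z) / (n - 1)
  lam <- svd(z, nu = 0, nv = 0)$d^2 / (n - 1)    ## the non-zero eigenvalues of R
  sum(lam^2)                                     ## equals tr(R %*% R)
}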
scal <- 1 + exp( - qnorm(1 - delta)^2 / 2 ) /
(1 - delta) * sqrt( tr2) / p / sqrt(pi)
disa <- disa / scal
cpn_1 <- 1 + (tr2_h) / p^1.5
dis <- (disa - p) / sqrt(2 * tr2 * cpn_1 )
wei <- dis < qnorm(1 - alpha)
####
runtime <- proc.time() - runtime
list(runtime = runtime, dis = dis, wei = wei)
}
8 Compositional data
Compositional data are a special type of multivariate data in which the elements of each
observation vector are non-negative and sum to a constant, usually taken to be unity. Data
of this type arise in biological settings, for instance, where the researcher is interested in the
proportion of megakaryocytes in ploidy classes. Other areas of application of compositional
data analysis include geology, where the metal composition of a rock specimen is of inter-
est; archaeometry, where the composition of ancient glasses for instance is of interest; and
economics, where the focus is on the percentage of the household expenditure allocated to
different products. Other fields are political sciences, forensic sciences, ecology and sedi-
mentology.
The main book suggested to the reader for familiarising himself with compositional data is Aitchison's book (Aitchison, 2003). For more information one can look at the Lecture notes on Compositional Data Analysis and at Van Den Boogaart and Tolosana-Delgado (2013). The functions described here are also available in the R package Compositional (Tsagris and Athineou, 2016).
In mathematical terms, we can define the relevant sample space as
S^d = \left\{ ( x_1, \ldots, x_D )^T \ \middle| \ x_i \geq 0, \ \sum_{i=1}^{D} x_i = 1 \right\}, (8.1)
where d = D − 1. When D = 3, the best way to visualise such data is the ternary diagram (a three-dimensional pyramid can be used when D = 4), which is essentially a triangle. If we plot the simplex in three dimensions, what we see is a two-dimensional triangle, therefore a projection to two dimensions under the unity-sum constraint is convenient. The result is the already mentioned ternary diagram. The higher the value of a component, the closer the point lies to the corresponding vertex.
The columns of (8.2) represent the vertices of an equilateral triangle in Cartesian coordinates (Schnute and Haigh, 2007). In this way the length of each side of the triangle
is equal to 1. Watson and Nguyen (1985) gave a different representation of an equilateral
triangle, in which case the barycentre lies on the origin and the height of the triangle is equal
to 1, resulting in the length of the sides being greater than 1. Viviani’s theorem concerns any
point within the triangle and the three lines from that point which are perpendicular to the
sides of the triangle. The sum of the lengths of the lines is a fixed value, regardless of the
position of the point and is equal to the height of the triangle. Below we present the code to
produce a ternary plot.
The pair of coordinates of every composition in R2 after multiplying by the P matrix (8.2)
is given by
\mathbf{y} = ( y_1, y_2 ) = \left( x_2 + \frac{x_3}{2}, \ \frac{ x_3 \sqrt{3} }{2} \right). (8.3)
Below is the code to produce the ternary plot with the compositional vectors plotted in R². The code also plots the closed geometric mean (Aitchison, 1989) and the simple arithmetic mean of the data. The closed geometric mean of a composition X is defined as
\boldsymbol{\mu}_0 = \left( \frac{ g_1 }{ g_1 + \ldots + g_D }, \ \ldots, \ \frac{ g_D }{ g_1 + \ldots + g_D } \right), (8.4)
where
g_i = \prod_{j=1}^{n} x_{ij}^{1/n}, \quad i = 1, \ldots, D.
We have added an extra option, the plotting of the first principal component on S². Let us look at this option in a bit more detail. If you use the package compositions this option is available there, but here we show how it is constructed. In addition, MASS has a function for the ternary diagram; type ?Skye to see the dataset Skye, in whose help page the function ternary appears. At first let us transform the data using the centred log-ratio transformation
\mathbf{y} = \left( \log x_1 - \frac{1}{D} \sum_{i=1}^{D} \log x_i, \ \ldots, \ \log x_D - \frac{1}{D} \sum_{i=1}^{D} \log x_i \right) = \left( \log \frac{ x_1 }{ g(\mathbf{x}) }, \ldots, \log \frac{ x_D }{ g(\mathbf{x}) } \right), (8.6)
where g(\mathbf{x}) = \left( \prod_{j=1}^{D} x_j \right)^{1/D} is the geometric mean of each compositional vector. Then we will
calculate the eigenvectors (V) of the covariance matrix of the centred log-ratio transformed
data, as Aitchison (1983) suggests. We will take only the first eigenvector v_1 and the mean of the transformed data, \hat{\boldsymbol{\mu}}, so that the beginning of this unit vector is not the origin (0, 0, 0) but the mean vector \hat{\boldsymbol{\mu}}.
So the eigenvector starts from \hat{\boldsymbol{\mu}} and its direction is given by its values. This vector therefore has a beginning and an end, i.e. two points in the Euclidean coordinate system which define it; let us call them A (the mean \hat{\boldsymbol{\mu}}) and B. In general, a line segment on the Euclidean hyperplane is defined by two points and a scalar λ,
λA + (1 − λ)B.
We calculate the scores of the first principal component to see their range, so that we can adjust the values of λ accordingly. Thus, all we have to do now is choose m different values of λ and calculate points on the straight line defined by the eigenvector. A and B have three elements each, so in the end we will have a matrix Z with m rows and 3 columns. Now we will calculate the inverse of (8.6) for each row of Z in order to map the line segment back onto the simplex S²,
\mathbf{c}_j = \left( \frac{ e^{ z_{1j} } }{ \sum_{k=1}^{D} e^{ z_{kj} } }, \ \ldots, \ \frac{ e^{ z_{Dj} } }{ \sum_{k=1}^{D} e^{ z_{kj} } } \right), \quad j = 1, \ldots, m.
The matrix C = (c1 , . . . , cm ) T contains m points of the first principal component inside
the simplex. We just have to put these points in the ternary diagram.
## the next code checks for zeros
ina[ rowSums(x == 0) > 0 ] <- 3
b1 <- c(1/2, 0, 1, 1/2)
b2 <- c(sqrt(3)/2, 0, 0, sqrt(3)/2)
b <- cbind(b1, b2)
plot(b[, 1], b[, 2], type = "l", xlab = " ", ylab = " ", pty = "s",
xaxt = "n", yaxt = "n", bty = "n")
proj <- matrix(c(0, 1, 1/2, 0, 0, sqrt(3)/2), ncol = 2)
d <- x %*% proj
points( d[1:n, 1], d[1:n, 2], col = ina )
text( b[1, 1], b[1, 2] + 0.02, nam[3], cex = 1 )
text( b[2:3, 1], b[2:3, 2] - 0.02, nam[1:2], cex = 1 )
if ( means ) {
## should the means appear in the plot?
points( d[c(n + 1), 1], d[c(n + 1), 2], pch = 2, col = 2 )
points( d[c(n + 2), 1], d[c(n + 2), 2], pch = 3, col = 3 )
legend(0.57, 0.9, c("closed geometric mean", "arithmetic mean"),
pch = c(2, 3), col = c(2, 3), bg = 'gray90')
}
if (pca & min(x) > 0 ) {
## should the first principal component appear?
zx <- log(x[1:n, ])
z <- zx - Rfast::rowmeans( zx ) ## clr transformation
m <- Rfast::colmeans(z) ## mean vector in the clr space
a <- eigen( cov(z) )$vectors[, 1] + m ## move the unit vector a bit
sc <- z %*% a
lam <- seq( min(sc) - 1.5, max(sc) + 1.5, length = n )
x1 <- cbind( a[1] * lam, a[2] * lam, a[3] * lam) + cbind( m[1] * (1 - lam),
m[2] * (1 - lam), m[3] * (1 - lam) )
expx1 <- exp(x1)
wa1 <- expx1 / Rfast::rowsums( expx1 ) ## first principal component in S^2
wa <- wa1 %*% proj
lines(wa, lwd = 2, lty = 2)
}
mu <- rbind(m1, m2)
rownames(mu) <- c("closed geometric", "arithmetic mean")
colnames(mu) <- nam
mu
8.1.2 Log-ratio transformations
The Dirichlet distribution (8.9), which we will see later, is a natural parametric model on the simplex, but not a very rich one. Alternative distributions are the multivariate normal, the multivariate skew-normal and the multivariate t distribution. We will show two transformations which allow us to map S^d onto R^d.
Aitchison (2003) suggested a log-ratio transformation for compositional data. He termed it the additive log-ratio transformation and it is the generalised logistic transformation,
\mathbf{y} = \left( \log \frac{ x_1 }{ x_D }, \ \ldots, \ \log \frac{ x_d }{ x_D } \right), (8.7)
where x_D indicates the last component (any other component can play the role of the common divisor). Another log-ratio transformation we saw before, also suggested by Aitchison (1983), is the centred log-ratio transformation (8.6). The additive log-ratio transformation maps the data from S^d onto R^d, in contrast to the centred log-ratio transformation (8.6), which maps S^d onto
Q^d = \left\{ ( x_1, \ldots, x_D )^T : \ \sum_{i=1}^{D} x_i = 0 \right\}.
However, if we left multiply the centred log-ratio transformation (8.6) by the Helmert
sub-matrix (6.2) the result is the isometric log-ratio transformation (Egozcue et al., 2003)
which maps the data from Qd onto Rd .
z = Hy (8.8)
The multiplication by the Helmert matrix is often met in shape analysis and it was applied also in simplex shape spaces by Le and Small (1999). It was also known to Aitchison (2003), who gave the relationship between the covariance matrices of the (8.6) and (8.8) transformations. In fact, the multiplication by the Helmert sub-matrix leads to what he called standard orthogonal contrasts.
We will skip the technical details here and just say that the road is open now to fit multi-
variate distributions whose support is the whole of Rd . To be more accurate, we also need
the Jacobians of the log-ratio transformations, but in the contour plot we will not use them.
For more information the reader is addressed to Aitchison (2003) and Pawlowsky Glahn
et al. (2007). We can apply either the additive log-ratio transformation (8.7) or the isomet-
ric log-ratio transformation (8.8) and in the transformed data fit a multivariate distribution
defined in Rd .
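As a small illustration, minimal sketches of the two transformations (and of the inverse of the additive one) could look as follows; they assume a helm function such as the sketch given earlier and the names are illustrative.
alr <- function(x)  log( x[, -ncol(x)] / x[, ncol(x)] )   ## additive log-ratio (8.7)
alrinv <- function(y) {
  ey <- cbind( exp(y), 1 )
  ey / rowSums(ey)                                        ## back onto the simplex
}
ilr <- function(x) {
  z <- log(x) - rowMeans( log(x) )                        ## centred log-ratio (8.6)
  tcrossprod( z, helm( ncol(x) ) )                        ## isometric log-ratio (8.8)
}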
8.2 Estimating location and scatter parameters for compositional data
I provide a general function which allows fitting a multivariate normal, t or skew-normal distribution to compositional data and hence estimating their parameters. I left the multivariate skew-t out because of its complexity. In addition, robust estimation of the mean and covariance matrix via the MCD method is offered. Sharp (2006) used the graph median as a measure of central tendency for compositional data; we will provide a function to calculate the spatial median instead of the graph median, along with the spatial sign covariance matrix. We saw the spatial median function in Section 7.4. In all cases, either the additive log-ratio (8.7) or the isometric log-ratio transformation (8.8) can be used.
mu <- mu / sum(mu)
s <- mod$scatter
dof <- mod$dof
} else {
y <- alfa(x, 0)
mod <- multivt(y)
m <- mod$center
mu <- alfainv(m, 0)
s <- mod$scatter
dof <- mod$dof
}
result <- list(mean = m, comp.mean = mu, covariance = s, dof = dof)
} else {
y <- alfa(x, 0)
delta <- spat.med(y)
comp.delta <- alfainv(delta, 0)
s <- sscov(y, delta)
}
result <- list(spat.med = delta, comp.spat.med = comp.delta, ssc = s)
result
}
The density of the Dirichlet distribution is
f( x_1, \ldots, x_D ; \alpha_1, \ldots, \alpha_D ) = \frac{1}{ B(\boldsymbol{\alpha}) } \prod_{i=1}^{D} x_i^{ \alpha_i - 1 }, (8.9)
where
B(\boldsymbol{\alpha}) = \frac{ \prod_{i=1}^{D} \Gamma( \alpha_i ) }{ \Gamma\left( \sum_{i=1}^{D} \alpha_i \right) } \quad \text{and} \quad \boldsymbol{\alpha} = ( \alpha_1, \ldots, \alpha_D ).
In the next two sections we see how to estimate the parameters of the Dirichlet distribution.
• Classical MLE. We can use the optim function to maximise the log-likelihood. The argument hessian = TRUE, which we will see in the function diri.est, calculates the Hessian matrix, and the inverse of the Hessian serves as the observed information matrix of the parameters. This approach can also be found in the package VGAM (Yee, 2010); the extra feature offered by the package is the ability to include covariates.
• MLE with the precision parameter φ. Writing \alpha_i = \phi a_i^* with \sum_{i=1}^{D} a_i^* = 1, the log-likelihood becomes
\ell = n \log \Gamma( \phi ) - \sum_{j=1}^{n} \sum_{i=1}^{D} \log \Gamma( \phi a_i^* ) + \sum_{j=1}^{n} \sum_{i=1}^{D} ( \phi a_i^* - 1 ) \log x_{ij}, (8.11)
• Estimation via the entropy. We will make use of the following relationship,
E\left( \log X_i \right) = \psi( \alpha_i ) - \psi( \alpha_0 ), (8.12)
where ψ is the digamma function,
\psi(x) = \frac{d}{dx} \log \Gamma(x) = \frac{ \Gamma'(x) }{ \Gamma(x) }, \quad \text{and} \quad \alpha_0 = \sum_{i=1}^{D} \alpha_i.
Instead of trying to maximise the log-likelihood of the Dirichlet distribution, we will try to solve the k simultaneous equations imposed by (8.12). If you notice, these are just the first derivatives of the log-likelihood with respect to each of the parameters; in other words, they are the score equations, since the expectation is involved. I then opened a book I have by Ng et al. (2011) about the Dirichlet distribution and saw that they show this approach to be the generalised method of moments (GMM). No matter what the method is called, we will use the package BB (Varadhan and Gilbert, 2009). A disadvantage of the "entropy style" estimation is that, in contrast, the log-likelihood maximisation is very stable and its results can be compared with the package VGAM (Yee, 2010).
Below is the code offering all three options: classical MLE, MLE with the precision parameter φ, and estimation via the entropy. For the classical MLE type I take the exponential of the parameters to avoid negative solutions; in the alternative parametrisation I take the exponential of φ and of the other parameters. This is a classical trick for such cases: instead of using constrained optimisation to avoid negative values, use the exponential.
diriphi <- function(param, z) {
phi <- exp(param[1])
b <- c(1, exp(param[-1]) )
b <- b / sum(b)
f <- -( n * lgamma(phi) - n * sum( lgamma(phi * b) ) +
sum( z %*% (phi * b - 1) ) )
f
}
if (type == 'mle') {
runtime <- proc.time()
oop <- options(warn = -1)
on.exit(options(oop))
da <- nlm(loglik, colMeans(x) * 10, z = z, iterlim = 10000)
da <- nlm(loglik, da$estimate, z = z, iterlim = 10000)
da <- nlm(loglik, da$estimate, z = z, iterlim = 10000)
da <- optim(da$estimate, loglik, z = z, control = list(maxit = 2000),
hessian = TRUE)
if (type == 'prec') {
runtime <- proc.time()
oop <- options(warn = -1)
on.exit(options(oop))
da <- nlm(diriphi, c(10, colMeans(x)[-1]), z = z, iterlim = 2000)
da <- nlm(diriphi, da$estimate, z = z, iterlim = 2000)
da <- nlm(diriphi, da$estimate, z = z, iterlim = 2000, hessian = TRUE)
da <- optim(da$estimate, diriphi, z = z, control = list(maxit = 3000),
hessian = TRUE)
phi <- exp(da$par[1])
a <- c( 1, exp(da$par[-1]) )
a <- a/sum(a)
if (type == 'ent') {
runtime <- proc.time()
## this requires the BB package
ma <- colMeans(z)
da <- BB::BBsolve(colMeans(x) * 10, entro, control =
list(maxit = 20000, tol = 1e-10))
da <- BBsolve( da$par, entro, control = list(maxit = 20000, tol = 1e-10) )
da <- BBsolve( da$par, entro, control = list(maxit = 20000, tol = 1e-10) )
da <- BBsolve( da$par, entro, control = list(maxit = 20000, tol = 1e-10) )
param <- exp(da$par)
lik <- n * lgamma( sum(param) ) - n * sum( lgamma(param) ) +
sum( z %*% (param - 1) )
result
}
Let me now do the same calculations using the Newton-Raphson algorithm as suggested by Minka (2000). Apart from the algorithm for estimating the parameters of the Dirichlet-multinomial distribution (see §5.1.9), Minka (2000) described the whole Newton-Raphson algorithm for the Dirichlet distribution, but we only need some lines from it. If you want to know more, read his technical report. All you need to know, in summary, is the final
Newton-Raphson algorithm, given by
\boldsymbol{\alpha}_{new} = \boldsymbol{\alpha}_{old} - \frac{ \mathbf{g} - \mathbf{J}_p b }{ \mathbf{q} },
where
g_k = n \psi\left( \sum_{j=1}^{p} \alpha_j \right) - n \psi( \alpha_k ) + \sum_{i=1}^{n} \log x_{ik}, \quad k = 1, \ldots, p,
q_k = - n \psi'( \alpha_k ), \quad k = 1, \ldots, p,
b = \frac{ \sum_{j=1}^{p} g_j / q_j }{ 1/z + \sum_{j=1}^{p} 1/q_j } \quad \text{and} \quad z = n \psi'\left( \sum_{j=1}^{p} \alpha_j \right).
The function \psi(t) = \Gamma'(t) / \Gamma(t) is the digamma function and \psi'(t) is its derivative (the trigamma function). \mathbf{J}_p is the p-dimensional vector of ones, since we have p dimensions, and n is the sample size of the compositional data X. The initial value for the precision parameter, as suggested by Minka (2000), is given by
\phi_0 = \frac{ (D - 1)/2 }{ - \sum_{i=1}^{D} \bar{x}_i \left( \frac{1}{n} \sum_{j=1}^{n} \log x_{ij} - \log \bar{x}_i \right) }.
Hence, initial estimates for α are given by \boldsymbol{\alpha}_{ini} = \phi_0 \left( \bar{x}_1, \ldots, \bar{x}_D \right). This makes the estimation a bit faster. I tried using the previous function (diri.est) with these initial values, but it was much slower(!!).
The Newton-Raphson algorithm is implemented in the function below and the results come equally fast. In truth, the two approaches are exactly the same; the difference is that Minka (2000) did a very good job of simplifying the calculations down to vectors only, i.e. no matrices are involved and hence no matrix inversions.
If you want to see more code for the Dirichlet distribution, check the S-plus/R codes used in the book Dirichlet and Related Distributions: Theory, Methods and Applications by Ng et al. (2011). The only problem I have seen with this method is that if the data are concentrated around a point, say the centre of the simplex, it will be hard for this and the previous methods to give estimates of the parameters. In this extremely difficult scenario I would suggest the use of the previous function with the precision parametrisation, diri.est(x, type = "prec"); it will be extremely fast and accurate. Another option is the function offered by the R package VGAM, vglm(x ~ 1, dirichlet) (Yee, 2010).
runtime <- proc.time()
x <- as.matrix(x) ## makes sure x is a matrix
x <- x / Rfast::rowsums(x) ## makes sure x is compositional data
n <- dim(x)[1] ## the sample size
p <- dim(x)[2] ## dimensionality
m <- Rfast::colmeans(x)
zx <- t( log(x) )
down <- - sum( m * ( Rfast::rowmeans( zx ) - log(m) ) )
sa <- 0.5 * (p - 1) / down ## initial value for precision
a1 <- sa * m ## initial values
gm <- Rfast::rowsums(zx)
z <- n * digamma( sa )
g <- z - n * digamma(a1) + gm
qk <- - n * trigamma(a1)
b <- ( sum(g / qk) ) / ( 1/z - sum(1 / qk) )
a2 <- a1 - (g - b)/qk
i <- 2
} else if (type == 2) {
p <- dim(x)[2]
zx <- t( log(x) )
ma <- Rfast::rowmeans(zx)
m <- Rfast::colmeans(x)
down <- - sum( m * ( ma - log(m) ) )
sa <- 0.5 * (p - 1) / down ## initial value for precision
a1 <- sa * m ## initial values
f <- ma - digamma(a1) + digamma( sa )
f2 <- matrix( trigamma(sa), p, p)
diag(f2) <- diag(f2) - trigamma(a1)
a2 <- a1 - solve(f2, f)
i <- 2
}
loglik <- n * lgamma( sum(a2) ) - n * sum( lgamma(a2) ) +
sum( zx * (a2 - 1) )
runtime <- proc.time() - runtime
}
if ( is.null(colnames(x)) ) {
names(a2) <- paste("X", 1:p, sep = "")
} else names(a2) <- colnames(x)
The symmetric Dirichlet distribution arises when all of its parameters are equal. To test
this assertion we will use the log-likelihood ratio test statistic. The relevant R code is given
below
sym.test <- function(x) {
## x contains the data
n <- dim(x)[1] ## the sample size
D <- dim(x)[2] ## the dimensionality of the data
zx <- log(x)
We show a function to calculate the Kullback-Leibler divergence between two Dirichlet distributions. The proof of the Kullback-Leibler divergence between Dir(a) and Dir(b) is available at this technical report. The divergence is equal to
KL\left( D_1(\mathbf{a}) \, \| \, D_2(\mathbf{b}) \right) = \sum_{i=1}^{D} ( a_i - b_i ) \left[ \psi( a_i ) - \psi( a_0 ) \right] + \sum_{i=1}^{D} \log \frac{ \Gamma( b_i ) }{ \Gamma( a_i ) } + \log \frac{ \Gamma( a_0 ) }{ \Gamma( b_0 ) },
where a_0 = \sum_{i=1}^{D} a_i and b_0 = \sum_{i=1}^{D} b_i.
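A minimal sketch of such a function, following the formula above, is given below; the name is illustrative.
kl.diri <- function(a, b) {
  ## a and b are the parameter vectors of the two Dirichlet distributions
  a0 <- sum(a)  ;  b0 <- sum(b)
  sum( (a - b) * ( digamma(a) - digamma(a0) ) ) +
    sum( lgamma(b) - lgamma(a) ) + lgamma(a0) - lgamma(b0)
}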
We will also give below the code to calculate the Bhattacharyya distance between two Dirichlet distributions, which is defined as
JB\left( D_1(\mathbf{a}), D_2(\mathbf{b}) \right) = \log \Gamma\left( \sum_{i=1}^{D} \frac{ a_i + b_i }{2} \right) + \frac{1}{2} \sum_{i=1}^{D} \left[ \log \Gamma( a_i ) + \log \Gamma( b_i ) \right]
- \sum_{i=1}^{D} \log \Gamma\left( \frac{ a_i + b_i }{2} \right) - \frac{1}{2} \left[ \log \Gamma\left( \sum_{i=1}^{D} a_i \right) + \log \Gamma\left( \sum_{i=1}^{D} b_i \right) \right]. (8.13)
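A minimal sketch of this calculation, again with an illustrative name, is the following.
bhatt.diri <- function(a, b) {
  ## a and b are the parameter vectors of the two Dirichlet distributions
  ab <- (a + b) / 2
  lgamma( sum(ab) ) + 0.5 * sum( lgamma(a) + lgamma(b) ) -
    sum( lgamma(ab) ) - 0.5 * ( lgamma( sum(a) ) + lgamma( sum(b) ) )
}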
What the user has to do is fit a parametric model (a Dirichlet distribution for example, or the normal, t or skew-normal distribution on the log-ratio transformed data) and estimate its parameters, and then add a couple of extra lines to the functions that follow, in which the compositional data are plotted.
We take a grid of points in R² and see whether each point lies within the triangle (the ternary plot seen in Section 8.1.1). If it does, it corresponds to a composition; to find that composition we need to invert (8.3). The coordinates of a compositional vector in R², taken from (8.3), are
( y_1, y_2 ) = \left( x_2 + \frac{ x_3 }{2}, \ \frac{ x_3 \sqrt{3} }{2} \right).
We have the pair (y_1, y_2) and want to recover (x_1, x_2, x_3). The result is
x_3 = \frac{ 2 y_2 }{ \sqrt{3} }, \quad x_2 = y_1 - \frac{ y_2 }{ \sqrt{3} }, \quad x_1 = 1 - x_2 - x_3.
Thus (x_1, x_2, x_3) ∈ S² when (y_1, y_2) falls within the interior of the triangle. If you plot the ternary diagram from Section 8.1.1 you will see that the top vertex of the triangle is located at \left( 0.5, \frac{\sqrt{3}}{2} \right) and the other two vertices are located at (0, 0) and (1, 0), as given in (8.2). Thus, the three lines which define the triangle are
y_2 = 0 \ \text{ with } \ 0 \leq y_1 \leq 1,
y_2 = \sqrt{3}\, y_1 \ \text{ with } \ 0 \leq y_1 \leq 0.5,
y_2 = \sqrt{3} - \sqrt{3}\, y_1 \ \text{ with } \ 0.5 \leq y_1 \leq 1.
Thus, only the points inside the interior of the triangle come from a composition. Once we
have calculated ( x1 , x2 , x3 ) from the pair of ys which lie inside the interior of the triangle we
will plug them in (8.9). In this way we will calculate the density of the Dirichlet with some
given parameter (estimated or not) at that point. We will do this for all points and in the end
we will plot the contour lines along with the triangle. There is the option to plot the data as
well. The code is given below.
for ( i in 1:c(n/2) ) {
for (j in 1:n) {
if ( x2[j] < sqrt3 * x1[i] ) {
## This checks if the point will lie inside the triangle
## the next three lines invert the points which lie inside
## the triangle back into the composition in S^2
w3 <- 2 * x2[j]/sqrt3
w2 <- x1[i] - x2[j]/sqrt3
w1 <- 1 - w2 - w3
w <- c(w1, w2, w3)
can <- (1 / beta) * prod( w^(a - 1) )
if (abs(can) < Inf) mat[i, j] <- can else mat[i, j] <- NA
} else mat[i, j] <- NA
}
}
for (i in c(n/2 + 1):n) {
for (j in 1:n) {
## This checks if the point will lie inside the triangle
if ( x2[j] < sqrt3 - sqrt3 * x1[i] ) {
## the next three lines invert the points which lie inside
## the triangle back into the composition in S^2
w3 <- 2 * x2[j]/sqrt3
w2 <- x1[i] - x2[j]/sqrt3
w1 <- 1 - w2 - w3
w <- round(c(w1, w2, w3), 6)
can <- (1 / beta) * prod(w^(a - 1))
if (abs(can) < Inf) mat[i, j] <- can else mat[i, j] <- NA
} else mat[i, j] <- NA
}
}
if ( !is.null(x) ) {
x <- as.matrix(x)
x <- x / Rfast::rowsums(x)
proj <- matrix(c(0, 1, 1/2, 0, 0, sqrt3/2), ncol = 2)
xa <- x %*% proj
points(xa[, 1], xa[, 2])
}
We will repeat Section 8.4.1, the only difference being that we now give the code for the contour plot of the bivariate normal distribution. The idea is the same: we choose a grid of points and for each pair we check whether it falls within the triangle; if it does, we calculate the density of the bivariate normal at that point by plugging it into (8.14). There is the option to make the data appear or not.
if (type == "alr") {
ya <- log( x[, -3] / x[, 3] )
} else {
ya <- log(x)
ya <- ya - Rfast::rowmeans( ya )
ya <- as.matrix( ya %*% ha )
}
for ( i in 1:c(n/2) ) {
for ( j in 1:n ) {
if ( x2[j] < sqrt3 * x1[i] ) {
## This checks if the point will lie inside the triangle
## The next 4 lines calculate the composition
w3 <- 2 * x2[j] / sqrt3
w2 <- x1[i] - x2[j] / sqrt3
w1 <- 1 - w2 - w3
w <- c(w1, w2, w3)
if (type == "alr") {
y <- log( w[-3] / w[3] ) ## additive log-ratio transformation
} else {
y <- log(w) - mean( log(w) )
y <- as.vector( y %*% ha )
} ## isometric log-ratio transformation
can <- down * exp( -0.5 * ( ( y - m ) %*% st %*% ( y - m ) ) )
if (abs(can) < Inf) mat[i, j] <- can
}
}
}
w1 <- 1 - w2 - w3
w <- c(w1, w2, w3)
if (type == "alr") {
y <- log( w[-3] / w[3] ) ## additive log-ratio transformation
} else {
y <- log(w) - mean( log(w) )
y <- as.vector( y %*% ha )
} ## isometric log-ratio transformation
can <- down * exp( -0.5 * ( ( y - m ) %*% st %*% ( y - m ) ) )
if (abs(can) < Inf) mat[i, j] <- can
}
}
}
contour( x1, x2, mat, nlevels = 7, col = 3, pty = "s", xaxt = "n",
yaxt = "n", bty = "n" )
b1 <- c( 1/2, 0, 1, 1/2 )
b2 <- c( sqrt3/2, 0, 0, sqrt3/2 )
b <- cbind(b1 ,b2)
points( b[, 1], b[, 2], type = "l", xlab = " ", ylab = " " )
nam <- colnames(x)
if ( is.null(nam) ) nam <- paste("X", 1:3, sep = "")
text( b[1, 1], b[1, 2] + 0.01, nam[3], cex = 1)
text( b[2:3, 1] + 0.01, b[2:3, 2] - 0.01, nam[1:2], cex = 1)
if ( appear ) {
proj <- matrix(c(0, 1, 1/2, 0, 0, sqrt(3)/2), ncol = 2)
x <- as.matrix(x)
x <- x/rowSums(x)
xa <- x %*% proj
points(xa[, 1], xa[, 2])
}
The density of the multivariate t distribution is given in (5.1). After applying the additive
log-ratio (8.7) or the isometric log-ratio transformation (8.8) to the compositional data we
can estimate the parameters of the multivariate t distribution via numerical optimization.
In Section 5.1.3 we provided a function to perform this task.
The way to produce a contour plot of the bivariate t distribution on the simplex is similar
to the normal distribution. The code is given below. There is the option to make the data
appear or not.
if (type == 'alr') {
y <- log( x[, -3] / x[, 3] ) ## additive log-ratio transformation
} else {
ha <- t( helm(3) )
y <- log(x)
y <- y - Rfast::rowmeans( y )
y <- as.matrix( y %*% ha )
}
for (i in 1:c(n/2) ) {
for (j in 1:n) {
if (x2[j] < sqrt3 * x1[i]) { ## This checks if the point lies
## inside the triangle
## The next 4 lines calculate the composition
w3 <- 2 * x2[j] / sqrt3
w2 <- x1[i] - x2[j] / sqrt3
w1 <- 1 - w2 - w3
w <- c(w1, w2, w3)
if (type == 'alr') {
y <- log(w[-3] / w[3]) ## additive log-ratio transformation
} else {
y <- log(w) - mean(log(w))
y <- as.vector( y %*% ha )
} ## isometric log-ratio transformation
}
}
}
if (type == 'alr') {
y <- log(w[-3]/w[3]) ## additive log-ratio transformation
} else {
y <- log(w) - mean(log(w))
y <- as.vector( y %*% ha )
} ## isometric log-ratio transformation
if ( appear ){
nam <- colnames(x)
if ( is.null(nam) ) nam <- paste("X", 1:3, sep = "")
text(b[1, 1], b[1, 2] + 0.02, nam[3], cex = 1)
text(b[2:3, 1], b[2:3, 2] - 0.02, nam[1:2], cex = 1)
proj <- matrix(c(0, 1, 1/2, 0, 0, sqrt3/2), ncol = 2)
x <- as.matrix(x) ; x <- x/rowSums(x)
xa <- x %*% proj
points(xa[, 1], xa[, 2])
}
In order to fit the skew-normal distribution (5.9) to a compositional dataset we first apply
either the additive log-ratio (8.7) or the isometric log-ratio transformation (8.8). Using the
transformed data we need to estimate the parameters of the skew-normal distribution.
The code to produce a contour plot for the bivariate skew-normal distribution on the
simplex is given below. There is also the option to make the data appear or not.
## the type parameter determines whether the additive
## or the isometric log-ratio transformation will be used.
## If type = 'alr' (the default), the additive log-ratio transformation is used.
## If type = 'ilr', the isometric log-ratio transformation is used.
## n is the number of points of each axis used
x <- as.matrix(x)
x <- x / rowSums(x)
ha <- t( helm(3) )
if (type == "alr") {
ya <- log( x[, -3] / x[, 3] )
} else {
ya <- log(x)
ya <- ya - Rfast::rowmeans( ya )
ya <- as.matrix( ya %*% ha )
}
sqrt3 <- sqrt(3)
mod <- sn::msn.mle(y = ya)
param <- mod$dp
x1 <- seq(0.001, 0.999, length = n)
x2 <- seq(0.001, sqrt3/2 - 0.001, length = n)
mat <- matrix(nrow = n, ncol = n)
for ( i in 1:c(n/2) ) {
for ( j in 1:n ) {
## This checks if the point lies inside the triangle
if ( x2[j] < sqrt3 * x1[i] ) {
## The next 4 lines calculate the composition
w3 <- 2 * x2[j] / sqrt3
w2 <- x1[i] - x2[j] / sqrt3
w1 <- 1 - w2 - w3
w <- c(w1, w2, w3)
if (type == 'alr') {
y <- log(w[-3] / w[3]) ## additive log-ratio transformation
} else {
y <- log(w) - mean(log(w))
y <- as.vector( y %*% ha )
} ## isometric log-ratio transformation
can <- sn::dmsn(y, dp = param)
if ( abs(can) < Inf ) mat[i, j] <- can
}
}
}
for ( i in c(n/2+1):n ) {
for ( j in 1:n ) {
## This checks if the point will lie inside the triangle
if ( x2[j] < sqrt3 - sqrt3 * x1[i] ) {
## The next 4 lines calculate the composition
w3 <- 2 * x2[j] / sqrt3
w2 <- x1[i] - x2[j] / sqrt3
w1 <- 1 - w2 - w3
w <- c(w1, w2, w3)
if (type == 'alr') {
y <- log(w[-3] / w[3]) ## additive log-ratio transformation
} else {
y <- log(w) - mean(log(w))
y <- as.vector( y %*% ha )
} ## isometric log-ratio transformation
can <- sn::dmsn(y, dp = param)
if ( abs(can) < Inf ) mat[i, j] <- can
}
}
}
if ( appear ) {
nam <- colnames(x)
if ( is.null(nam) ) nam <- paste("X", 1:3, sep = "")
text(b[1, 1], b[1, 2] + 0.02, nam[3], cex = 1)
text(b[2:3, 1], b[2:3, 2] - 0.02, nam[1:2], cex = 1)
proj <- matrix(c(0, 1, 1/2, 0, 0, sqrt3/2), ncol = 2)
x <- as.matrix(x) ; x = x/rowSums(x)
xa <- x %*% proj
points(xa[, 1], xa[, 2])
}
We need again the R package mixture (Browne et al., 2015) for the contour plots.
for ( i in 1:c(n/2) ) {
for (j in 1:n) {
if ( x2[j] < sqrt3 * x1[i] ) {
## This checks if the point will lie inside the triangle
## The next 4 lines calculate the composition
w3 <- 2 * x2[j] / sqrt3
w2 <- x1[i] - x2[j] / sqrt3
w1 <- 1 - w2 - w3
w <- c(w1, w2, w3)
if (type == "alr") y <- log( w[-3]/w[3] ) ## alr transformation
if (type == "ilr") { ## isometric log-ratio transformation
y <- log(w) - mean(log(w))
y <- as.vector( y %*% ha )
}
ta <- numeric(g)
for (k in 1:g) {
ta[k] <- ldet[k] - 0.5 * mahalanobis(y, mu[k, ], su[, , k])
}
can <- sum( prob * exp(ta) )
if (abs(can) < Inf) mat[i, j] <- can
}
}
}
can <- sum( prob * exp(ta) )
if ( abs(can) < Inf ) mat[i, j] <- can
}
}
}
contour( x1, x2, mat, nlevels = 7, col = 3, pty = "s", xaxt = "n",
yaxt = "n", bty = "n" )
b1 <- c(1/2, 0, 1, 1/2)
b2 <- c( sqrt3/2, 0, 0, sqrt3/2 )
b <- cbind(b1, b2)
points(b[ , 1], b[ , 2] , type = "l", xlab = " ", ylab = " ")
The idea is the same as before, but a bit different now: instead of providing the parameters of a distribution, the user provides the dataset itself and decides whether the additive or the isometric log-ratio transformation is to be used. Then the best bandwidth parameter h is obtained via tuning (see Section 5.1.6) and, for a grid of points, the kernel density estimation takes place. The difference from the previous contour plots is that now we can see the data plotted on the simplex, along with the contours of the kernel density estimate.
x <- as.matrix(x)
x <- x / Rfast::rowsums(x) ## makes sure x is compositional data
nu <- dim(x)[1] ## sample size
sqrt3 <- sqrt(3)
ha <- t( helm(3) )
for ( i in 1:c(n/2) ) {
for ( j in 1:n ) {
if (x2[j] < sqrt3 * x1[i]) {
## This checks if the point will lie inside the triangle
## The next 4 lines calculate the composition
w3 <- 2 * x2[j] / sqrt3
w2 <- x1[i] - x2[j]/sqrt3
w1 <- 1 - w2 - w3
w <- c(w1, w2, w3)
if (type == "alr") y <- log(w[-3]/w[3]) ## alr transformation
if (type == "ilr") { ## isometric log-ratio transformation
y <- log(w) - mean(log(w))
y <- as.vector(y %*% ha )
}
a <- Rfast::mahala(z, y, ts)
can <- 1/(2 * pi) * (1/con) * sum( exp(-0.5 * a) )/nu
if ( abs(can) < Inf ) mat[i, j] <- can
}
}
contour( x1, x2, mat, nlevels = 7, col = 3, pty = "s", xaxt = "n",
yaxt = "n", bty = "n" )
proj <- matrix(c(0, 1, 1/2, 0, 0, sqrt3/2), ncol = 2)
da <- x %*% proj
points(da[, 1], da[, 2])
b1 <- c(1/2, 0, 1, 1/2)
b2 <- c(sqrt3/2, 0, 0, sqrt3/2)
b <- cbind(b1, b2)
points(b[, 1], b[, 2], type = "l", xlab = " ", ylab = " ")
nam <- colnames(x)
if ( is.null(nam) ) nam <- paste("X", 1:3, sep = "")
text( b[1, 1], b[1, 2] + 0.02, nam[3], cex = 1 )
text( b[2:3, 1], b[2:3, 2] - 0.02, nam[1:2], cex = 1 )
8.5 The α-transformation for compositional data
This Section is about a recently suggested Box-Cox type power transformation for compo-
sitional data termed the α-transformation (Tsagris et al., 2011). Discriminant analysis and
regression using this transformation will be covered, but I decided to put everything related
to the α-transformation in one Section.
The α-transformation is defined as
\mathbf{z}_\alpha(\mathbf{x}) = \mathbf{H} \cdot \frac{ D \mathbf{u}_\alpha(\mathbf{x}) - \mathbf{1}_D }{ \alpha }, (8.15)
where H is the Helmert sub-matrix (6.2), \mathbf{1}_D is the D-dimensional vector of ones and
\mathbf{u}_\alpha(\mathbf{x}) = \left( \frac{ x_1^\alpha }{ \sum_{j=1}^{D} x_j^\alpha }, \ \ldots, \ \frac{ x_D^\alpha }{ \sum_{j=1}^{D} x_j^\alpha } \right)^T. (8.16)
The Jacobian determinant of the α-transformation is
|\mathbf{J}| = D^d \prod_{i=1}^{D} \frac{ x_i^{\alpha - 1} }{ \sum_{j=1}^{D} x_j^\alpha }, (8.17)
and the inverse of the power transformation (8.16) is
\mathbf{u}_\alpha^{-1}(\mathbf{x}) = \left( \frac{ x_1^{1/\alpha} }{ \sum_{j=1}^{D} x_j^{1/\alpha} }, \ \ldots, \ \frac{ x_D^{1/\alpha} }{ \sum_{j=1}^{D} x_j^{1/\alpha} } \right).
If one is willing to exclude from the sample space the boundary of the simplex, which corresponds to observations that have one or more components equal to zero, then the α-transformation (8.15) and its inverse (8.18) are well defined for all α ∈ R. (Excluding the boundary is standard practice in LRA, because the definition is used to sidestep the problem of having data with zeros.) The motivation for the α-transformation (8.15) is that the case α = 0 corresponds to LRA, since in the limit as α → 0, (8.15) tends to (8.8). The case α = 1 corresponds to analysing compositional data as if they were Euclidean (Baxter, 2001, Baxter et al., 2005, Woronow, 1997); in this case (8.15) is just a linear transformation of the simplex S^d. Thus, (8.15) is a more general transformation than the isometric (or the centred) log-ratio one.
Power transformations similar to (8.15) were considered by Greenacre (2009) and Greenacre (2011), in the somewhat different context of correspondence analysis. A Box-Cox transformation applied to each component of x ∈ S^d, so that x is transformed to
\left[ \theta^{-1} \left( x_1^\theta - 1 \right), \ \ldots, \ \theta^{-1} \left( x_D^\theta - 1 \right) \right]^T, (8.19)
has the limit ( \log x_1, \ldots, \log x_D )^T as θ → 0. We favour transformation (8.15) in view of its closer connection, via (8.8), to Aitchison's centred log-ratio transformation. In addition, the α-transformation can be defined even when zero values are present, but in that case α must be strictly positive (α > 0). Different values of α might lead to better results, but the problem is that if some components go to zero the transformation breaks down. The two functions below calculate the α-transformation and its inverse. Note that the alfa function also calculates the term \sum_{i=1}^{n} \log \sum_{j=1}^{D} x_{ij}^\alpha, which is part of the Jacobian determinant (8.17). I do this because it is required in the profile log-likelihood function that follows, which becomes faster this way.
} else { ## if a=0 the ilr is calculated
xa <- log(x)
z <- xa - Rfast::rowmeans( xa ) ## this is the clr
sa <- dim(x)[1] * log(D)
}
if ( h ) {
aff <- tcrossprod(z, helm( D ) ) ## multiply by the Helmert sub-matrix
res <- list(sa = sa, aff = aff)
} else res <- list(sa = sa, aff = z)
res
}
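The inverse transformation was not shown above; a minimal sketch consistent with (8.15) and (8.16), again assuming the helm function, is the following. The name is illustrative and the 1/D factor is dropped because it cancels after the closure.
alfainv.sketch <- function(z, a) {
  h <- helm( ncol(z) + 1 )        ## the (D - 1) x D Helmert sub-matrix
  y <- z %*% h                    ## back from R^d onto the zero-sum hyperplane
  if ( a == 0 ) {
    x <- exp(y)                   ## inverse of the centred log-ratio transformation
  } else {
    x <- ( a * y + 1 )^(1/a)      ## inverse of the power transformation (8.16), up to closure
  }
  x / rowSums(x)                  ## close back onto the simplex
}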
8.5.2 The α-distance
Note that (8.20) is simply the Euclidean distance applied to the α-transformed data. Also, as α → 0, (8.20) tends to the Euclidean distance applied to the centred log-ratio transformed data (Aitchison, 1983),
\Delta_0( \mathbf{x}, \mathbf{w} ) = \left[ \sum_{i=1}^{D} \left( \log \frac{ x_i }{ g(\mathbf{x}) } - \log \frac{ w_i }{ g(\mathbf{w}) } \right)^2 \right]^{1/2}, (8.21)
where g(x) is the geometric mean of x we saw in (8.6). This means that, in this case, the centred log-ratio transformation is applied to both compositional vectors and then the Euclidean distance is calculated. If the isometric log-ratio transformation (8.8) were used the result would be the same, because, as we said before, the name isometric comes from the fact that the distances remain the same.
Associated with this one-parameter family of distances (8.20) is a family of Fréchet means (Tsagris et al., 2011),
\boldsymbol{\mu}^{(\alpha)} = \mathrm{C}\left\{ \left( \frac{1}{n} \sum_{j=1}^{n} \frac{ x_{ij}^\alpha }{ \sum_{k=1}^{D} x_{kj}^\alpha } \right)^{1/\alpha} \right\}_{i = 1, \ldots, D}. (8.22)
This agrees with (8.5) when α = 1 and with (8.4) when α = 0. Now you can go to the
ternary function we saw before and add the Fréchet mean as well to see how it works. For
an example of this with real and simulated data see Tsagris et al. (2011).
frechet <- function(x, a) {
## x contains the compositional data
## a is the power parameter, usually between -1 and 1
if (a == 0) {
m1 <- exp( Rfast::colmeans(Rfast::Log(x)) )
m <- m1 / sum( m1 ) ## closed geometric mean
} else {
xa <- x^a
z <- xa / Rfast::rowsums(xa)
m1 <- Rfast::colmeans(z) ^ ( 1 / a )
m <- m1 / sum(m1) ## frechet mean in general
}
m
}
Similarly to the Box-Cox transformation, the most classical way of choosing the value of the transformation parameter is through maximisation of the profile log-likelihood of that parameter. The most widely used multivariate parametric model is the multivariate normal, and the assumption we impose on the data is that after the α-transformation they can be modelled by a multivariate normal distribution. The two versions lead to equivalent versions of the multivariate normal. The density of the d-dimensional multivariate normal after the α-transformation is
f\left( \mathbf{B}_\alpha(\mathbf{x}) \right) = \frac{ (2\pi)^{-d/2} }{ | \boldsymbol{\Sigma}_\alpha |^{1/2} } \exp\left[ -\frac{1}{2} \left( \mathbf{B}_\alpha - \bar{\mathbf{B}}_\alpha \right)^T \boldsymbol{\Sigma}_\alpha^{-1} \left( \mathbf{B}_\alpha - \bar{\mathbf{B}}_\alpha \right) \right] \left| \mathbf{J}_\alpha^0(\mathbf{x}) \right|, (8.23)
where | \mathbf{J}_\alpha^0(\mathbf{x}) | is the Jacobian determinant of the α-transformation (8.17). So, essentially, (8.23) is simply the density of the multivariate normal with an extra part, the Jacobian determinant. What remains now is to maximise the log-likelihood version of (8.23) with respect to α. This task is feasible in two ways: either by using an optimisation algorithm such as Nelder-Mead (Nelder and Mead, 1965) (using the command optim), or by evaluating the profile log-likelihood of α for a range of values of α:
Step 1. Apply the α-transformation (8.15) to the data for a given value of α.
Step 2. Calculate the sample mean vector and sample covariance matrix of the transformed
data.
Step 3. Evaluate the log-likelihood at the sample estimates of the mean vector and covariance
matrix.
Step 4. Repeat Steps 1-3 for a range of values of α.
for ( i in 1:length(a) ) {
trans <- alfa( x, a[i] )
aff <- trans$aff ## the alpha-transformation
sa <- trans$sa ## part of the Jacobian determinant as well
qa[i] <- - n/2 * log( abs( det( cov(aff) ) ) ) + (a[i] - 1) * ja - D * sa
}
qa <- qa + con
## the green lines show a 95% CI for the true value of
## alpha using a chi-square distribution
b <- max(qa) - qchisq(0.95, 1)/2
plot(a, qa, type = "l", xlab = expression( paste(alpha, " values", sep = "") ),
ylab = "Profile log-likelihood")
abline(h = b, col = 2)
ci <- c( min(a[qa >= b]), max(a[qa >= b]) )
names(ci) <- paste(c("2.5", "97.5"), "%", sep = "")
abline(v = ci[1], col = 3, lty = 2)
abline(v = ci[2], col = 3, lty = 2)
}
Below is a faster function, which also allows for bootstrap confidence intervals (the percentile method, as Efron and Tibshirani (1993) call it). Parallel computation is offered as an option and is advised for large sample sizes, many components, or combinations of both.
x <- as.matrix(x)
x <- x / Rfast::rowsums(x)
n <- dim(x)[1] ## sample size
f <- (n - 1) / n
D <- dim(x)[2] ## number of components
d <- D - 1 ## dimensionality of the simplex
ja <- sum( log(x) ) ## part of the Jacobian of the alpha transformation
con <- -n / 2 * d * log(2 * pi) - (n - 1) * d/2 + n * (d + 1/2) * log(D)
pa <- function(a, x) {
trans <- alfa(x, a)
z <- trans$aff ## the alpha-transformation
sa <- trans$sa ## part of the Jacobian determinant as well
-n/2 * log( abs( det( f * cov(z) ) ) ) + (a - 1) * ja - D * sa
}
if (B == 1) {
ell <- optimize(pa, c(-1, 1), x = x, maximum = TRUE )
aff0 <- alfa(x, 0)
z0 <- aff0$aff
sa <- aff0$sa ## part of the Jacobian determinant as well
lik0 <- con - n/2 * log( abs( det( f * cov(z0) ) ) ) -
ja - D * sa
result <- c(ell$maximum, ell$objective + con, lik0)
names(result) <- c("best alpha", "max log-lik", "log-lik at 0")
} else {
ab <- numeric(B)
if (ncores == 1) {
runtime <- proc.time()
for (i in 1:B) {
ind <- sample(1:n, n, replace = TRUE)
ab[i] <- optimize(pa, c(-1, 1), x = x[ind, ], maximum = TRUE )$maximum
}
runtime <- proc.time() - runtime
} else {
runtime <- proc.time()
cl <- makePSOCKcluster(ncores)
registerDoParallel(cl)
ww <- foreach::foreach( i = 1:B, .combine = rbind,
.export = c("alfa", "helm") ) %dopar% {
ind <- sample(1:n, n, replace = TRUE)
optimize(pa, c(-1, 1), x = x[ind, ], maximum = TRUE )$maximum  ## value returned to foreach
}
stopCluster(cl)
ab <- as.vector( ww )
runtime <- proc.time() - runtime
}
8.6 Regression for compositional data
8.6.1 Regression using the additive log-ratio transformation
The additive log-ratio transformation (8.7) will be used for the implementation of regression for compositional data. We could of course use the isometric log-ratio transformation (8.8), but then the interpretation of the parameters is really hard and, as the dimensions increase, it can become impossible. The idea is simple: apply the additive log-ratio transformation, then do multivariate regression and, in the end, close the fitted values back into the simplex using the inverse of the transformation.
The multivariate regression offered as an option in the current function is either standard multivariate regression (see the function multivreg) or robust multivariate regression (see the function rob.multivreg). Section 4.2 has more functions for multivariate regression analysis; should the user wish to use them, he/she can simply change the function comp.reg and incorporate the other regression functions.
\log \frac{ y_i }{ y_D } = \mathbf{x}^T \boldsymbol{\beta}_i \ \Leftrightarrow \ \log y_i = \log y_D + \mathbf{x}^T \boldsymbol{\beta}_i, \quad i = 1, \ldots, d, (8.24)
where the \boldsymbol{\beta}_i are the vectors of regression coefficients, each of length p + 1, and p is the number of independent variables.
We see from (8.24) that when the dependent variable is the logarithm of any component,
the logarithm of the common divisor component can be treated as an offset variable; an
independent variable with coefficient equal to 1. But this is not something to worry about.
The only issue is that no zero values are allowed.
Let us now see an example in order to make this compositional regression a bit clearer. Suppose we have the arctic lake data from Aitchison (2003), which consist of 39 measurements of three elements, sand, silt and clay, taken at different depths (in metres) of an arctic lake. The logarithm of the depth is the independent variable (it is a good idea to use the logarithm of the independent variables, especially when they take high values). The result of the regression is
We can see that the clay plays the role of the common divisor component. If the depth
is 1 meter, so log 1 = 0, then we can say that the percentage of sand is higher than that of
clay and the percentage of silt is higher than that of clay as well. The percentage of sand is
also higher than the percentage of silt (the constant term in the first line is higher than the
constant term in the second line). To find out the value of the composition at 1 metre of water depth we compute
\mathrm{C}\left( e^{9.697}, \, e^{4.805}, \, 1 \right) = ( 0.9925, \, 0.007, \, 0.0001 ),
where C(.) is the closure operation, which means that we divide by the sum of the vector so that it becomes compositional, i.e. its elements sum to 1. The negative coefficient
in the first line means that sand reduces relatively to clay as the water depth increases. The
same is true for silt relatively to clay. A good way to understand these coefficients is to plot
the logarithms of the ratios as a function of the independent variable. And then you will see
why there is a negative sign.
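The closure calculation above is easy to verify in R, using the two estimated constant terms from the example.

m <- c( exp(9.697), exp(4.805), 1 )  ## fitted ratios at log-depth equal to 0
round( m / sum(m), 4 )               ## closure: divide by the sum so the parts sum to 1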
The next function implements this regression.
if (type == "classical") {
runtime <- proc.time()
mod <- multivreg(z, x, plot = FALSE, xnew = xnew) ## multivariate regression
res <- mod$suma
di <- ncol(z)
beta <- seb <- matrix(nrow = NCOL(x) + 1, ncol = di)
for (i in 1:di) {
beta[, i] <- res[, 1, i]
seb[, i] <- res[, 2, i]
}
rownames(seb) <- rownames(beta) <- rownames(res[, , 1])
colnames(seb) <- colnames(beta) <- colnames(mod$fitted)
est1 <- mod$est
runtime <- proc.time() - runtime
}
if (type == "spatial") {
mod <- spatmed.reg(z, x, xnew = xnew) ## spatial median regression
beta <- mod$beta
seb <- mod$seb
est1 <- mod$est
runtime <- mod$runtime
}
An alternative method for regression is to use the Dirichlet distribution, (8.9) and (8.10). The
second parametrisation (8.10), though, is more convenient and the estimated parameters have the same
interpretation as in the additive logistic regression (8.24).
We mentioned before that Maier (2011) has created an R package for Dirichlet regression.
For more information the reader is referred to Maier's report (Maier, 2014). The next
function is by no means intended to substitute Maier's functions. Maier (2011) allows the
possibility of modelling φ as well, linking it with the same covariates, where an exponential
link is necessary to ensure that the fitted φ_i are always positive. This is presented in the
next section.
Influence diagnostics are provided by Hijazi (2006) who suggested using a scaled Pear-
son χ2 statistic to identify influential observations. This was first introduced in Boyles (1997).
The idea is simple, use the following approximation.
$$(\phi + 1) \sum_{i=1}^{D} \frac{\left(y_i - \hat{y}_i\right)^2}{\hat{y}_i} \sim \chi^2_{D-1}. \quad (8.25)$$
So, one has to calculate the above statistic for all observations. Those observations ex-
ceeding the cut-off point of χ2D−1 are considered to have possibly high influence on the
regression model.
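For instance, the flagging can be done in a couple of lines (a sketch; y, est and phi are assumed to be the observed compositions, the n × D matrix of fitted values and the estimated precision parameter).

lev <- (phi + 1) * rowSums( (y - est)^2 / est )  ## the statistic in (8.25)
which( lev > qchisq(0.95, ncol(y) - 1) )         ## observations exceeding the cut-off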
The Dirichlet density (the same as in (8.10)) is
$$f(\mathbf{y}) = \frac{\Gamma\left(\sum_{i=1}^{D} \phi a_i^*\right)}{\prod_{i=1}^{D} \Gamma\left(\phi a_i^*\right)} \prod_{i=1}^{D} y_i^{\phi a_i^* - 1},$$
where φ = ∑_{i=1}^{D} a_i and ∑_{i=1}^{D} a_i^* = 1. The link function used for the parameters (except for φ) is
$$a_1^* = \frac{1}{1 + \sum_{j=2}^{D} e^{\mathbf{x}^T \boldsymbol{\beta}_j}} \quad \text{and} \quad a_i^* = \frac{e^{\mathbf{x}^T \boldsymbol{\beta}_i}}{1 + \sum_{j=2}^{D} e^{\mathbf{x}^T \boldsymbol{\beta}_j}} \quad \text{for } i = 2, \ldots, D.$$
The log-likelihood is then
$$\ell = n \log \Gamma(\phi) - \sum_{j=1}^{n} \sum_{i=1}^{D} \log \Gamma\left(\phi a_i^*\right) + \sum_{j=1}^{n} \sum_{i=1}^{D} \left(\phi a_i^* - 1\right) \log y_{ij}.$$
The next function offers Dirichlet regression and produces an informative output. It is
important for the compositional data (the dependent variable) to have column names, otherwise
the function will not produce an output. If you do not want this, simply remove the
lines in the code which refer to the column names of the compositional data.
diri.reg <- function(y, x, plot = TRUE, xnew = NULL) {
## y is the compositional data
dm <- dim(y)
n <- dm[1] ## sample size
## the design matrix is created
x <- model.matrix(y ~ ., as.data.frame(x) )
d <- dm[2] - 1 ## dimensionality of the simplex
z <- log(y)
runtime <- proc.time()
rla <- z[, -1] - z[, 1] ## log(y[, -1] / y[, 1]) ## alr transformation
ini <- as.vector( coef( lm.fit(x, rla) ) ) ## initial values
## based on the logistic normal
## the next lines optimize the dirireg function and
## estimate the parameter values
el <- NULL
options(warn = -1)
qa <- nlm(dirireg, c(3, ini), z = z, x = x, n = n, d = d)
el[1] <- -qa$minimum
qa <- nlm(dirireg, qa$estimate, z = z, x = x, n = n, d = d)
el[2] <- -qa$minimum
vim <- 2
if ( !is.null( colnames(y) ) ) {
colnames(seb) <- colnames(y[, -1])
} else colnames(seb) <- paste("Y", 1:d, sep = "")
if ( !is.null(xnew) ) {
xnew <- model.matrix(~., data.frame(xnew) )
mu <- cbind( 1, exp(xnew %*% beta) )
est <- mu / Rfast::rowsums(mu)
} else {
mu <- cbind( 1, exp(x %*% beta) )
est <- mu / Rfast::rowsums(mu) ## fitted values
lev <- ( exp(log.phi) + 1 ) * Rfast::rowsums( (y - est)^2 / mu )
if ( plot ) {
plot(1:n, lev, main = "Influence values", xlab = "Observations",
ylab = expression( paste("Pearson ", chi^2, "statistic") ) )
lines(1:n, lev, type = "h")
abline(h = qchisq(0.95, d), lty = 2, col = 2)
}
}
if ( is.null( colnames(x) ) ) {
p <- dim(x)[2] - 1
rownames(beta) <- c("constant", paste("X", 1:p, sep = "") )
if ( !is.null(seb) ) {
rownames(seb) <- c("constant", paste("X", 1:p, sep = "") )
}
} else {
rownames(beta) <- c("constant", colnames(x)[-1] )
if ( !is.null(seb) ) rownames(seb) <- c("constant", colnames(x)[-1] )
}
In the previous section we linked the parameters with some covariates. The mixed Dirichlet
regression refers to the case when the parameter φ is also linked to the same covariates. So,
instead of having the same value of φ for all compositional vectors, we allow it to vary as a
function of the covariates.
The link function used here is the logarithm, to ensure that the precision parameter is always positive,
$$\phi_j^* = e^{\sum_{k=1}^{p} x_{jk} \gamma_k}. \quad (8.26)$$
This type of regression can also be found in Maier's report (Maier, 2014), as we have mentioned before.
This means that we have to substitute the precision parameter φ in (8.11) with (8.26). The log-likelihood then becomes
$$\ell = \sum_{j=1}^{n} \log \Gamma\left(\phi_j^*\right) - \sum_{j=1}^{n} \sum_{i=1}^{D} \log \Gamma\left(\phi_j^* a_i^*\right) + \sum_{j=1}^{n} \sum_{i=1}^{D} \left(\phi_j^* a_i^* - 1\right) \log y_{ij}.$$
The next function offers this mixed Dirichlet regression and produces an informative output. It is
important for the compositional data (the dependent variable) to have column names, otherwise
the function will not produce an output. If you do not want this, simply remove the
lines in the code which refer to the column names of the compositional data.
## based on the logistic normal
## the next lines optimize the dirireg2 function and
## estimate the parameter values
el <- NULL
qa <- nlm(dirireg2, c(rnorm(p, 0, 0.1), as.vector( t(ini) ) ) )
el[1] <- -qa$minimum
qa <- nlm(dirireg2, qa$estimate)
el[2] <- -qa$minimum
vim <- 2
while (el[vim] - el[vim - 1] > 1e-06) {
## the tolerance value can of course change
vim <- vim + 1
qa <- nlm(dirireg2, qa$estimate)
el[vim] <- -qa$minimum
}
if ( !is.null( colnames(y) ) ) {
colnames(beta) <- colnames(seb) <- colnames(y[, -1])
} else colnames(beta) <- colnames(seb) <- paste("Y", 1:d, sep = "")
if ( !is.null(xnew) ) {
xnew <- model.matrix(~., data.frame(xnew) )
mu <- cbind( 1, exp(xnew %*% beta) )
est <- mu / Rfast::rowsums(mu)
} else {
mu <- cbind( 1, exp(x %*% beta) )
est <- mu / Rfast::rowsums(mu) ## fitted values
}
if ( is.null(colnames(x)) ) {
p <- dim(x)[2] - 1
rownames(beta) <- c("constant", paste("X", 1:p, sep = "") )
if ( !is.null(seb) ) rownames(seb) <- c("constant", paste("X", 1:p, sep = "") )
} else {
rownames(beta) <- c("constant", colnames(x)[-1] )
if ( !is.null(seb) ) rownames(seb) <- c("constant", colnames(x)[-1] )
}
The next regression method is simply an OLS, like comp.reg, but applied to the raw compositional
data, i.e. without log-ratio transforming them. I saw this approach in Murteira
and Ramalho (2014), where they mention that B̂, the matrix of the estimated regression
coefficients, is consistent and asymptotically normal. How is B̂ calculated? Simply by minimising
the sum of squares of the residuals,
$$\sum_{i=1}^{n} \mathbf{u}_i^T \mathbf{u}_i, \quad \text{where } \mathbf{u}_i = \mathbf{y}_i - G_i(\mathbf{B}) \text{ and}$$
$$G_i(\mathbf{B}) = \left( \frac{1}{\sum_{j=1}^{D} e^{\mathbf{x}_i^T \boldsymbol{\beta}_j}}, \frac{e^{\mathbf{x}_i^T \boldsymbol{\beta}_2}}{\sum_{j=1}^{D} e^{\mathbf{x}_i^T \boldsymbol{\beta}_j}}, \ldots, \frac{e^{\mathbf{x}_i^T \boldsymbol{\beta}_D}}{\sum_{j=1}^{D} e^{\mathbf{x}_i^T \boldsymbol{\beta}_j}} \right),$$
with y_i ∈ S^d and d = D − 1, where D denotes the number of components. The link function
is the same as before, the inverse of the additive log-ratio transformation.
The next R function offers the possibility of bootstrapping the standard errors of the
betas. If no bootstrap is selected, no standard errors will be produced.
## y is dependent variable, the compositional data
## x is the independent variable(s)
## B is the number of bootstrap samples used to obtain
## standard errors for the betas
## if B==1 no bootstrap is performed and no standard errors reported
## if ncores=1, then 1 processor is used, otherwise
## more are used (parallel computing)
## the next lines minimize the reg function and obtain the betas
ini <- as.vector( t( coef(lm.fit(x, y[, -1]) ) ) ) ## initial values
oop <- options(warn = -1)
on.exit(options(oop))
qa <- nlm(olsreg, ini, y = y, x = x, d = d)
qa <- nlm(olsreg, qa$estimate, y = y, x = x, d = d)
qa <- nlm(olsreg, qa$estimate, y = y, x = x, d = d)
beta <- matrix(qa$estimate, byrow = TRUE, ncol = d)
seb <- NULL
runtime <- proc.time() - runtime
if (B > 1) {
nc <- ncores
if (nc == 1) {
runtime <- proc.time()
betaboot <- matrix(nrow = B, ncol = length(ini))
for (i in 1:B) {
ida <- sample(1:n, n, replace = TRUE)
yb <- y[ida, ]
xb <- x[ida, ]
ini <- as.vector( t( coef(lm.fit(xb, yb[, -1]) ) ) ) ## initial values
qa <- nlm(olsreg, ini, y = yb, x = xb, d = d)
qa <- nlm(olsreg, qa$estimate, y = yb, x = xb, d = d)
qa <- nlm(olsreg, qa$estimate, y = yb, x = xb, d = d)
betaboot[i, ] <- qa$estimate
}
s <- Rfast::colVars(betaboot, std = TRUE)
seb <- matrix(s, byrow = TRUE, ncol = d)
runtime <- proc.time() - runtime
} else {
runtime <- proc.time()
betaboot <- matrix(nrow = B, ncol = length(ini) )
cl <- makePSOCKcluster(ncores)
registerDoParallel(cl)
ww <- foreach::foreach(i = 1:B, .combine = rbind,
.export = "olsreg") %dopar% {
ida <- sample(1:n, n, replace = TRUE)
yb <- y[ida, ]
xb <- x[ida, ]
ini <- as.vector( t( coef(lm.fit(xb, yb[, -1]) ) ) ) ## initial values
qa <- nlm(olsreg, ini, y = yb, x = xb, d = d)
qa <- nlm(olsreg, qa$estimate, y = yb, x = xb, d = d)
qa <- nlm(olsreg, qa$estimate, y = yb, x = xb, d = d)
qa$estimate ## return the bootstrap estimates of this replicate
}
stopCluster(cl)
s <- Rfast::colVars(ww, std = TRUE)
seb <- matrix(s, byrow = TRUE, ncol = d)
runtime <- proc.time() - runtime
}
}
if ( is.null(xnew) ) {
mu <- cbind( 1, exp(x %*% beta) )
est <- mu / Rfast::rowsums(mu)
} else {
xnew <- model.matrix(~., data.frame(xnew) )
mu <- cbind(1, exp(xnew %*% beta))
est <- mu / Rfast::rowsums(mu)
}
if ( is.null(colnames(x)) ) {
p <- dim(x)[2] - 1
rownames(beta) <- c("constant", paste("X", 1:p, sep = "") )
if ( !is.null(seb) ) rownames(seb) <- c("constant",
paste("X", 1:p, sep = "") )
} else {
rownames(beta) <- c("constant", colnames(x)[-1] )
if ( !is.null(seb) ) rownames(seb) <- c("constant", colnames(x)[-1] )
}
8.6.5 Multinomial logit regression (or Kullback-Leibler divergence based regression for
compositional data)
Murteira and Ramalho (2014) studied the multinomial logit regression using a more general
transformation than the inverse of the additive logistic transformation. In fact, (8.27)
corresponds to the maximisation of the multinomial log-likelihood. However, here we will
use the inverse of the additive logistic transformation as the link function. The goal is to
minimise the KL divergence with respect to the regression parameters. In logistic regression,
this method is called estimation by minimum discrimination information (Agresti, 2002).
$$\min_{\boldsymbol{\beta}_i} \left\{ \sum_{j=1}^{n} \sum_{i=1}^{D} y_{ij} \log \frac{y_{ij}}{f(\mathbf{x})} \right\} = \min_{\boldsymbol{\beta}_i} \left\{ -\sum_{j=1}^{n} \sum_{i=1}^{D} y_{ij} \log f(\mathbf{x}) \right\} = \max_{\boldsymbol{\beta}_i} \left\{ \sum_{j=1}^{n} \sum_{i=1}^{D} y_{ij} \log f(\mathbf{x}) \right\}, \quad (8.27)$$
where x stands for the design matrix and f is the function defined below. For every value of
the composition y_{ij} there corresponds a fitted value ŷ_{ij}, which is a function of some covariates
through an exponential form:
$$f(\mathbf{x}): \quad \hat{y}_{1j} = \frac{1}{1 + \sum_{l=2}^{D} e^{\mathbf{x}^T \boldsymbol{\beta}_l}}, \qquad \hat{y}_{ij} = \frac{e^{\mathbf{x}^T \boldsymbol{\beta}_i}}{1 + \sum_{l=2}^{D} e^{\mathbf{x}^T \boldsymbol{\beta}_l}} \quad \text{for } i = 2, \ldots, D.$$
As for the properties of the coefficients, Murteira and Ramalho (2014) show that they are
asymptotically normal, so this is good news. However, I saw that their standard errors are
not similar to the ones obtained from the other methods. So, I offer the option of bootstrap
estimation of their standard errors in the next R function.
The second key thing about this regression method is that even if there are zeros in the
observed values, there is absolutely no problem. This advantage of the method was not
highlighted by Murteira and Ramalho (2014), so I am making this point now. Just think
about it: no need for zero value imputation, so no extra added bias or variability.
## the next lines minimize the reg function and obtain the estimated betas
ini <- as.vector( t( coef( lm.fit(x, y[, -1]) ) ) ) ## initial values
runtime <- proc.time() - runtime
if (B > 1) {
nc <- ncores
if (nc == 1) {
runtime <- proc.time()
betaboot <- matrix(nrow = B, ncol = length(ini))
for (i in 1:B) {
ida <- sample(1:n, n, replace = TRUE)
yb <- y[ida, ]
xb <- x[ida, ]
ini <- as.vector( t( coef( lm.fit(xb, yb[, -1]) ) ) ) ## initial values
qa <- nlm(klreg, ini, y = yb, x = xb, d = d)
qa <- nlm(klreg, qa$estimate, y = yb, x = xb, d = d)
qa <- nlm(klreg, qa$estimate, y = yb, x = xb, d = d)
betaboot[i, ] <- qa$estimate
}
s <- Rfast::colVars(betaboot, std = TRUE)
seb <- matrix(s, byrow = TRUE, ncol = d)
runtime <- proc.time() - runtime
} else {
runtime <- proc.time()
betaboot <- matrix(nrow = B, ncol = length(ini) )
cl <- makePSOCKcluster(ncores)
registerDoParallel(cl)
ww <- foreach::foreach(i = 1:B, .combine = rbind, .export = "klreg") %dopar% {
ida <- sample(1:n, n, replace = TRUE)
yb <- y[ida, ]
xb <- x[ida, ]
ini <- as.vector( t( coef( lm.fit(xb, yb[, -1]) ) ) ) ## initial values
qa <- nlm(klreg, ini, y = yb, x = xb, d = d)
qa <- nlm(klreg, qa$estimate, y = yb, x = xb, d = d)
qa <- nlm(klreg, qa$estimate, y = yb, x = xb, d = d)
qa$estimate ## return the bootstrap estimates of this replicate
}
stopCluster(cl)
s <- Rfast::colVars(ww, std = TRUE)
seb <- matrix(s, byrow = TRUE, ncol = d)
runtime <- proc.time() - runtime
}
}
if ( is.null(xnew) ) {
mu <- cbind( 1, exp(x %*% beta) )
est <- mu / Rfast::rowsums(mu)
} else {
xnew <- model.matrix(~., data.frame(xnew) )
mu <- cbind(1, exp(xnew %*% beta))
est <- mu / Rfast::rowsums(mu)
}
if ( is.null(colnames(x)) ) {
p <- dim(x)[2] - 1
rownames(beta) <- c("constant", paste("X", 1:p, sep = "") )
if ( !is.null(seb) ) rownames(seb) <- c("constant", paste("X", 1:p, sep = "") )
} else {
rownames(beta) <- c("constant", colnames(x)[-1] )
if ( !is.null(seb) ) rownames(seb) <- c("constant", colnames(x)[-1] )
}
I have recently suggested (Tsagris, 2015a) that as a measure of the distance between two
compositions we can use a special case of the Jensen-Shannon divergence,
$$\text{ES-OV}(\mathbf{x}, \mathbf{y}) = \sum_{j=1}^{D} \left( x_j \log \frac{2 x_j}{x_j + y_j} + y_j \log \frac{2 y_j}{x_j + y_j} \right), \quad (8.28)$$
where x and y ∈ S^d. Endres and Schindelin (2003) and Österreicher and Vajda (2003) proved,
independently, that (8.28) satisfies the triangle inequality and thus it is a metric. The name
ES-OV comes from the researchers' initials. In fact, (8.28) is the square of the metric, but still a
metric, and we will use this version.
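A tiny helper makes the computation of (8.28) transparent (this is only an illustration, not a function from the package).

esov <- function(x, y) {
  ## ES-OV divergence (8.28) between two compositional vectors with positive parts
  sum( x * log( 2 * x / (x + y) ) + y * log( 2 * y / (x + y) ) )
}
esov( c(0.2, 0.3, 0.5), c(0.25, 0.25, 0.5) )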
The idea is simple and straightforward: minimise the ES-OV metric between the
observed and the fitted compositions with respect to the beta coefficients,
$$\min_{\boldsymbol{\beta}} \sum_{i=1}^{D} \left( y_i \log \frac{2 y_i}{y_i + \hat{y}_i} + \hat{y}_i \log \frac{2 \hat{y}_i}{y_i + \hat{y}_i} \right). \quad (8.29)$$
## the next lines minimize the jsreg function and obtain the estimated betas
ini <- as.vector( t( kl.compreg(y, x[, -1])$beta ) )
if (B > 1) {
betaboot <- matrix( nrow = B, ncol = length(ini) )
nc <- ncores
if (nc == 1) {
runtime <- proc.time()
for (i in 1:B) {
ida <- sample( 1:n, n, replace = TRUE )
yb <- y[ida, ]
xb <- x[ida, ]
ini <- as.vector( t( kl.compreg(yb, xb[, -1])$beta ) ) ## initial values
qa <- nlm(jsreg, ini, y = yb, x = xb, d = d)
qa <- nlm(jsreg, qa$estimate, y = yb, x = xb, d = d)
qa <- nlm(jsreg, qa$estimate, y = yb, x = xb, d = d)
betaboot[i, ] <- qa$estimate
}
s <- Rfast::colVars(betaboot, std = TRUE)
seb <- matrix(s, byrow = TRUE, ncol = d)
runtime <- proc.time() - runtime
} else {
runtime <- proc.time()
cl <- makePSOCKcluster(ncores)
registerDoParallel(cl)
ww <- foreach::foreach(i = 1:B, .combine = rbind, .export="jsreg") %dopar% {
ida <- sample(1:n, n, replace = TRUE)
yb <- y[ida, ]
xb <- x[ida, ]
ini <- as.vector( t( kl.compreg(yb, xb[, -1])$beta ) ) ## initial values
qa <- nlm(jsreg, ini, y = yb, x = xb, d = d)
qa <- nlm(jsreg, qa$estimate, y = yb, x = xb, d = d)
qa <- nlm(jsreg, qa$estimate, y = yb, x = xb, d = d)
betaboot[i, ] <- qa$estimate
}
stopCluster(cl)
s <- Rfast::colVars(ww, std = TRUE)
seb <- matrix(s, byrow = TRUE, ncol = d)
runtime <- proc.time() - runtime
}
}
if ( is.null(xnew) ) {
mu <- cbind( 1, exp(x %*% beta) )
est <- mu / Rfast::rowsums(mu)
} else {
xnew <- model.matrix(~., data.frame(xnew) )
mu <- cbind(1, exp(xnew %*% beta))
est <- mu / Rfast::rowsums(mu)
}
if ( is.null(colnames(x)) ) {
p <- dim(x)[2] - 1
rownames(beta) <- c("constant", paste("X", 1:p, sep = "") )
if ( !is.null(seb) ) rownames(seb) <- c("constant", paste("X", 1:p, sep = "") )
} else {
rownames(beta) <- c("constant", colnames(x)[-1] )
if ( !is.null(seb) ) rownames(seb) <- c("constant", colnames(x)[-1] )
}
We will use the inverse of the additive logistic transformation (8.7), combined with the α-
transformation (8.15), as a link function. This is a new regression using the α-transformation
which allows for more flexibility even in the presence of zero values (Tsagris, 2015b). An-
other feature of this method is that the line is always curved (unless α is far away from zero)
and so it can be seen not only as a generalization of the log-ratio regression but also as a
flexible type of compositional regression, in the sense that the curvature of the line is chosen
based on some discrepancy criteria, examined later.
In order for the fitted values to satisfy the constraint imposed by the simplex we model
the inverse of the additive logistic transformation of the mean response. Hence, the fitted
values will always lie within Sd and we also retain the flexibility the α-transformation offers.
We assume that the conditional mean of the observed composition can be written as a
non-linear function of some covariates
$$\mu_1 = \frac{1}{1 + \sum_{j=1}^{d} e^{\mathbf{x}^T \boldsymbol{\beta}_j}}, \qquad \mu_i = \frac{e^{\mathbf{x}^T \boldsymbol{\beta}_{i-1}}}{1 + \sum_{j=1}^{d} e^{\mathbf{x}^T \boldsymbol{\beta}_j}} \quad \text{for } i = 2, \ldots, D, \quad (8.30)$$
where β_i = (β_{0i}, β_{1i}, ..., β_{pi})^T, i = 1, ..., d, and p denotes the number of covariates.
The profile log-likelihood of α is
$$\ell(\alpha) = -\frac{n}{2} \log \left|\hat{\boldsymbol{\Sigma}}_\alpha\right| - \frac{1}{2} \operatorname{tr}\left[ \left(\mathbf{Y}_\alpha - \mathbf{M}_\alpha\right) \hat{\boldsymbol{\Sigma}}_\alpha^{-1} \left(\mathbf{Y}_\alpha - \mathbf{M}_\alpha\right)^T \right], \quad (8.31)$$
where Y_α and M_α are the α-transformed response and fitted compositional vectors. We
have ignored the Jacobian determinant of the α-transformation since it plays no role in the
optimization process and the choice of α. For each value of α we maximize the value of this
objective function (8.31). The Σ̂_α need not be numerically estimated, since B̂, the matrix
of the estimates, and Σ̂_α are statistically independent (Mardia et al., 1979). The maximum
likelihood estimator of Σ is the usual residual-based one (Mardia et al., 1979),
$$\hat{\boldsymbol{\Sigma}}_\alpha = \frac{1}{n} \left(\mathbf{Y}_\alpha - \mathbf{M}_\alpha\right)^T \left(\mathbf{Y}_\alpha - \mathbf{M}_\alpha\right).$$
A final key feature we have to note is that when α → 0 we end up with the additive
log-ratio regression (8.24). In the next function the last argument ("yb") allows you to supply
the α-transformed data directly. This is useful in the case of cross-validation, to avoid
transforming the data in every fold; the data need only be transformed once for every value of α.
D <- dim(y)[2]
d <- D - 1 ## dimensionality of the simplex
dm <- dim(x)
p <- dm[2] ; n <- dm[1]
if ( is.null(yb) ) {
ya <- alfa(y, a)$aff
} else ya <- yb
x <- model.matrix(ya ~., data.frame(x) )
if ( a == 0 ) {
mod <- comp.reg(y, x[, -1], yb = yb)
be <- mod$be
seb <- mod$seb
runtime <- mod$runtime
} else {
runtime <- proc.time()
ha <- t( helm(D) )
ini <- as.vector( solve(crossprod(x), crossprod(x, ya) ) )
qa <- nlminb( ini, reg, ya = ya, x = x, ha = ha, d = d, n = n,
D = D, control = list(iter.max = 2000) )
qa <- optim( qa$par, reg, ya = ya, x = x, ha = ha, d = d, n = n,
D = D, control = list(maxit = 5000) )
qa <- optim( qa$par, reg, ya = ya, x = x, ha = ha, d = d, n = n,
D = D, control = list(maxit = 5000) )
qa <- optim( qa$par, reg, ya = ya, x = x, ha = ha, d = d, n = n,
D = D, control = list(maxit = 5000), hessian = TRUE )
be <- matrix(qa$par, ncol = d)
runtime <- proc.time() - runtime
if ( seb ) {
seb <- sqrt( diag( Rfast::spdinv(qa$hessian) ) )
seb <- matrix(seb, ncol = d)
} else seb <- NULL
} ## end if (a == 0)
if ( is.null( colnames(x) ) ) {
p <- dim(x)[2] - 1
rownames(be) <- c("constant", paste("X", 1:p, sep = "") )
if ( !is.null(seb) ) rownames(seb) <- c("constant", paste("X", 1:p, sep = "") )
} else {
rownames(be) <- c("constant", colnames(x)[-1] )
if ( !is.null(seb) ) rownames(seb) <- c("constant", colnames(x)[-1] )
}
The disadvantage of the profile log-likelihood of α (should you use it), for choosing the
value of α, is that it does not allow zeros. On the other hand, it provides the maximum
likelihood estimates which are asymptotically normal. But the latter is not entirely true,
since the resulting normal is not defined on the whole of R^d.
I suggest an alternative and perhaps better way of choosing the value of α. Better in
the sense that it is trying to take into account the proximity between the observed and the
fitted values. The criterion is to choose the α which minimizes twice the Kullback-Leibler
divergence (Kullback, 1997)
$$\text{KL} = 2 \sum_{j=1}^{n} \sum_{i=1}^{D} y_{ij} \log \frac{y_{ij}}{\hat{y}_{ij}}, \quad (8.32)$$
where yij is the observed compositional point and ŷij is the corresponding fitted value. The
form of the deviance for the log-linear models and the logistic regression has the same ex-
pression as well. Hence, I transfer the same form of divergence to compositional data. For
every value of α we estimate the parameters of the regression and choose the value of α
which minimizes (8.32).
The number 2 is there because in the case of D = 2 we end up with the log-likelihood of
the binary logistic regression. The Kullback-Leibler divergence (8.32) takes into account the
divergence or the distance of each of the observed values from the fitted values.
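In practice, once the fitted compositions of a model are available, the criterion (8.32) is a one-liner (a sketch, assuming y and est are the n × D matrices of observed and fitted compositions).

kl <- 2 * sum( y * log(y / est), na.rm = TRUE )  ## twice the Kullback-Leibler divergence (8.32)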
Since I am interested in prediction analysis I use a K-fold cross-validation to choose the
value of α. I split the data into K sets (folds). Each time I leave one set out and fit the model
to the remaining sample (choosing the best value of α and so on). Then, I predict the values of
the compositional data for the set left out and calculate the Kullback-Leibler divergence
(8.32) in order to measure the performance. This is repeated for all K sets (folds) of the data and
the average Kullback-Leibler divergence is obtained.
In the published paper (Tsagris, 2015b) I just used the performance obtained from the
fitted values in order to save time. I could have used a repeated hold-out cross-validation,
but neither of them is the appropriate one.
ina <- 1:n
x <- model.matrix(y ~., data.frame(x) )
if ( is.null(folds) ) folds <- Compositional::makefolds(ina,
nfolds = nfolds, stratified = FALSE, seed = seed)
nfolds <- length(folds)
if (nc <= 1) {
apa <- proc.time()
kula <- matrix(nrow = nfolds, ncol = la)
for (j in 1:la) {
ytr <- alfa(y, a[j])$aff
for (i in 1:nfolds) {
xu <- x[ folds[[ i ]], -1 , drop = FALSE]
yu <- y[ folds[[ i ]], ]
xa <- x[ -folds[[ i ]], -1]
yb <- ytr[ -folds[[ i ]], ]
mod <- alfa.reg(yu, xa, a[j], xnew = xu, yb = yb)
yest <- mod$est
kula[i, j] <- 2 * mean(yu * log(yu / yest), na.rm = TRUE)
}
}
kl <- Rfast::colmeans(kula)
opt <- a[ which.min(kl) ]
val <- which.min(kl)
per <- min(kl, na.rm = TRUE)
pera <- Rfast::rowMins(kula, value = TRUE)
apa <- proc.time() - apa
} else {
apa <- proc.time()
oop <- options(warn = -1)
on.exit( options(oop) )
val <- matrix(a, ncol = nc) ## if the length of a is not equal to the
## dimensions of the matrix val a warning message should appear
## but with oop <- options(warn = -1) you will not see it
cl <- parallel::makePSOCKcluster(nc)
doParallel::registerDoParallel(cl)
if ( is.null(folds) ) folds <- Compositional::makefolds(ina, nfolds = nfolds,
stratified = FALSE, seed = seed)
kula <- foreach::foreach(j = 1:nc, .combine = cbind, .packages = "Rfast",
.export = c("alfa.reg", "alfa", "helm", "comp.reg", "multivreg",
"rowsums", "colmeans", "colVars") ) %dopar% {
ba <- val[, j]
ww <- matrix(nrow = nfolds, ncol = length(ba) )
for ( l in 1:length(ba) ) {
ytr <- alfa(y, ba[l])$aff
for (i in 1:nfolds) {
xu <- x[ folds[[ i ]], -1 , drop = FALSE]
yu <- y[ folds[[ i ]], ]
xa <- x[ -folds[[ i ]], -1]
yb <- ytr[ -folds[[ i ]], ]
mod <- alfa.reg(yu, xa, ba[l], xnew = xu, yb = yb)
yest <- mod$est
ww[i, l] <- 2 * mean(yu * log(yu / yest), na.rm = TRUE)
}
}
return(ww)
}
parallel::stopCluster(cl)
if ( graph ) {
plot( a, kula[1, ], type = 'l', ylim = c( min(kula), max(kula) ),
xlab = expression(alpha), ylab = 'Twice the Kullback Leibler divergence',
cex.lab = 1.3 )
for (i in 2:nfolds) lines(a, kula[i, ])
lines(a, kl, col = 2, lty = 2, lwd = 2)
}
list(runtime = apa, kula = kula, kl = kl, opt = opt, value = per)
}
We will now see a simple approach to the case of both dependent and independent variables
being compositional variables. The key thing is principal component regression. Transform
the independent compositional variables using the isometric log-ratio transformation (8.8).
You can of course use the additive log-ratio transformation (8.7), but I chose the first one as
I do not think it makes that much of a difference.
Perform principal component analysis on the transformed data and calculate the scores.
Use them as the independent variables and do the compositional data regression, either
the classical multivariate regression of the additively log-ratio transformed data (comp.reg
function), the Dirichlet regression (diri.reg or diri.reg2 functions), the OLS (ols.compreg func-
tion) or the multinomial logit regression (kl.compreg function). In addition, you can choose
how many principal components you want to keep. The drawback of this approach is that the
regression coefficients are not unbiased or consistent, since we are using the principal
component scores. So, this approach is mostly for prediction purposes.
In addition, we would not want the independent variables to have zeros. If the dependent
variables have zeros, we can deal with it: use zero value replacement (package robCompositions)
and then the regression models, or use the kl.compreg or ols.compreg functions, which work
even in the presence of zeros.
per <- values/sum(values) ## proportion of each eigenvalue
vec <- eig$vectors ## eigenvectors, or principal components
pc <- z %*% vec[, 1:k] ## PCA scores
The way to choose k, the number of principal components to use, is the same as in the
case of the principal component regression. Split the data into a training and a test set. Use
the training set to estimate the parameters of the model and then use the test set for prediction
purposes. Calculate the Kullback-Leibler divergence of the observed from the fitted
compositional vectors,
$$\sum_{j=1}^{n} \sum_{i=1}^{D} y_{ij} \log \frac{y_{ij}}{\hat{y}_{ij}}.$$
Repeat this process, say, 200 times and calculate the average for different numbers of principal
components. Note that the maximum number of components you can have is p − 1, where
p stands for the number of components of the independent compositional variables.
nu <- round(fraction * n) ## test sample size
deigma <- matrix(nrow = R, ncol = nu)
## deigma will contain the positions of the test set
## this is stored but not showed in the end
## the user can access it though by running
## the commands outside this function
crit <- matrix(nrow = R, ncol = p)
## if seed==TRUE then the results will always be the same
if ( seed ) set.seed(1234567)
for (vim in 1:R) deigma[vim, ] <- sample(1:n, nu)
for (j in 1:p) {
for (i in 1:R) {
ytest <- y[deigma[i, ], ]
xtest <- x[deigma[i, ], ]
ytrain <- y[-deigma[i, ], ]
xtrain <- x[-deigma[i, ], ]
be <- comp.compreg(ytrain, xtrain, type = type, j)$mod$beta
mu1 <- cbind(1, exp(cbind(1, xtest) %*% be))
mu <- mu1/rowSums(mu1)
crit[i, j] <- sum(ytest * log(ytest/mu), na.rm = T)
}
}
mspe <- Rfast::colmeans(crit)
names(mspe) <- paste( "PC ", 1:p )
plot( mspe, type = 'b', ylab = "MSPE values",
xlab = "Number of principal components" )
list(optimal = which.min(mspe), mspe = mspe)
}
8.6.9 Univariate regression where the independent variables are compositional data us-
ing the α-transformation
The α-transformation can be used again in univariate regression when the independent
variables are actually compositional data. The idea is again simple: apply the α-transformation
(8.15) to the compositional data (independent variables) and then perform principal component
regression. If you perform the isometric log-ratio transformation (8.8), for example,
or the α-transformation in general, and then regression directly, you are neglecting the
collinearity issues. That is why I propose (Tsagris, 2015b) to use principal component regression.
What is more, if you have zeros in the data, the α-transformation will still
work without the need for imputation. In that case, α must be non-negative. The next
function is more general than the one found in Tsagris (2015b), in the sense that principal
component regression for binary (binomial) and count data (Poisson) is now offered.
if ( length(unique(y)) == 2 ) {
oiko <- "binomial"
} else if ( sum( abs(y - round(y)) ) == 0 ) { ## all values are integers (counts)
oiko <- "poisson"
} else oiko <- "normal"
if (oiko == 'normal') {
mod <- pcr(y, z, k, xnew = xnew)
} else mod <- glm.pcr(y, z, k, xnew = xnew)
mod ## principal component regression with the alpha-transformed
## compositional data
}
The task now is to choose the optimal pair of (α, k) values. To do so, cross-validation is
employed once again. For a grid of values of α, the data are α-transformed each time and then
the optimal number of principal components is found via cross-validation. Optimal in the sense
of minimising the mean squared error of the predictions. For every value of α, the function transforms
the data and then performs principal component regression according to the distribution set
by the user.
n <- dim(x)[1]
d <- dim(x)[2] - 1
if ( min(x) == 0 ) a <- a[ a > 0 ] ## checks for zero values in the data.
da <- length(a)
ina <- 1:n
if ( is.null(folds) ) folds <- Compositional::makefolds(ina, nfolds = nfolds,
stratified = FALSE, seed = seed)
nfolds <- length(folds)
mspe2 <- array( dim = c( nfolds, d, da) )
if ( model == 'gaussian' ) {
tic <- proc.time()
for ( i in 1:da ) {
z <- Compositional::alfa(x, a[i])$aff
mod <- Compositional::pcr.tune(y, z, nfolds = nfolds,
maxk = maxk, folds = folds, ncores = ncores, seed = seed, graph = FALSE)
mspe2[, , i] <- mod$msp
}
toc <- proc.time() - tic
}
$$h(\mathbf{x}|\boldsymbol{\Theta}) = \sum_{i=1}^{g} \pi_i f_i\left(\mathbf{x}|\boldsymbol{\theta}_i\right),$$
where the mixing probabilities π_i, as well as the parameters θ_i of each component, need to be
estimated also. I will describe the EM algorithm briefly for this example, as I am not an expert on it.
EM stands for Expectation and Maximization, the two steps of the algorithm. The key
idea behind this algorithm is to perform likelihood maximization, or parameter estimation,
when some information is missing. In our case, the missing information is the mixture
probabilities: how many populations there are, and with what mixing probabilities the data
were generated from them. The E step comes in here; it calculates an expected value
for this missing information. Then, with this knowledge, we can maximize the objective
function and estimate its parameters.
The t-th step of the algorithm is briefly described below.
Step 1 (E step). Compute the posterior membership probabilities
$$p_{ij}^{t} = \frac{\pi_i^{t-1} f_i\left(\mathbf{x}_j|\boldsymbol{\theta}_i\right)}{\sum_{m=1}^{g} \pi_m^{t-1} f_m\left(\mathbf{x}_j|\boldsymbol{\theta}_m\right)}.$$
Step 2 (M step). Update the parameters
$$\boldsymbol{\mu}_i^{t} = \frac{\sum_{j=1}^{n} p_{ij}^{t} \mathbf{x}_j}{\sum_{j=1}^{n} p_{ij}^{t}} \quad \text{and} \quad \boldsymbol{\Sigma}_i^{t} = \frac{\sum_{j=1}^{n} p_{ij}^{t} \left(\mathbf{x}_j - \boldsymbol{\mu}_i^{t}\right)\left(\mathbf{x}_j - \boldsymbol{\mu}_i^{t}\right)^T}{\sum_{j=1}^{n} p_{ij}^{t}} \quad (i = 1, \ldots, g).$$
Step 3. Repeat the E and M steps until the log-likelihood does not increase any more.
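For illustration only, the two steps above can be coded in a few lines. This is a bare-bones sketch for a mixture of g multivariate normals with a fixed number of iterations; it is not the gpcm function from the mixture package used below.

library(mvtnorm)
em.sketch <- function(x, g, iters = 50) {
  n <- nrow(x)
  km <- kmeans(x, g)                    ## crude initialisation
  prob <- km$size / n                   ## mixing probabilities
  mu <- km$centers
  sigma <- lapply( 1:g, function(i) cov( x[km$cluster == i, , drop = FALSE] ) )
  for (it in 1:iters) {
    ## E step: posterior probabilities p_ij
    f <- sapply( 1:g, function(i) prob[i] * dmvnorm(x, mu[i, ], sigma[[i]]) )
    p <- f / rowSums(f)
    ## M step: update the mixing probabilities, means and covariance matrices
    prob <- colMeans(p)
    for (i in 1:g) {
      w <- p[, i] / sum( p[, i] )
      mu[i, ] <- colSums( w * x )
      xc <- sweep(x, 2, mu[i, ]) * sqrt(w)
      sigma[[i]] <- crossprod(xc)
    }
  }
  list(prob = prob, mu = mu, sigma = sigma)
}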
Browne et al. (2015) perform a K-means algorithm for initialization of the EM algorithm.
Another point worth mentioning is that when Fraley et al. (2012) wrote their R package
mclust, based on a paper by Fraley and Raftery (2002), they allowed for 10 possible models.
Browne et al. (2015) include all 14 possible models. When we say models, we mean different
types of covariance matrices, listed below.
1. ”EII”: All groups have the same diagonal covariance matrix, with the same variance
for all variables.
2. ”VII”: Different diagonal covariance matrices, with the same variance for all variables
within each group.
7. ”EEE”: All covariance matrices are the same.
8. ”EEV”: Different covariance matrices, but with the same determinant and in addition,
if we make them have determinant 1, they will have the same trace.
9. ”VEV”: Different covariance matrices but if we make the matrices have determinant 1,
then they will have the same trace.
11. ”EVE”: Different covariance matrices, but with the same determinant. In addition,
calculate the eigenvectors for each covariance matrix and you will see the extra simi-
larities.
12. ”VVE”: Different covariance matrices, but they have something in common with their
directions. Calculate the eigenvectors of each covariance matrix and you will see the
similarities.
13. ”VEE”: Different covariance matrices, but if we make the matrices have determinant
1, then they will have the same trace. In addition, calculate the eigenvectors for each
covariance matrix and you will see the extra similarities.
14. ”EVV”: Different covariance matrices, but with the same determinant.
As we can see, there are many combinations of similarities when the covariance matrices
are diagonal and non-diagonal. Below is the function which utilises the gpcm function
within the mixture R package.
if (type == "alr") {
y <- log(x[, -p]/x[, p])
} else {
y0 <- log(x)
y1 <- y0 - Rfast::rowmeans( y0 )
y <- tcrossprod( y1, helm(p) )
BIC is used to choose the optimal model. So, first one has to run the next function (bic.mixcompnorm)
and see which model has the lowest BIC and then use the mix.compnorm function for model
based clustering.
if (type == "alr") {
y <- log(x[, -p]/x[, p])
} else {
y0 <- log(x)
y1 <- y0 - Rfast::rowmeans( y0 )
y <- tcrossprod( y1, helm(p) )
}
In order to simulate random values from a normal mixture model for compositional data,
the following R code is to be used.
}
if (type == "alr") {
x1 <- cbind(1, exp(x) )
x <- x1 / Rfast::rowsums(x1)
} else {
x1 <- tcrossprod( x, helm( p + 1) )
x2 <- exp(x1)
x <- x2 / Rfast::rowsums( x2 )
}
## x is the simulated data
## data come from the first cluster, then from the second and so on
list(id = ina, x = x)
}
where x, w ∈ S^d.
Endres and Schindelin (2003) and Österreicher and Vajda (2003) proved, independently,
that (8.33) satisfies the triangle inequality and thus it is a metric. For this reason we will refer
to it as the ES-OV metric.
We will use the power transformation (8.15) to define a more general metric, termed the ES-OV_α metric,
$$\text{ES-OV}_\alpha(\mathbf{x}, \mathbf{w}) = \left[ \sum_{i=1}^{D} \left( \frac{x_i^\alpha}{\sum_{j=1}^{D} x_j^\alpha} \log \frac{2 x_i^\alpha / \sum_{j=1}^{D} x_j^\alpha}{\frac{x_i^\alpha}{\sum_{j=1}^{D} x_j^\alpha} + \frac{w_i^\alpha}{\sum_{j=1}^{D} w_j^\alpha}} + \frac{w_i^\alpha}{\sum_{j=1}^{D} w_j^\alpha} \log \frac{2 w_i^\alpha / \sum_{j=1}^{D} w_j^\alpha}{\frac{x_i^\alpha}{\sum_{j=1}^{D} x_j^\alpha} + \frac{w_i^\alpha}{\sum_{j=1}^{D} w_j^\alpha}} \right) \right]^{1/2}. \quad (8.34)$$
The taxicab metric, also known as the L1 (or Manhattan) metric, is defined as
$$\text{TC}(\mathbf{x}, \mathbf{w}) = \sum_{i=1}^{D} |x_i - w_i|. \quad (8.35)$$
We will again employ the power transformation (8.15) to define a more general metric, which
we will term the TC_α metric,
$$\text{TC}_\alpha(\mathbf{x}, \mathbf{w}) = \sum_{i=1}^{D} \left| \frac{x_i^\alpha}{\sum_{j=1}^{D} x_j^\alpha} - \frac{w_i^\alpha}{\sum_{j=1}^{D} w_j^\alpha} \right|. \quad (8.36)$$
The last two power-transformed metrics were suggested by Tsagris (2014), but only the
case of α = 1 in (8.34) was examined in Tsagris et al. (2016b). Aitchison (2003) suggested the
Euclidean metric applied to the log-ratio transformed data as a measure of distance between
compositions,
$$\text{Ait}(\mathbf{x}, \mathbf{w}) = \left[ \sum_{i=1}^{D} \left( \log \frac{x_i}{g(\mathbf{x})} - \log \frac{w_i}{g(\mathbf{w})} \right)^2 \right]^{1/2}, \quad (8.37)$$
where g(·) denotes the geometric mean of the composition,
2. or the angular metric, if we treat compositional data as directional data (for more information
about this approach see Stephens (1982) and Scealy and Welsh (2014, 2011a,b)),
$$\text{Ang}(\mathbf{x}, \mathbf{w}) = \arccos \left( \sum_{i=1}^{D} \sqrt{x_i} \sqrt{w_i} \right). \quad (8.39)$$
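For concreteness, two of the distances above, the TC_α metric (8.36) and the angular distance (8.39), can be written as small helpers; these are illustrative sketches, not the package functions.

tc.alpha <- function(x, w, a = 1) {
  ## TC_alpha metric (8.36): taxicab distance between the power-transformed vectors
  sum( abs( x^a / sum(x^a) - w^a / sum(w^a) ) )
}
ang.dist <- function(x, w) {
  ## angular distance (8.39), treating the square-rooted compositions as unit vectors
  acos( sum( sqrt(x) * sqrt(w) ) )
}
tc.alpha( c(0.2, 0.3, 0.5), c(0.25, 0.25, 0.5), a = 0.5 )
ang.dist( c(0.2, 0.3, 0.5), c(0.25, 0.25, 0.5) )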
1. Separate the data into the training and the test dataset.
3. There are two possibilities here (not mentioned in Tsagris (2014)). One can use either
the standard version or the non-standard version of the algorithm.
(a) Standard version. Calculate the distances of a point in the test set z0 from all the
points in the training set (there are ways to avoid all these calculations, but I did it
this way) and keep the k points in the training set which have the k smallest distances.
Allocate the point z0 to the group which has the most of these k points. In case of
ties, for example two observations from group 1 and two observations from group
2, do the allocation randomly (again, there are better ways, instead of randomness).
(b) Non-Standard version. This is what has been done in Tsagris (2014). Calculate
the distances of z0 from all the points in the training set. For every group, keep
the k points with the smallest distances and then calculate the average of these k
distances for each group. Allocate z0 to the group with the minimum average of
k distances. Another option offered here is to use the median of the k distances.
4. Classify the test data using either the ES-OVα (8.34), the TCα (8.36) for a range of values
of α and each time calculate the percentage of correct classification. Alternatively, if
the Aitchisonian (8.37), the Hellinger (8.38) or the angular distance (8.39) are used, the
value of α is 1.
6. Repeat steps 1 − 4 B (I did it with B = 200) times and for each α and k and estimate the
percentage of correct classification by averaging over all B times.
The next function takes some compositional data whose groupings are known. For given
values of α and for one or more values of k it will allocate some new data. You can specify
which version to use, the standard or the non-standard, which metric of the aforementioned
to use and whether the mean or the median will be used (in the case of the non-standard
version).
if ( apostasi == "taxicab" ) {
xa <- x^a
zx <- xa / Rfast::rowsums( xa ) ## The power transformation is applied
za <- xnew^a
znew <- za / Rfast::rowsums( za ) ## The power transformation is applied
g <- Rfast::knn(znew, ina, zx, k = k, dist.type = "manhattan", type = "C",
freq.option = 1)
} else if ( apostasi == "Ait" ) {
xa <- Rfast::Log(x)
zx <- xa - Rfast::rowmeans( xa )
za <- Rfast::Log(xnew)
znew <- za - Rfast::rowmeans( za )
g <- Rfast::knn(znew, ina, zx, k = k, dist.type = "euclidean", type = "C",
freq.option = 1)
} else {
## all other methods
if ( apostasi == "taxicab" ) {
xa <- x^a
zx <- xa / Rfast::rowsums( xa ) ## The power transformation is applied
za <- xnew^a
znew <- za / Rfast::rowsums( za ) ## The power transformation is applied
disa <- Rfast::dista(znew, zx, "manhattan", trans = FALSE)
} else if ( apostasi == "angular" ) {
zx <- sqrt(x)
znew <- sqrt(xnew)
disa <- tcrossprod(zx, znew )
disa[disa >= 1] <- 1
disa <- acos(disa)
if (type == "NS") { ## Non Standard algorithm
ta <- matrix(nrow = nu, ncol = nc)
for (m in 1:nc) {
apo <- disa[ina == m, ]
apo <- Rfast::sort_mat(apo)
if ( mesos ) {
ta[, m] <- Rfast::colmeans( apo[1:k, , drop = FALSE] )
} else ta[, m] <- Rfast::colhameans( apo[1:k, , drop = FALSE] )
}
g <- as.matrix( Rfast::rowMins(ta) )
} else { ## if type is "S" ## Standard algorithm
for (j in 1:klen) {
g1 <- Rfast::colnth( disa, rep( k[j], nu) )
for (l in 1:nu) {
ind <- which(disa[, l] <= g1[l] )
a <- Rfast::Table( ina[ind] )
b <- as.numeric( names(a) )
g[l, j] <- b[which.max(a)]
} ## end inner for
} ## end outer for
} ## end if (type == "NS")
} ## end if (length(k) == 1)
} ## end of other methods
colnames(g) <- paste("k=", k, sep = "")
g
}
In order to choose the optimal pair of α and k, the metric and any extra arguments, you
can use the next function. This function requires the fields library for the graphics, created
by Nychka et al. (2015); nevertheless, R has a built-in function for image plots should
you wish not to download this package.
The idea is to use a grid of values of α and k and for every combination of these two
values to estimate the percentage of correct classification. Then, you can do the same for
the different metrics, the mean or the median, and so on. The estimation of the rate of correct
classification is done in the same way as in the previous functions. K-fold cross-validation
is performed and the estimated bias of the rate of correct classification (Tibshirani and
Tibshirani, 2009) is subtracted from the highest estimated rate.
compknn.tune <- function(x, ina, nfolds = 10, A = 5, type = "S", mesos = TRUE,
a = seq(-1, 1, by = 0.1), apostasi = "ESOV", folds = NULL,
stratified = FALSE, seed = FALSE, graph = FALSE) {
n <- dim(x)[1] ## sample size
ina <- as.numeric(ina)
if ( A >= min(table(ina)) ) A <- min( table(ina) ) - 3 ## The maximum
## number of nearest neighbours to use
if ( min(x) == 0 ) a <- a[ a > 0 ]
if ( is.null(folds) ) folds <- Compositional::makefolds(ina, nfolds = nfolds,
stratified = stratified, seed = seed)
nfolds <- length(folds)
## The algorithm is repeated R times and each time the estimated
## percentages are stored in the array per.
if (apostasi == "ESOV" | apostasi == "taxicab" | apostasi == "CS") {
a <- a[ a != 0 ]
}
per <- array( dim = c(nfolds, A - 1, length(a)) )
runtime <- proc.time()
} else {
ela <- t( colMeans(per) )
## The ela matrix contains the averages of the R
## repetitions over alpha and k
colnames(ela) <- paste("k=", 2:A, sep = "")
rownames(ela) <- paste("alpha=", a, sep = "")
## The code for the heat plot of the estimated percentages
if (graph) fields::image.plot(a, 2:A, ela, col = grey(1:11/11),
ylab = "k nearest-neighbours", xlab = expression(paste(alpha,
" values")), cex.lab = 1.3 )
The function above allows only one processor to be used. If your computer has more
cores, then you can use the next function, which calls the previous function. Also, make sure
that R (the number of repetitions in the cross validation) is a multiple of nc, the number of
cores to be used.
The idea is the same as before, but now we will use the α-distance (8.20) as a measure of
distance between two compositional vectors. We remind the reader that the α-distance con-
verges to the Euclidean distance applied to the centred log-ratio transformed data (8.21). See
the relevant paper also (Tsagris et al., 2016b).
## Both of these apply for the non-standard,
## algorithm, that is when type=NS
## xnew is the new dataset. It can be a single vector or a matrix
p <- dim(x)[2]
xnew <- as.matrix(xnew)
xnew <- matrix(xnew, ncol = p) ## makes sure xnew is a matrix
ina <- as.numeric(ina)
nc <- max(ina) ## The number of groups
nu <- dim(xnew)[1]
znew <- alfa(xnew, a, h = FALSE)$aff
z <- alfa(x, a, h = FALSE)$aff
if (type == "NS") {
## Non Standard algorithm
klen <- length(k)
g <- matrix(0, nu, klen)
ta <- matrix(nrow = nu, ncol = nc)
apo <- list()
for (m in 1:nc) {
disa <- Rfast::dista(znew, z[ina == m,], type = apostasi, trans = FALSE)
apo[[ m ]] <- Rfast::sort_mat(disa)[1:max(k), ]
}
for (j in 1:klen) {
for (m in 1:nc) {
if ( mesos ) {
ta[, m] <- Rfast::colmeans( apo[[ m ]][1:k[j], , drop = FALSE] )
} else ta[, m] <- Rfast::colhameans( apo[[ m ]][1:k[j], , drop = FALSE] )
}
g[, j] <- Rfast::rowMins(ta)
}
The next function tunes the parameters α and k via cross-validation.
if ( type == "S" ) {
runtime <- proc.time()
folds <- list()
for (i in 1:M) folds[[ i ]] <- mat[, i]
## Standard algorithm
for (i in 1:length(a) ) {
z <- alfa(x, a[i], h = FALSE)$aff
ela[i, ] <- Rfast::knn.cv(folds = folds, nfolds = M, y = ina, x = z, k = 2:A,
dist.type = apostasi, type = "C", freq.option = 1)$crit
}
runtime <- proc.time() - runtime
if ( graph ) fields::image.plot(a, 2:A, ela, col = grey(1:11/11),
ylab = "k nearest-neighbours", xlab = expression(paste(alpha, " values")) )
opt <- max(ela)
confa <- as.vector( which(ela == opt, arr.ind = TRUE)[1, ] )
res <- list( ela = ela, performance = max(ela), best_a = a[ confa[1] ],
best_k = confa[2] + 1, runtime = runtime )
## Non standard method
} else {
per <- array( dim = c( M, A - 1, length(a) ) ) ## The estimated percentages
for ( i in 1:length(a) ) {
for (vim in 1:M) {
id <- ina[ mat[, vim] ] ## groups of test sample
ina2 <- ina[ -mat[, vim] ] ## groups of training sample
aba <- as.vector( mat[, vim] )
aba <- aba[aba > 0]
g <- alfa.knn(x[aba, ], x[-aba, ], ina = ina2, a = a[i], k = 2:A, type = "NS",
mesos = mesos, apostasi = apostasi)
be <- g - id
per[vim, , i] <- Rfast::colmeans(be == 0)
}
}
for ( i in 1:length(a) ) ela[i, ] <- colMeans(per[, , i])
runtime <- proc.time() - runtime
if ( graph ) fields::image.plot(a, 2:A, ela, col = grey(1:11/11),
ylab = "k nearest-neighbours", xlab = expression(paste(alpha, " values")) )
opt <- max(ela)
confa <- as.vector( which(ela == opt, arr.ind = TRUE)[1, ] )
bias <- numeric(M)
for (i in 1:M) bias[i] <- opt - per[ i, confa[2], confa[1] ]
bias <- mean(bias)
performance <- c(opt - bias, bias)
names(performance) <- c( "rate", "bias" )
res <- list( ela = ela, performance = performance, best_a = a[ confa[1] ],
best_k = confa[2] + 1, runtime = runtime )
} ## end if (type == "S")
res
Tsagris et al. (2016b) proposed a more general discriminant analysis for compositional data
by employing the α-transformation (8.15). The idea is simple, apply the α-transformation
and then use the classical regularised discriminant analysis we saw in Section 4.3.5. The
function below predicts the group of new observations for some given values of α, δ and γ.
## ina is the grouping variable
## a is the value of the alpha
## gam is between pooled covariance and diagonal
## gam*Spooled+(1-gam)*diagonal
## del is between QDA and LDA
## del*QDA+(1-del)*LDA
y <- alfa(x, a)$aff ## apply the alpha-transformation
ynew <- alfa(xnew, a)$aff
rda(xnew = ynew, x = y, ina = ina, gam = gam, del = del)
}
In order to tune the values of the parameters we perform a cross validation. Parallel
computing is an option in the function rda.tune and thus is an option here as well.
for ( k in 1:length(a) ) {
z <- Compositional::alfa(x, a[k])$aff ## apply the alpha-transformation
mod <- Compositional::rda.tune(x = z, ina = ina, nfolds = nfolds,
gam = gam, del = del, ncores = ncores,
folds = folds, stratified = stratified, seed = seed)
## since seed is TRUE, for every value of alpha, the same splits will occur
## thus, the percentages for all values of alpha are comparable
props[, , k] <- mod$percent
ser[, , k] <- mod$se
info[[ k ]] <- mod$per
}
9 Circular (or angular) data
Another important field of statistics is the analysis of directional data. Directional data are
data which lie on the circle, sphere and hypersphere (sphere in more than 3 dimensions).
Some reference books include Fisher (1995) and Jammalamadaka and Sengupta (2001) for
circular data, Fisher et al. (1987) for spherical data and Mardia and Mardia (1972) and Mar-
dia and Jupp (2000) for directional statistics. A more recent book (for circular statistics only)
written by Pewsey et al. (2013) contains a lot of R scripts as well. We will start with circular
data and then move on to spherical and hyperspherical data. There are also some R pack-
ages, CircStats by Lund and Agostinelli (2012), circular by Agostinelli and Lund (2011) and
NPCirc by Oliveira et al. (2013) (nonparametric smoothing methods) for circular data and
movMF by Hornik and Grün (2014) for mixtures of von Mises-Fisher distribution (circular,
spherical or hyper-spherical). The functions described here (and a few more) exist as an R
package as well Directional Tsagris et al. (2016a).
The space of directional data is such that for any vector x ∈ Rq with q ≥ 2 we have that
kXk = x T x = 1. This mean that x is a unit vector since its length is 1. The space of such
vectors will be denoted by Sq−1 . If q = 2, the x lies on a circle and if q = 3 it lies on the
surface of a sphere.
At first we start with the circular data analysis, that is, data defined on the circle. Thus
their space is denoted by S1 . Even though they can be treated as univariate I decided to in-
clude them here, first because they still need multivariate analysis some times and secondly,
because they are good to know before we proceed to the spherical and hyper-spherical data.
Given a sample of angles u_1, ..., u_n, define
$$\bar{C} = \frac{1}{n} \sum_{i=1}^{n} \cos u_i \quad \text{and} \quad \bar{S} = \frac{1}{n} \sum_{i=1}^{n} \sin u_i. \quad (9.1)$$
The sample circular mean, or mean direction, is given by (Mardia and Jupp, 2000)
$$\bar{\theta} = \begin{cases} \tan^{-1}\left(\bar{S}/\bar{C}\right) & \text{if } \bar{C} > 0 \\ \tan^{-1}\left(\bar{S}/\bar{C}\right) + \pi & \text{if } \bar{C} < 0. \end{cases}$$
If you do the above calculations in R, you will see that the outcome is not always correct.
More conditions are required; see for example Jammalamadaka and Sengupta (2001). So, to
avoid any confusion, I decided to use what Presnell et al. (1998) mention: the angular mean
will be given by
$$\hat{\theta} = \left[ \tan^{-1}\left( \frac{\boldsymbol{\beta}_2^T \mathbf{x}_i}{\boldsymbol{\beta}_1^T \mathbf{x}_i} \right) + \pi I\left(\boldsymbol{\beta}_1^T \mathbf{x}_i < 0\right) \right] \bmod 2\pi.$$
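In the simple one-sample case the same construction applies with C̄ and S̄ in place of the two linear predictors; a quick sketch in R, for angles u in radians:

u <- runif(100, 0, 2 * pi)   ## some angles in radians
C <- mean( cos(u) )
S <- mean( sin(u) )
mesos <- ( atan(S / C) + pi * (C < 0) ) %% (2 * pi)  ## sample mean direction in [0, 2*pi)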
A confidence interval for the mean direction is computed as follows, depending on the value of R̄.
• If R̄ ≤ 2/3,
$$\bar{\theta} \pm \cos^{-1}\left( \left[ \frac{2n\left(2R^2 - n\chi^2_{1,\alpha}\right)}{R^2\left(4n - \chi^2_{1,\alpha}\right)} \right]^{1/2} \right).$$
• If R̄ > 2/3,
$$\bar{\theta} \pm \cos^{-1}\left( \frac{\left[ n^2 - \left(n^2 - R^2\right)\exp\left(\chi^2_{1,\alpha}/n\right) \right]^{1/2}}{R} \right).$$
The von Mises distribution has density
$$f(u) = \frac{1}{2\pi I_0(\kappa)} e^{\kappa \cos\left(u - \bar{\theta}\right)}, \quad (9.2)$$
where I_0(κ) is the modified Bessel function of the first kind and order 0.
Estimating κ, a concentration parameter which is independent of θ̄ (much like the variance in the
normal distribution), is what we have to do. We will see this distribution again in Section 10.7.1,
but in its generalization for spherical and hyper-spherical data.
An option to plot the data on the unit circle is also given. We first construct a unit circle
and then plot the (cos (u) , sin (u)) pair of points. By default this option is set to TRUE. The
R code with these summary measures is given below.
fact <- sqrt( n^2 - (n^2 - R^2) * exp( qchisq(0.05, 1)/n ) )/R
ci <- c(mesos - acos(fact), mesos + acos(fact))
}
if ( !rads ) {
mesos <- mesos * 180/pi
ci <- ci * 180/pi
}
if ( plot ) {
r <- seq(0, 2 * pi, by = 0.01)
plot(cos(r), sin(r), type = "l", xlab = "Cosinus", ylab = "Sinus")
xx <- seq(-1, 1, by = 0.1)
yy <- seq(-1, 1, by = 0.1)
ta <- numeric(length(xx))
lines(ta, xx, type = "l", lty = 2)
lines(yy, ta, lty = 2)
points(cos(u), sin(u))
}
res <- list( mesos = mesos, confint = ci, kappa = kappa, MRL = MRL,
circvariance = circv, circstd = circs, loglik = mod$objective - n * log(2 * pi) )
}
res
}
mu <- c(cos(m), sin(m))
if (k > 0) { ## draw from a von Mises distribution
x <- rvmf(n, mu, k) ## sample from a von Mises
## x is expressed in cartesian coordinates. We have to
## transform the values into degrees or radians
u <- (atan(x[, 2]/x[, 1]) + pi * I(x[, 1] < 0)) %% (2 * pi)
} else u <- runif(n, 0, 2 * pi) ## draw from the circular uniform when k = 0
if ( !rads ) u <- u * 180/pi ## convert from radians to degrees if requested
u
}
$$\hat{f}(\theta; h) = \frac{1}{n\, 2\pi I_0\left(1/h^2\right)} \sum_{i=1}^{n} e^{\cos\left(\theta - \theta_i\right)/h^2}, \quad (9.3)$$
where n denotes the sample size and h is the bandwidth parameter whose value we have to
choose.
Taylor (2008) provided a "plug-in rule" for the value of h which is based on minimising the
asymptotic mean integrated squared error (AMISE). Taylor (2008) mentions that this is of
a similar asymptotic form as the normal-scale plug-in rule. I present the formula given by
García-Portugués (2013), because it looks better:
$$h_{\text{TAY}} = \left[ \frac{4\pi^{1/2} I_0(\hat{\kappa})^2}{3 \hat{\kappa}^2 n I_2(2\hat{\kappa})} \right]^{1/5}.$$
García-Portugués (2013) provided a new rule of thumb, which is the directional analogue of
the rule of thumb of Silverman (1986),
$$h_{\text{ROT}} = \left\{ \frac{4\pi^{1/2} I_0(\hat{\kappa})^2}{\hat{\kappa} n \left[ 2 I_1(2\hat{\kappa}) + 3\hat{\kappa} I_2(2\hat{\kappa}) \right]} \right\}^{1/5}.$$
In both cases, κ̂ is the estimated (see the circ.summary function) concentration parameter.
Alternatively, we can use maximum likelihood cross validation (5.5) just as in the multi-
variate kernel density estimation for Euclidean data we saw before. So, choose h which
maximises
$$\text{MLCV}(h) = \frac{1}{n} \sum_{i=1}^{n} \log \left[ \hat{f}_{-i}\left(x_i; h\right) \right],$$
where fˆ−i (xi ; h) is the von Mises kernel (9.3) of the i-th observation, calculated without it.
The vm.kde function calculates the density estimate of each point in a sample using either
a rule of thumb or another h.
if ( is.null(h) ) {
if (thumb == "tay") {
k <- circ.summary(u, rads = TRUE, plot = FALSE)$kappa
h <- ( (4 * pi^0.5 * besselI(k, 0)^2)/(3 * n * k^2 *
besselI(2 * k, 2)) )^0.2
} else h <- h
list( h = h, f = as.vector(f) )
}
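For completeness, the rule of thumb of García-Portugués (2013) given above can be computed in the same style (a sketch, with k the estimated concentration parameter and n the sample size):

h <- ( (4 * pi^0.5 * besselI(k, 0)^2) /
       ( n * k * ( 2 * besselI(2 * k, 1) + 3 * k * besselI(2 * k, 2) ) ) )^0.2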
The vmkde.tune chooses h using maximum likelihood cross validation (5.5).
options(warn = -1)
nc <- ncores
val <- matrix(h, ncol = nc) ## if the length of h is not equal to the
## dimensions of the matrix val a warning message should appear
## but with suppressWarnings() you will not see it
cl <- makePSOCKcluster(nc)
registerDoParallel(cl)
ww <- foreach(j = 1:nc, .combine = cbind) %dopar% {
ba <- val[, j]
for (l in 1:length(val[, j])) {
A <- expa^( 1 / ba[l]^2 )
f <- rowSums(A)/((n - 1) * 2 * pi * besselI( 1/ba[l]^2, 0) )
ba[l] <- mean( log(f) )
}
return(ba)
}
stopCluster(cl)
cv <- as.vector(ww)[1:length(h)]
if ( plot ) {
plot(h, cv, type = "l")
}
list(hopt = h[which.max(cv)], cv = cv)
}
The next code is a faster version of the previous, especially if you do not have a multi-
core computer.
The hypothesis to be tested is that the mean directions of the g samples are all equal, θ̄_1 = ⋯ = θ̄_g.
9.5.1 High concentration F test
Let us define C = nC̄ and S = nS̄, where C̄ and S̄ are defined in (9.1). The resultant length
is given by R = √(C² + S²), or equivalently R = nR̄. Let R_i denote the resultant length of each group
and R the resultant length of all the data together, assuming they are one sample. Finally,
let n_i denote the sample size of the i-th sample, where i = 1, ..., g and n = ∑_{i=1}^{g} n_i. The test
statistic is
$$F_{ww} = \frac{(n - g)\left( \sum_{i=1}^{g} R_i - R \right)}{(g - 1)\left( n - \sum_{i=1}^{g} R_i \right)}.$$
Mardia and Jupp (2000) mention that this refinement is adequate for κ ≥ 1, i.e. R̄ ≥ 0.45
and that for κ̂ ≥ 10 the refinement factor is negligible. They cite Stephens (1972) for this
information. I wrote the following R code based on the information I found in Jammala-
madaka and Sengupta (2001), who mention that if 1 ≤ κ̂ ≤ 2, the refinement can be used.
How do we estimate this common κ̂, or this pooled estimate? In order to estimate κ in
the one-sample case we have to maximise the log-likelihood of the von Mises distribution
with respect to κ. Mardia and Jupp (2000) use the following equation for estimating κ̂,
$$\frac{I_1(\hat{\kappa})}{I_0(\hat{\kappa})} = \bar{R}, \quad (9.4)$$
where R̄, I remind the reader, is the mean resultant length of the combined sample, assuming
that all samples are one sample. A solver, like uniroot in R, is used to find the solution to
(9.4).
x1 <- cos(u)
x2 <- sin(u)
Ci <- rowsum(x1, ina)
Si <- rowsum(x2, ina)
C <- sum(Ci)   ;   S <- sum(Si)
R <- sqrt(C^2 + S^2) ## the resultant length based on all the data
## Next we estimate the common concentration parameter kappa
} else {
Ft <- NA
pvalue <- NA
}
the combined sample respectively, defined as
$$\bar{\mathbf{x}}_{i.} = \frac{\bar{\mathbf{u}}_{i.}}{\|\bar{\mathbf{u}}_{i.}\|} \quad \text{and} \quad \bar{\mathbf{x}}_{..} = \frac{\bar{\mathbf{u}}_{..}}{\|\bar{\mathbf{u}}_{..}\|}.$$
The test statistic is
$$w = \hat{\kappa} \sum_{i=1}^{g} R_i \left\| \bar{\mathbf{x}}_{i.} - \bar{\mathbf{x}}_{..} \right\|^2.$$
Alternatively, we can write it as w = 2κ̂ ∑_{i=1}^{g} R_i [1 − cos(θ̄_i − θ̄)], where θ̄ is the mean direction of
the combined sample. Under the null hypothesis, w asymptotically follows a χ²_{g−1} distribution. In the
following R code I used the first form.
ni <- tabulate(ina)
mi <- rowsum(x, ina) / ni
mi <- mi/sqrt(rowSums(mi^2)) ## mean direction of each group
m <- Rfast::colmeans(x)
m <- m/sqrt(sum(m^2)) ## mean direction based on all the data
m <- matrix(rep(m, g), nrow = g, byrow = T)
R <- sqrt( sum( colSums(x)^2 ) ) ## the resultant length based on all the data
## kappa is the estimated concentration parameter based on all the data
A third test statistic mentioned in Mardia and Jupp (2000) is based on the embedding approach.
I give the test statistic straight away without too much theory (the interested reader
will see in Mardia and Jupp (2000) that it takes less than one page); it is similar to the high
concentration test statistic we saw before. Mardia and Jupp (2000) has a mistake in the form
of this statistic (Equations (7.4.15) and (7.4.16)) on page 139: the factor n is missing, and I
show the correct form here.
$$F = \frac{(n - g)\left( \sum_{i=1}^{g} n_i \bar{R}_i^2 - n\bar{R}^2 \right)}{(g - 1)\left( n - \sum_{i=1}^{g} n_i \bar{R}_i^2 \right)}$$
x2 <- sin(u)
Ci <- rowsum(x1, ina)
Si <- rowsum(x2, ina)
Rbi <- sqrt( Ci^2 + Si^2 )/ni
C <- sum(Ci)
S <- sum(Si)
Rbar <- sqrt(C^2 + S^2)/n ## the mean resultant length based on all the data
## Next we estimate the common concentration parameter kappa
Mardia and Jupp (2000) provides a test for testing the equality of the concentration param-
eter among g samples, where g ≥ 2. There are three distinct cases, based on the value of R̄,
the mean resultant length of all data.
• Case I. R̄ < 0.45. The test statistic has the following form:
$$U_1 = \sum_{i=1}^{g} w_i\, g_1\left(2\bar{R}_i\right)^2 - \frac{\left[ \sum_{i=1}^{g} w_i\, g_1\left(2\bar{R}_i\right) \right]^2}{\sum_{i=1}^{g} w_i},$$
where $w_i = \frac{4(n_i - 4)}{3}$, with $n_i$ denoting the sample size of the i-th group, $g_1(x) = \sin^{-1}\left(\sqrt{\tfrac{3}{8}}\, x\right)$, and R̄_i is the mean resultant length of the i-th group.
• Case II. 0.45 ≤ R̄ ≤ 0.70. The test statistic is
$$U_2 = \sum_{i=1}^{g} w_i\, g_2\left(\bar{R}_i\right)^2 - \frac{\left[ \sum_{i=1}^{g} w_i\, g_2\left(\bar{R}_i\right) \right]^2}{\sum_{i=1}^{g} w_i},$$
where $w_i = \frac{n_i - 3}{0.798}$ and $g_2(x) = \sinh^{-1}\left(\frac{x - 1.089}{0.258}\right)$.
• Case III. R̄ > 0.70. For this high concentration case the test statistic is
$$U_3 = \frac{1}{1 + d} \left[ \nu \log \left( \frac{n - \sum_{i=1}^{g} R_i}{\nu} \right) - \sum_{i=1}^{g} \nu_i \log \left( \frac{n_i - R_i}{\nu_i} \right) \right],$$
where $\nu_i = n_i - 1$, $\nu = n - g$ and $d = \frac{1}{3(g-1)} \left( \sum_{i=1}^{g} \frac{1}{\nu_i} - \frac{1}{\nu} \right)$.
} else if (Rb >= 0.45 & Rb <= 0.7) {
## case 2
g2 <- wi <- numeric(g)
wi <- (ni - 3)/0.798
g2 <- asinh( (Rb - 1.089) / 0.258 )
U2 <- sum( wi * g2^2) - (sum(wi * g2) )^2/sum(wi)
pvalue <- pchisq(U2, g - 1, lower.tail = FALSE)
mess <- paste('The mean resultant length is between 0.45 and 0.7.
U2 was calculated')
9.5.5 Tangential approach for testing the equality of the concentration parameters
Mardia and Jupp (2000) refer to Fisher (1995) who recommends this test on the grounds of its
robustness against outliers and departure from the von Mises distribution. The test statistic
is
$$F = \frac{(n - g) \sum_{i=1}^{g} n_i \left( \bar{d}_i - \bar{d} \right)^2}{(g - 1) \sum_{i=1}^{g} \sum_{j=1}^{n_i} \left( d_{ij} - \bar{d}_i \right)^2},$$
where
$$d_{ij} = \left| \sin\left( u_{ij} - \bar{\theta}_i \right) \right|, \quad j = 1, \ldots, n_i,$$
$$\bar{d}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} d_{ij} \quad \text{and} \quad \bar{d} = \frac{1}{n} \sum_{i=1}^{g} \sum_{j=1}^{n_i} d_{ij}.$$
9.5.6 Analysis of variance without assuming equality of the concentration parameters
For the heterogeneous case, when the concentration parameters cannot be assumed to be
equal, Mardia and Jupp (2000) provides us with a test statistic. The form of this statistic is
$$T = 2\left( \sum_{i=1}^{g} \hat{\kappa}_i R_i - R_w \right),$$
where
$$R_w = \left[ \left( \sum_{i=1}^{g} \hat{\kappa}_i R_i \cos \bar{\theta}_i \right)^2 + \left( \sum_{i=1}^{g} \hat{\kappa}_i R_i \sin \bar{\theta}_i \right)^2 \right]^{1/2}$$
and the κ̂i and Ri quantities are estimated for each sample separately. Under H0 , T follows
a χ2g−1 distribution. Mardia and Jupp (2000) informs us that this test was introduced by
Watson (1983a).
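## a sketch (not the original lines) of how Ta can be formed, assuming ki and Ri
## hold the estimated concentration parameters and resultant lengths of the g
## groups, and Ci and Si the per-group sums of cosines and sines
Rw <- sqrt( sum(ki * Ci)^2 + sum(ki * Si)^2 )
Ta <- 2 * ( sum(ki * Ri) - Rw )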
pvalue <- pchisq(Ta, g - 1, lower.tail = FALSE)
res <- c(Ta, pvalue)
names(res) <- c(’test’, ’p-value’)
res
Jammalamadaka and Sarma (1988) suggested a correlation coefficient for a sample of pairs of angular data $(\alpha_i, \beta_i)$ with $i = 1, \ldots, n$. The correlation is defined as
\[
r_c = \frac{\sum_{i=1}^{n} \sin \left( \alpha_i - \bar{\alpha} \right) \sin \left( \beta_i - \bar{\beta} \right)}{\sqrt{\sum_{i=1}^{n} \sin^2 \left( \alpha_i - \bar{\alpha} \right) \sum_{i=1}^{n} \sin^2 \left( \beta_i - \bar{\beta} \right)}},
\]
where $\bar{\alpha}$ and $\bar{\beta}$ are the mean directions of the two samples. We saw in the previous section how to calculate them. Jammalamadaka and Sengupta (2001) states that under a suitable transformation we can get asymptotic normality and thus test the hypothesis of zero correlation. If the sample size n is large enough, then under the null hypothesis that the true correlation is zero we have that
\[
\sqrt{n} \, \sqrt{\frac{\hat{\lambda}_{02} \hat{\lambda}_{20}}{\hat{\lambda}_{22}}} \; r_c \sim N(0, 1),
\]
where
\[
\hat{\lambda}_{ij} = \frac{1}{n} \sum_{k=1}^{n} \sin^i \left( \alpha_k - \bar{\alpha} \right) \sin^j \left( \beta_k - \bar{\beta} \right).
\]
This is an asymptotic normality based test and below I provide the relevant R code.
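A minimal sketch of such a function (the name circ.cor1 and the exact output format are my own choices) is the following.
circ.cor1 <- function(a, b, rads = FALSE) {
  ## a and b are the two samples of angles
  if ( !rads ) {
    a <- a * pi/180
    b <- b * pi/180
  }
  n <- length(a)
  ## mean directions of the two samples
  abar <- atan2( sum( sin(a) ), sum( cos(a) ) )
  bbar <- atan2( sum( sin(b) ), sum( cos(b) ) )
  sa <- sin(a - abar)
  sb <- sin(b - bbar)
  rho <- sum(sa * sb) / sqrt( sum(sa^2) * sum(sb^2) )   ## the correlation
  lam22 <- mean(sa^2 * sb^2)
  lam02 <- mean(sb^2)
  lam20 <- mean(sa^2)
  zrho <- sqrt(n * lam02 * lam20 / lam22) * rho          ## asymptotically N(0, 1)
  pvalue <- 2 * pnorm( abs(zrho), lower.tail = FALSE )
  c(rho = rho, "p-value" = pvalue)
}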
Mardia and Jupp (2000) mention another correlation of pairs of circular variables $\Theta$ and $\Phi$. They say that it is a measure of dependence between $\mathbf{u}$ and $\mathbf{v}$, where $\mathbf{u} = (\cos \Theta, \sin \Theta)^T$ and $\mathbf{v} = (\cos \Phi, \sin \Phi)^T$. This is a squared correlation coefficient, so it only takes positive values, and is defined as
\[
r^2 = \frac{r_{cc}^2 + r_{cs}^2 + r_{sc}^2 + r_{ss}^2 + 2 \left( r_{cc} r_{ss} + r_{cs} r_{sc} \right) r_1 r_2 - 2 \left( r_{cc} r_{cs} + r_{sc} r_{ss} \right) r_2 - 2 \left( r_{cc} r_{sc} + r_{cs} r_{ss} \right) r_1}{\left( 1 - r_1^2 \right) \left( 1 - r_2^2 \right)}, \tag{9.6}
\]
where $r_{cc} = \mathrm{corr} (\cos \theta, \cos \phi)$, $r_{cs} = \mathrm{corr} (\cos \theta, \sin \phi)$, $r_{sc} = \mathrm{corr} (\sin \theta, \cos \phi)$, $r_{ss} = \mathrm{corr} (\sin \theta, \sin \phi)$, $r_1 = \mathrm{corr} (\cos \theta, \sin \theta)$ and $r_2 = \mathrm{corr} (\cos \phi, \sin \phi)$.
if ( !rads ) {
theta <- theta * pi/180
phi <- phi * pi/180
}
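## a sketch (not part of the original listing) of the remaining lines, following (9.6)
ct <- cos(theta)   ;   st <- sin(theta)
cf <- cos(phi)     ;   sf <- sin(phi)
rcc <- cor(ct, cf)   ;   rcs <- cor(ct, sf)
rsc <- cor(st, cf)   ;   rss <- cor(st, sf)
r1 <- cor(ct, st)    ;   r2 <- cor(cf, sf)
up <- rcc^2 + rcs^2 + rsc^2 + rss^2 + 2 * (rcc * rss + rcs * rsc) * r1 * r2 -
  2 * (rcc * rcs + rsc * rss) * r2 - 2 * (rcc * rsc + rcs * rss) * r1
rho <- up / ( (1 - r1^2) * (1 - r2^2) )   ## the squared correlation coefficient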
Mardia and Jupp (2000) mention a correlation coefficient when we have a euclidean variable (X) and a circular variable (Θ). The formula is the following
\[
R_{x\theta}^2 = \frac{r_{xc}^2 + r_{xs}^2 - 2 r_{xc} r_{xs} r_{cs}}{1 - r_{cs}^2},
\]
where $r_{xc} = \mathrm{corr} (x, \cos \theta)$, $r_{xs} = \mathrm{corr} (x, \sin \theta)$ and $r_{cs} = \mathrm{corr} (\cos \theta, \sin \theta)$ are the classical Pearson sample correlation coefficients.
If X and Θ are independent and X is normally distributed then
\[
\frac{(n-3) R_{x\theta}^2}{1 - R_{x\theta}^2} \sim F_{2, n-3}.
\]
Since the F distribution is an asymptotic result, we can also use a non-parametric bootstrap to calculate the p-value. In the following R function the bootstrap is not implemented, but the code works for many euclidean variables: if, for example, a matrix is supplied, all the correlations of the circular variable with the euclidean variables will be calculated.
## linear-circular correlation
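## a sketch (not the original lines) of how R2xt is computed from the formula
## above; x is the euclidean variable (or matrix) and theta the angles in radians
rxc <- as.numeric( cor(x, cos(theta)) )
rxs <- as.numeric( cor(x, sin(theta)) )
rcs <- cor( cos(theta), sin(theta) )
R2xt <- ( rxc^2 + rxs^2 - 2 * rxc * rxs * rcs ) / (1 - rcs^2)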
Ft <- (n - 3) * R2xt/(1 - R2xt) ## F-test statistic value
pvalue <- pf(Ft, 2, n - 3, lower.tail = FALSE)
res <- cbind(R2xt, pvalue)
colnames(res) <- c(’R-squared’, ’p-value’)
res
Fisher and Lee (1992) used the von Mises distribution (defined on the circle) to link the mean
of some angular data with a covariate. This means that the response variable is a circular
variable and the explanatory variables are not defined on the circle.
The density of the von Mises distribution was defined in (9.2). Fisher and Lee (1992)
suggested two models. The first one models the mean direction only and the second (the
mixed one) models the concentration parameter as well. In the first model the mean angle $\mu$ is linked with the explanatory variables $\mathbf{X} = \left( x_1, \ldots, x_p \right)^T$ via
\[
\mu = \alpha + g \left( \boldsymbol{\beta}^T \mathbf{X} \right), \quad \text{where} \quad g(x) = 2 \tan^{-1}(x).
\]
In the mixed model case the concentration parameter is also linked with the explanatory variables via an exponential function, to ensure that it always stays positive
\[
\kappa = e^{\gamma + \boldsymbol{\delta}^T \mathbf{X}}.
\]
The estimates of the parameters are obtained via numerical optimisation of the log-likelihood of the von Mises distribution (9.2). We decided not to include an R function though, since this model has some numerical problems (Pewsey et al., 2013). We mention it, though, so that the reader is aware of this model as well. The package circular offers this type of regression.
Presnell et al. (1998) used the projected bivariate normal (Watson, 1983b) to perform circular regression. The density of the projected normal in the circular case can be written as
\[
f(\theta) = \frac{e^{-\frac{\gamma^2}{2}}}{2\pi} \left[ 1 + \frac{\gamma \cos (\theta - \omega) \, \Phi \left( \gamma \cos (\theta - \omega) \right)}{\phi \left( \gamma \cos (\theta - \omega) \right)} \right], \tag{9.7}
\]
where $\theta$ represents the angle, $\omega$ is the mean direction and $\Phi(\cdot)$ and $\phi(\cdot)$ are the standard normal distribution and density functions respectively. An estimate of the concentration parameter is given by $\gamma = \left\| \boldsymbol{\mu} \right\|$ (Presnell et al., 1998), where $\boldsymbol{\mu} = \left( \frac{1}{n} \sum_{i=1}^{n} \cos u_i, \frac{1}{n} \sum_{i=1}^{n} \sin u_i \right)$ is the mean vector of the data in the Euclidean coordinates. However, I did some simulations and I think Presnell et al. (1998) was a bit wrong: the estimate of the concentration parameter is $\gamma = \left\| \boldsymbol{\mu} \right\|^2$. Generate some data from the von Mises distribution using the rvonmises function we saw before, apply the circ.summary function we saw before and the spml function given below, and you will see some similarities. This is not random; it was noticed in Watson (1983b).
However, (9.7) is the expression of the projected normal in the circular case. The general form of the density is
\[
f(\mathbf{u}) = \frac{e^{-\frac{\gamma^2}{2}}}{2\pi} \left[ 1 + \frac{\gamma \mathbf{u}^T \boldsymbol{\eta} \, \Phi \left( \gamma \mathbf{u}^T \boldsymbol{\eta} \right)}{\phi \left( \gamma \mathbf{u}^T \boldsymbol{\eta} \right)} \right]. \tag{9.8}
\]
We will write its associated log-likelihood as
\[
\ell (\mathbf{B}) = -\frac{1}{2} \sum_{i=1}^{n} \boldsymbol{\mu}_i^T \boldsymbol{\mu}_i + \sum_{i=1}^{n} \log \left[ 1 + \frac{\mathbf{u}_i^T \boldsymbol{\mu}_i \, \Phi \left( \mathbf{u}_i^T \boldsymbol{\mu}_i \right)}{\phi \left( \mathbf{u}_i^T \boldsymbol{\mu}_i \right)} \right] - n \log (2\pi),
\]
where $\boldsymbol{\mu}_i = \mathbf{B}^T \mathbf{x}_i$ is the bivariate mean vector of the projected normal, linearly linked with some covariates $\mathbf{x}$, $\mathbf{B}$ is the matrix of parameters and n is the sample size. Thus, in order to apply the projected normal bivariate linear model we must first bring the angles $\theta_i$ onto the circle as $\mathbf{u}_i = \left( \cos \theta_i, \sin \theta_i \right)$.
The matrix of the parameters is a $(p+1) \times 2$ matrix, where p is the number of independent variables
\[
\mathbf{B} = \left( \boldsymbol{\beta}_1, \boldsymbol{\beta}_2 \right) = \begin{pmatrix} \beta_{01} & \beta_{02} \\ \beta_{11} & \beta_{12} \\ \vdots & \vdots \\ \beta_{p1} & \beta_{p2} \end{pmatrix}.
\]
Let me now explain what these matrices stand for. $\mathbf{M}$ is a diagonal matrix
\[
\mathbf{M} = \mathbf{M} \left( \hat{\mathbf{B}} \right) = \mathrm{diag} \left\{ \dot{\psi} \left( \mathbf{u}_1^T \mathbf{B}^T \mathbf{x}_1 \right), \ldots, \dot{\psi} \left( \mathbf{u}_n^T \mathbf{B}^T \mathbf{x}_n \right) \right\},
\]
with $\dot{\psi}(t) = t + \frac{\Phi(t)}{\phi(t) + t \Phi(t)}$, where $\phi(t)$ and $\Phi(t)$ are the probability density and the cumulative distribution functions respectively of the standard normal distribution. The matrix $\mathbf{U}$ contains two columns
\[
\mathbf{U} = \left( \mathbf{u}_1, \ldots, \mathbf{u}_n \right)^T,
\]
where the $\mathbf{u}_i$ are the cosines and sines of the angles, mentioned before.
The Hessian matrix (second derivative of the log-likelihood) is given by
\[
\ddot{\ell} (\mathbf{B}) = \begin{pmatrix} \mathbf{X}^T & \mathbf{0} \\ \mathbf{0} & \mathbf{X}^T \end{pmatrix} \begin{pmatrix} -1 + \ddot{\psi} \left( \mathbf{u}^T \boldsymbol{\mu} \right) \mathbf{C}^2 & \ddot{\psi} \left( \mathbf{u}^T \boldsymbol{\mu} \right) \mathbf{C} \mathbf{S} \\ \ddot{\psi} \left( \mathbf{u}^T \boldsymbol{\mu} \right) \mathbf{C} \mathbf{S} & -1 + \ddot{\psi} \left( \mathbf{u}^T \boldsymbol{\mu} \right) \mathbf{S}^2 \end{pmatrix} \begin{pmatrix} \mathbf{X} & \mathbf{0} \\ \mathbf{0} & \mathbf{X} \end{pmatrix},
\]
where
\[
\ddot{\psi}(t) = 2 - \frac{t \, \Phi(t)}{\phi(t) + t \Phi(t)} - \left[ \frac{\Phi(t)}{\phi(t) + t \Phi(t)} \right]^2
\]
and $\mathbf{C} = \left[ \cos \theta_1, \ldots, \cos \theta_n \right]$ and $\mathbf{S} = \left[ \sin \theta_1, \ldots, \sin \theta_n \right]$. The $2(p+1) \times 2(p+1)$ (we implement a bivariate regression with p covariates and 1 constant term) matrix with the standard errors is given by the inverse of the negative Hessian matrix
\[
\mathrm{Var} (\mathbf{B}) = \left[ -\ddot{\ell} (\mathbf{B}) \right]^{-1}.
\]
As for a measure of fit of the model we provide a pseudo R2 suggested by Lund (1999).
We calculate the circular correlation coefficient (9.5) between the observed and the estimated
angles, show the p-value of the hypothesis of a zero correlation and then square the corre-
lation. This serves as an analogue of the R2 in the classical linear models. Actually the
paper by Lund (1999) describes another type of circular regression model, which we will
not present here (at the moment) but the reader is encouraged to have a look. In addition
you can predict the circular value of some new data if you want.
tau <- rowsums(u * mu)
ptau <- pnorm(tau)
i <- 2
## only the while loop
while ( sum( abs(B2 - B1) ) > tol ) {
i <- i + 1
B1 <- B2
mu <- x %*% B1
tau <- rowsums(u * mu)
ptau <- pnorm(tau)
rat <- ptau / ( exp(f * tau^2)/con + tau * ptau )
psit <- tau + rat
psit2 <- 2 - tau * rat - rat^2
der <- as.vector( crossprod(x, - mu + psit * u) )
a11 <- crossprod(x, x * (psit2 * ci^2 - 1) )
a12 <- crossprod(x, x * (psit2 * ci * si ) )
a22 <- crossprod(x, x * (psit2 * si^2 - 1 ) )
der2 <- cbind( rbind(a11, a12), rbind(a12, a22) )
B2 <- B1 - solve(der2, der)
}
###
if ( seb ) {
seb <- sqrt( diag( solve( -der2 ) ) )
seb <- matrix(seb, ncol = 2)
colnames(seb) <- c("Cosinus of y", "Sinus of y")
rownames(seb) <- colnames(x)
} else seb <- NULL
loglik <- - 0.5 * sum( mu^2 ) + sum( log1p( tau * ptau * con / exp(f * tau^2) ) ) -
n * log(2 * pi)
colnames(B2) <- c("Cosinus of y", "Sinus of y")
rownames(B2) <- colnames(x)
list(iters = i, loglik = loglik, be = B2, seb = seb)
}
10 (Hyper-)spherical data
We continue with (hyper)spherical data analysis. Note that these techniques can also be applied to circular data. For example, the von Mises-Fisher distribution in two dimensions is simply the von Mises distribution. Thus, the following functions regarding the von Mises-Fisher distribution can also be used for the von Mises. The space here is $S^2$ if we are on the sphere and $S^{q-1}$ if we are on the hypersphere. The functions described here (and a few more) exist as an R package as well, called Directional (Tsagris et al., 2016a).
\[
\mathbf{u} = (x, y, z) = \left[ \cos (lat), \; \sin (lat) \cos (long), \; \sin (lat) \sin (long) \right]
\]
At first we have to transform the latitude and longitude from degrees to radians and then apply the change to Euclidean coordinates. Note that the vector $\mathbf{u}$ is a unit vector (i.e. $\sum_{i=1}^{3} u_i^2 = 1$); thus, $\mathbf{u}$ lies on the sphere of unit radius. Note that this transformation was used by Kent (1982) and that is why I use it here. Chang (1986) used a more standard, I would say, transformation
\[
\mathbf{u} = (x, y, z) = \left[ \cos (lat) \cos (long), \; \cos (lat) \sin (long), \; \sin (lat) \right].
\]
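The forward transformation itself is not shown here; a minimal sketch of such a function (I use the name euclid, matching the name used later in this document, and Kent's version of the transformation, which the inverse function below assumes) is:
euclid <- function(u) {
  ## u is a vector or a two-column matrix: latitude and longitude in degrees
  u <- as.matrix(u)
  if ( ncol(u) == 1 )  u <- t(u)
  u <- u * pi / 180  ## from degrees to radians
  cbind( cos(u[, 1]), sin(u[, 1]) * cos(u[, 2]), sin(u[, 1]) * sin(u[, 2]) )
}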
## the cartesian coordinates
U <- as.matrix(U)
if (ncol(U) == 1) U <- t(U)
u <- cbind( acos(U[, 1]), ( atan(U[, 3]/U[, 2]) + pi * I(U[, 2]<0) )
%% (2 * pi) )
u <- u * 180/pi ## from rads to degrees
colnames(u) <- c("Lat", "Long")
## u is a matrix of two columns
## the first column is the latitude and the second the longitude in degrees
u
}
\[
\mathbf{c} = \frac{\mathbf{b} - \mathbf{a} \left( \mathbf{a}^T \mathbf{b} \right)}{\left\| \mathbf{b} - \mathbf{a} \left( \mathbf{a}^T \mathbf{b} \right) \right\|}
\]
\[
\mathbf{Q} = \mathbf{I}_p + \sin (\alpha) \mathbf{A} + \left[ \cos (\alpha) - 1 \right] \left( \mathbf{a} \mathbf{a}^T + \mathbf{c} \mathbf{c}^T \right) \tag{10.1}
\]
10.3 Rotation matrices on the sphere
We will see how we can obtain a rotation matrix in SO(3) when we have the rotation axis
and the angle of rotation. The SO(3) space denotes the special orthogonal group of all 3 × 3
orthogonal matrices whose determinant is 1. In addition, the inverse of a rotation matrix
is equal to its transpose. Suppose we have the rotation axis $\boldsymbol{\xi} = (\xi_1, \xi_2)$, where $\xi_1$ is the latitude and $\xi_2$ is the longitude, and the angle of rotation $\theta$ in degrees or radians. If the angle is expressed in degrees we turn it into radians using $\phi = \frac{\theta \pi}{180}$. We then transform $\boldsymbol{\xi}$ to the Cartesian coordinates as $\mathbf{t} = \left( \cos \xi_1 \cos \xi_2, \cos \xi_1 \sin \xi_2, \sin \xi_1 \right)$. Then, as Chang (1986) mentions, we construct the following matrix
\[
\mathbf{A} (\theta) = \mathbf{I} + \sin (\theta) \mathbf{L} + \left( 1 - \cos (\theta) \right) \mathbf{L}^2,
\]
where
\[
\mathbf{L} = \begin{pmatrix} 0 & -t_3 & t_2 \\ t_3 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{pmatrix}.
\]
if ( rads ) {
lat <- ksi[1]
long <- ksi[2]
the <- theta
} else {
lat <- ksi[1] * pi / 180
long <- ksi[2] * pi / 180
the <- theta * pi / 180
}
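  ## a sketch (not the original lines) of the omitted middle part: the axis in
  ## Euclidean coordinates and the skew-symmetric matrix L of the formula above
  t1 <- cos(lat) * cos(long)
  t2 <- cos(lat) * sin(long)
  t3 <- sin(lat)
  L <- rbind( c(0, -t3, t2),
              c(t3, 0, -t1),
              c(-t2, t1, 0) )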
diag(3) + L * sin(the) + L %*% L * ( 1 - cos(the) )
}
The inverse problem, when we have a rotation matrix in SO(3) and we want to find the
rotation axis and the angle of rotation (in degrees, not radians) is not difficult to do. I took
the next information from the course webpage of Howard E. Haber. Given a $3 \times 3$ rotation matrix $\mathbf{A}$ we work as follows
\[
\phi = \cos^{-1} \left( \frac{\mathrm{tr} (\mathbf{A}) - 1}{2} \right), \quad \theta = \frac{180 \phi}{\pi},
\]
and the rotation axis is parallel to
\[
\boldsymbol{\xi} = \frac{\left( A_{32} - A_{23}, \; A_{13} - A_{31}, \; A_{21} - A_{12} \right)}{\sqrt{\left( 3 - \mathrm{tr} (\mathbf{A}) \right) \left( 1 + \mathrm{tr} (\mathbf{A}) \right)}},
\]
where $\mathrm{tr} (\mathbf{A}) \neq -1, 3$ and the subscript $(ij)$ denotes the $(i, j)$ entry of the matrix $\mathbf{A}$.
} else {
tr <- sum( diag(A) )
rad <- acos(0.5 * (tr - 1))
angle <- rad * 180 / pi ## from rads to degrees
ksi <- c(A[3, 2] - A[2, 3], A[1, 3] - A[3, 1], A[2, 1] - A[1, 2])/
sqrt( (3 - tr) * (1 + tr) )
axis <- c( asin(ksi[3]), atan2(ksi[2], ksi[1]) )
axis <- c(axis / pi * 180) ## from radians to degrees
## if the latitude or longitude are negative add 360 (degrees)
axis[axis<0] <- axis[axis<0] + 360
names(axis) <- c("latitude", "longitude")
res <- list(angle = angle, axis = axis)
}
res
\[
\mathbf{Y} = \mathbf{A} \mathbf{X}.
\]
We wish to estimate this rotation matrix $\mathbf{A}$. Chang (1986) mentions that the estimate comes from the least squares method. He also mentions that the solution has already been given in closed form by Mackenzie (1957) and Stephens (1979). It is a singular value decomposition
\[
\mathbf{X} \mathbf{Y}^T = \mathbf{O}_1 \boldsymbol{\Lambda} \mathbf{O}_2^T, \quad \hat{\mathbf{A}} = \mathbf{O}_2 \mathbf{O}_1^T.
\]
if ( !rads ) {
x <- pi * x / 180 ## from degrees to rads
y <- pi * y / 180
} ## from degrees to rads
## the first row of both matrices is the latitude and the second is the longitude
## the next two rows transform the data to Euclidean coordinates
cosx1 <- cos(x[, 1]) ; cosy1 <- cos(y[, 1])
X <- cbind( cosx1 * cos(x[, 2]), cosx1 * sin(x[, 2]), sin(x[, 1]) )
Y <- cbind( cosy1 * cos(y[, 2]), cosy1 * sin(y[, 2]), sin(y[, 1]) )
} else if ( dim(x)[2] == 3 & dim(y)[2] == 3 ) {
X <- x
Y <- y
}
XY <- crossprod(X, Y) / n
b <- svd(XY) ## SVD of the XY matrix
A <- b$v %*% t(b$u)
if ( det(A) < 0 ) {
b$u[, 3] <- - b$u[, 3]
A <- tcrossprod(b$v, b$u )
}
Mardia and Jupp (2000) mentions that the circular-circular correlation type II we saw before (9.6) generalizes to
\[
r^2 = \mathrm{tr} \left( \mathbf{S}_{xx}^{-1} \mathbf{S}_{xy} \mathbf{S}_{yy}^{-1} \mathbf{S}_{yx} \right),
\]
provided that the block matrices $\mathbf{S}_{xx}$ and $\mathbf{S}_{yy}$ are non singular. Under $H_0$ (independence), $n r^2 \sim \chi^2_{pq}$. The R code is given below.
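A minimal sketch of such a function (the name circ.sph.cor is my own choice), using the sample covariance blocks of the two sets of unit vectors, is the following.
circ.sph.cor <- function(x, y) {
  ## x and y are the two samples of unit vectors (n x p and n x q matrices)
  n <- dim(x)[1]
  p <- dim(x)[2]   ;   q <- dim(y)[2]
  S <- cov( cbind(x, y) )       ## covariance matrix of the combined vector
  sxx <- S[1:p, 1:p, drop = FALSE]
  syy <- S[-(1:p), -(1:p), drop = FALSE]
  sxy <- S[1:p, -(1:p), drop = FALSE]
  r2 <- sum( diag( solve(sxx, sxy) %*% solve(syy, t(sxy)) ) )
  stat <- n * r2
  pvalue <- pchisq(stat, p * q, lower.tail = FALSE)
  c(r2 = r2, stat = stat, "p-value" = pvalue)
}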
10.6 Analysis of variance for (hyper-)spherical data
10.6.1 High concentration F test
Similarly to the high concentration F test for the circular data, we have a version for data in
S p−1 with p ≥ 3.
\[
F = \frac{(n-g)(p-1) \left( \sum_{i=1}^{g} R_i - R \right)}{(g-1)(p-1) \left( n - \sum_{i=1}^{g} R_i \right)}.
\]
A corrected version is
\[
F_c = \frac{\hat{\kappa}}{\hat{\gamma}} F,
\]
where $\hat{\kappa}$ is the common estimate of the concentration parameter, calculated by solving the equation $A_p \left( \hat{\kappa} \right) = \frac{R}{n}$, where $R$ is the resultant length based on all the data. The factor $\frac{1}{\hat{\gamma}}$ is given by
\[
\frac{1}{\hat{\gamma}} = \begin{cases} \frac{1}{\hat{\kappa}} - \frac{1}{5 \hat{\kappa}^3} & \text{if } p = 3 \\ \frac{1}{\hat{\kappa}} - \frac{p-3}{4 \hat{\kappa}^2} - \frac{p-3}{4 \hat{\kappa}^3} & \text{if } p > 3. \end{cases}
\]
Mardia and Jupp (2000) mention that the corrected approximate test statistic above is adequate for $\hat{\kappa} \geq 1$.
Ft <- ( (n - g) * (p - 1) * (sum(Ri) - R) )/( (g - 1) * (p - 1) * (n - sum(Ri)) )
if ( fc ) { ## correction is used
if (p == 3) {
Ft <- kappa * (1/kappa - 1/(5 * kappa^3)) * Ft
} else if (p > 3) {
Ft <- kappa * (1/kappa - (p - 3)/(4 * kappa^2) - (p - 3)/(4 * kappa^3)) * Ft
}
}
where $\alpha_p (\kappa)$ is given in (10.5) and $\tilde{\kappa}$ and $\hat{\kappa}$ are the maximum likelihood estimates of $\kappa$ under $H_0$ and $H_1$ respectively, given by
\[
A_p \left( \tilde{\kappa} \right) = \bar{R} \quad \text{and} \quad A_p \left( \hat{\kappa} \right) = \frac{1}{n} \sum_{i=1}^{g} R_i .
\]
S <- rowsum(x, ina)
Ri <- sqrt( rowSums(S^2) ) ## the resultant length of each group
S <- colSums(x)
R <- sqrt( sum(S^2) ) ## the resultant length based on all the data
Rk <- R/n
k <- numeric(4)
j <- 1
k[j] <- Rk * (p - Rk^2)/(1 - Rk^2)
j <- 2
k[j] <- k[j - 1] - (Apk(p, k[j - 1]) - Rk)/(1 - Apk(p, k[j - 1])^2 -
(p - 1)/k[j - 1] * Apk(p, k[j - 1]))
k1 <- k[j] ## concentration parameter under H1
apk0 <- (1 - p/2) * log(k0/2) + lgamma(p/2) +
log(besselI(k0, p/2 - 1, expon.scaled = T)) + k0
apk1 <- (1 - p/2) * log(k1/2) + lgamma(p/2) +
log(besselI(k1, p/2 - 1, expon.scaled = T)) + k1
w <- 2 * (k1 * sum(Ri) - k0 * R - n * apk1 + n * apk0)
Similar to the circular data, we have the embedding approach for hyper-spherical data as well. Mardia and Jupp (2000) has a mistake in the form of this test statistic (Equation (10.6.19) on page 225): the factor n is missing and I show the correct form here.
\[
F = \frac{(n-g)(p-1) \left( \sum_{i=1}^{g} n_i \bar{R}_i^2 - n \bar{R}^2 \right)}{(g-1)(p-1) \left( n - \sum_{i=1}^{g} n_i \bar{R}_i^2 \right)}.
\]
Rbar <- sqrt( sum(S^2) ) ## the mean resultant length based on all the data
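## a sketch (not the original lines) of the remaining computations, assuming
## ni, Rbi (the group mean resultant lengths) and g have been computed as in
## the previous functions
Ft <- ( (n - g) * (p - 1) * ( sum(ni * Rbi^2) - n * Rbar^2 ) ) /
  ( (g - 1) * (p - 1) * ( n - sum(ni * Rbi^2) ) )
pvalue <- pf( Ft, (g - 1) * (p - 1), (n - g) * (p - 1), lower.tail = FALSE )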
10.6.4 A test for testing the equality of the concentration parameters for spherical data
only
Mardia and Jupp (2000) provides a test for testing the equality of the concentration parame-
ter among g samples, where g ≥ 2 in the case of spherical data only. There are three distinct
cases, based on the value of R̄, the mean resultant length of all data.
• Case I. R̄ < 0.44. The test statistic has the following form
\[
U_1 = \sum_{i=1}^{g} w_i g_1^2 \left( 3 \bar{R}_i \right) - \frac{\left[ \sum_{i=1}^{g} w_i g_1 \left( 3 \bar{R}_i \right) \right]^2}{\sum_{i=1}^{g} w_i},
\]
where $w_i = \frac{5 (n_i - 5)}{3}$, with $n_i$ denoting the sample size of the i-th group, and $g_1(r) = \sin^{-1} \left( \frac{r}{\sqrt{5}} \right)$. $\bar{R}_i$ is the mean resultant length of the i-th group.
• Case II. 0.44 ≤ R̄ ≤ 0.67. The test statistic has the following form
\[
U_2 = \sum_{i=1}^{g} w_i g_2^2 \left( \bar{R}_i \right) - \frac{\left[ \sum_{i=1}^{g} w_i g_2 \left( \bar{R}_i \right) \right]^2}{\sum_{i=1}^{g} w_i},
\]
where $w_i = \frac{n_i - 4}{0.394}$ and $g_2(x) = \sin^{-1} \left( \frac{x + 0.176}{1.029} \right)$.
• Case III. R̄ > 0.67. For this high concentration case the test statistic is
\[
U_3 = \frac{1}{1+d} \left[ \nu \log \left( \frac{n - \sum_{i=1}^{g} R_i}{\nu} \right) - \sum_{i=1}^{g} \nu_i \log \left( \frac{n_i - R_i}{\nu_i} \right) \right],
\]
where $\nu_i = 2 (n_i - 1)$, $\nu = 2 (n - g)$ and $d = \frac{1}{3(g-1)} \left( \sum_{i=1}^{g} \frac{1}{\nu_i} - \frac{1}{\nu} \right)$.
spherconc.test <- function(x, ina) {
## x contains all the data
## ina is an indicator variable of each sample
if (p == 3) {
S <- rowsum(x, ina) / ni
Rbi <- sqrt( Rfast::rowsums(S^2) ) ## the mean resultant length of each group
S <- Rfast::colmeans(x)
Rb <- sqrt( sum(S^2) ) ## the mean resultant length of all the data
if ( Rb < 0.44 ) {
## case 1
g1 <- wi <- numeric(g)
wi <- ( 5 * (ni - 5) ) / 3
g1 <- asin(3 * Rbi/sqrt(5))
U1 <- sum(wi * g1^2) - ( sum(wi * g1) )^2/sum(wi)
stat <- U1
pvalue <- pchisq(stat, g - 1, lower.tail = FALSE)
mess <- paste(’The mean resultant length is less than 0.44. U1 was calculated’)
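    } else if ( Rb >= 0.44  &  Rb <= 0.67 ) {
      ## case 2: a sketch (not the original lines) of the omitted branch,
      ## following the U2 formula above
      wi <- (ni - 4) / 0.394
      g2 <- asin( (Rbi + 0.176) / 1.029 )
      U2 <- sum(wi * g2^2) - ( sum(wi * g2) )^2 / sum(wi)
      stat <- U2
      pvalue <- pchisq(U2, g - 1, lower.tail = FALSE)
      mess <- paste('The mean resultant length is between 0.44 and 0.67.
      U2 was calculated')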
} else if ( Rb > 0.67 ) {
## case 3
Ri <- Rbi * ni
vi <- 2 * (ni - 1)
v <- 2 * (n - g)
d <- 1/(3 * (g - 1)) * ( sum(1/vi) - 1/v )
U3 <- 1/(1 + d) * ( v * log( (n - sum(Ri) )/v ) - sum( vi * log( (ni - Ri)/vi) ) )
stat <- U3
pvalue <- pchisq(U3, g - 1, lower.tail = FALSE)
mess <- paste(’The mean resultant length is more than 0.67. U3 was calculated’)
}
} else {
stat <- NA
pvalue <- NA
mess <- paste("This test is valid only for spherical data")
}
When the concentration parameters of the different samples cannot be assumed equal we can use the following test statistic (Mardia and Jupp, 2000)
\[
T = 2 \left( \sum_{i=1}^{g} \hat{\kappa}_i n_i \left\| \bar{\mathbf{x}}_i \right\| - \left\| \sum_{i=1}^{g} \hat{\kappa}_i n_i \bar{\mathbf{x}}_i \right\| \right),
\]
where $\hat{\kappa}_i = A_p^{-1} \left( \bar{R}_i \right)$. Under $H_0$, the large sample asymptotic distribution of T is $\chi^2_{(g-1)(p-1)}$.
x <- as.matrix(x)
x <- x / sqrt( Rfast::rowsums(x^2) ) ## makes sure x are unit vectors
p <- dim(x)[2] ## dimensionality of the data
n <- dim(x)[1] ## sample size of the data
ni <- as.vector(table(ina)) ## group sample sizes
for (i in 1:g) {
kappa[i] <- vmf( x[ina == i, ] )$kappa
}
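## a sketch (not the original lines) of the remaining computations, following
## the formula for T above
Si <- rowsum(x, ina)    ## the sum of the unit vectors in each group (n_i * xbar_i)
T1 <- sum( kappa * sqrt( Rfast::rowsums(Si^2) ) )
T2 <- sqrt( sum( Rfast::colsums(kappa * Si)^2 ) )
Ta <- 2 * (T1 - T2)
pvalue <- pchisq( Ta, (g - 1) * (p - 1), lower.tail = FALSE )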
The von Mises-Fisher distribution is the generalization of the von Mises distribution (on the circle) to the sphere in $R^3$ (or $S^2$) and the hypersphere in $R^p$ (or $S^{p-1}$), $p > 3$. Its density is given by
\[
f_p (\mathbf{x}; \boldsymbol{\mu}, \kappa) = C_p (\kappa) \exp \left( \kappa \boldsymbol{\mu}^T \mathbf{x} \right), \tag{10.2}
\]
where
\[
\kappa \geq 0, \quad \left\| \boldsymbol{\mu} \right\| = 1 \quad \text{and} \quad C_p (\kappa) = \frac{\kappa^{p/2-1}}{(2\pi)^{p/2} I_{p/2-1} (\kappa)},
\]
where $I_v (z)$ denotes the modified Bessel function of the first kind and order v calculated at z.
Maximum likelihood estimation of the parameters does not require numerical optimiza-
tion of the corresponding log-likelihood. The estimated mean direction is available in closed
form given by
\[
\hat{\boldsymbol{\mu}} = \frac{\bar{\mathbf{x}}}{\left\| \bar{\mathbf{x}} \right\|},
\]
where $\| \cdot \|$ denotes the Euclidean norm on $R^d$. The concentration parameter though needs two steps of a truncated Newton-Raphson algorithm (Sra, 2012)
\[
\hat{\kappa}^{(t)} = \hat{\kappa}^{(t-1)} - \frac{A_p \left( \hat{\kappa}^{(t-1)} \right) - \bar{R}}{1 - A_p \left( \hat{\kappa}^{(t-1)} \right)^2 - \frac{p-1}{\hat{\kappa}^{(t-1)}} A_p \left( \hat{\kappa}^{(t-1)} \right)}, \tag{10.3}
\]
where
\[
A_p \left( \hat{\kappa} \right) = \frac{I_{p/2} \left( \hat{\kappa} \right)}{I_{p/2-1} \left( \hat{\kappa} \right)} = \frac{\left\| \sum_{i=1}^{n} \mathbf{x}_i \right\|}{n} = \bar{R}, \tag{10.4}
\]
and $I_p \left( \hat{\kappa} \right)$ is the modified Bessel function of the first kind (see Abramowitz and Stegun (1970)). Similarly to Sra (2012) we will use $\hat{\kappa}^{(0)} = \frac{\bar{R} \left( p - \bar{R}^2 \right)}{1 - \bar{R}^2}$ as the starting value in (10.3). The variance of $\hat{\kappa}$ is given by (Mardia and Jupp, 2000)
\[
\mathrm{var} \left( \hat{\kappa} \right) = \left[ n \left( 1 - \frac{p-1}{\hat{\kappa}} A_p \left( \hat{\kappa} \right) - A_p \left( \hat{\kappa} \right)^2 \right) \right]^{-1}.
\]
The modified Bessel function in R gives us the option to scale it exponentially. This means that it calculates the quantity $I_p \left( \hat{\kappa} \right) e^{-\hat{\kappa}}$ instead. This is useful because, when large numbers are plugged into the Bessel function, R needs the exponential scaling to calculate the ratio of the two Bessel functions. Note that we can use this to calculate the parameters of the von Mises distribution as well, since the von Mises distribution is simply the von Mises-Fisher distribution on the circle, with p = 2.
x <- as.matrix(x)
x <- x / sqrt( Rfast::rowSums(x^2) )
p <- dim(x)[2] ## dimensionality of the data
n <- dim(x)[1] ## sample size of the data
}
m1 <- Rfast::colsums(x)
R <- sqrt( sum(m1^2) )/n ## mean resultant length
m <- m1 / (n * R)
k <- numeric(4)
i <- 1
k[i] <- R * (p - R^2)/(1 - R^2)
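## a sketch (not the original lines) of the remaining Newton-Raphson steps of
## (10.3), using the same Apk function (the ratio of Bessel functions A_p)
## employed earlier in this document
for (i in 2:3) {
  k[i] <- k[i - 1] - ( Apk(p, k[i - 1]) - R ) /
    ( 1 - Apk(p, k[i - 1])^2 - (p - 1)/k[i - 1] * Apk(p, k[i - 1]) )
}
kappa <- k[i]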
Alternatively, and perhaps more easily, if you want to estimate the concentration parameter κ you can solve equation (10.4) numerically (function uniroot) and thus replace the Newton-Raphson algorithm in the above function. Another way is to optimize, numerically, the log-likelihood with respect to κ. After calculating the mean direction, simply use the function optimize and that is it. If you calculate the log-likelihood with respect to κ for a number of values of κ and then plot it, you will see its curve graphically.
10.7.2 (Hyper-)spherical median direction
Fisher (1985) introduced the idea of the spherical median direction. It is the unit vector $\mathbf{m}$ which minimizes the sum of the arc distances of all the points
\[
\sum_{i=1}^{n} \cos^{-1} \left( \mathbf{x}_i^T \mathbf{m} \right).
\]
mediandir_2 = function(x) {
## x is the directional data
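  ## a sketch (not the original lines) of the omitted objective function: the
  ## sum of the arc distances from the candidate point, projected onto the sphere
  funa <- function(pa) {
    pa <- pa / sqrt( sum(pa^2) )
    sum( acos( x %*% pa ) )
  }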
pa <- Rfast::colMedians(x)
bar <- nlm( funa, pa, iterlim = 10000 )
bar <- nlm( funa, bar$estimate, iterlim = 10000 )
bar <- optim( bar$estimate, funa, control = list(maxit = 10000) )
med <- bar$par
med / sqrt( sum(med^2) )
However, time can become an issue, especially with large scale data, and given that this is a numerical optimiser, errors can occur; for example, the correct estimated median might not be found, but a very nearby point may be returned. For these two reasons I will also provide a fixed-point iteration solution, which was given by Cabrera and Watson (1990) and is much faster and more robust (or more reliable if you prefer).
\[
\mathbf{m}^{(k+1)} = \text{unit vector parallel to} \; \sum_{i=1}^{n} \frac{\mathbf{x}_i}{\sqrt{1 - \left( \mathbf{x}_i^T \mathbf{m}^{(k)} \right)^2}}.
\]
p <- dim(x)[2]
pa <- Rfast::colmeans(x)
u1 <- pa / sqrt( sum(pa^2) )
ww <- as.vector( sqrt( 1 - ( x %*% u1 )^2 ) )
u2 <- Rfast::colsums(x / ww )
u2 <- u2 / sqrt( sum( u2^2 ) )
u2
}
where
\[
C_p (h) = \frac{\left( 1/h^2 \right)^{p/2-1}}{(2\pi)^{p/2} I_{p/2-1} \left( 1/h^2 \right)}.
\]
So, it is pretty much the same as (10.2), but instead of $\kappa$ there is $1/h^2$, instead of $\boldsymbol{\mu}$ we have $\mathbf{x}_i$, and there is an average in front.
How does one choose h? We have seen the same question before: either using a rule of thumb (García-Portugués, 2013) or by maximum likelihood cross validation (5.5). Let us say $q = p - 1$, where p is the number of variables, or dimensions in the Euclidean space,
so $S^{p-1} = S^q$. The rule of thumb by García-Portugués (2013) is
\[
h_{ROT} = \begin{cases} \left[ \frac{8 \sinh^2 (\hat{\kappa})}{\hat{\kappa} n \left[ \left( 1 + 4 \hat{\kappa}^2 \right) \sinh (2\hat{\kappa}) - 2 \hat{\kappa} \cosh (2\hat{\kappa}) \right]} \right]^{\frac{1}{6}}, & q = 2 \\ \left[ \frac{4 \pi^{1/2} I_{\frac{q-1}{2}} (\hat{\kappa})^2}{\hat{\kappa}^{\frac{q+1}{2}} n \left[ 2 q I_{\frac{q+1}{2}} (2\hat{\kappa}) + (2+q) \hat{\kappa} I_{\frac{q+3}{2}} (2\hat{\kappa}) \right]} \right]^{\frac{1}{4+q}}, & q \geq 2 \end{cases}
\]
The following R code calculates the kernel density estimates of a directional data sample.
The next R code chooses the value of h via maximum likelihood cross validation (5.5).
runtime <- proc.time()
for ( j in 1:length(h) ) {
A <- d/h[j]^2
cpk <- ( (1/h[j]^2)^(p/2 - 1) ) / ( (2 * pi)^(p/2) * besselI(1/h[j]^2, p/2 - 1) )
f <- rowSums( exp(A + log(cpk)), na.rm = T )/(n - 1)
cv[j] <- mean(log(f))
}
The next code is a bit faster, since it uses optimize and not a for loop.
a <- optimize(funa, c(low, up), maximum = TRUE)
res <- c(a$maximum, a$objective)
names(res) <- c("Optimal h", "cv")
res
}
The von Mises-Fisher distribution is a fundamental distribution for directional data. However, there is a simpler one, the uniform distribution on the (hyper)sphere (or circle of course). If the concentration parameter $\kappa$ of the von Mises-Fisher distribution is 0, then we end up with the uniform distribution. Mardia et al. (1979) and Mardia and Jupp (2000) mention the Rayleigh test for testing the null hypothesis that $\kappa = 0$ against the alternative of $\kappa > 0$. They mention that under the null hypothesis
\[
T = n p \bar{R}^2 \sim \chi^2_p,
\]
where n and p are the sample size and the number of dimensions and $\bar{R} = \frac{\left\| \sum_{i=1}^{n} \mathbf{x}_i \right\|}{n}$, also given in (10.4). Mardia et al. (1979, pg. 440) mentions that the case of p = 3 was first proved by Rayleigh (1919).
The error in the above approximation of the test statistic is of order $O \left( n^{-1} \right)$. In Mardia and Jupp (2000) a better approximation can be found, which reduces the error to $O \left( n^{-2} \right)$.
The function below offers the possibility of a parametric bootstrap calculation of the p-value, for the non modified test statistic. We recall that we must simulate from a multivariate normal with the zero vector as the mean vector and the identity as the covariance matrix. We then project the values onto the (hyper)sphere and this results in the uniform distribution on the (hyper)sphere. Thus we generate values from a uniform many times in order to do the parametric bootstrap (simulating under the null hypothesis, that of uniformity).
rayleigh <- function(x, modif = TRUE, B = 999) {
## x contains the data in Euclidean coordinates
## B is by default equal to 999 bootstrap samples
## If B==1 then no bootstrap is performed
p <- dim(x)[2] ## dimensionality of the data
n <- dim(x)[1] ## sample size of the data
m <- Rfast::colsums(x)
test <- sum( m^2 ) *p / n
if ( modif ) {
test <- ( 1 - 1/(2 * n) ) * test + test^2 / ( 2 * n * (p + 2) )
}
if (B == 1) {
pvalue <- pchisq(test, p, lower.tail = FALSE)
res <- c(test, pvalue)
names(res) <- c(’test’, ’p-value’)
} else {
tb <- numeric(B)
for (i in 1:B) {
x <- matrix( RcppZiggurat::zrnorm(n * p), ncol = p )
x <- x / sqrt( Rfast::rowsums(x^2) )
mb <- Rfast::colsums(x)
tb[i] <- p * sum( mb^2 ) / n
}
res <- c( test, (sum(tb > test) + 1)/(B + 1) )
names(res) <- c(’test’, ’Bootstrap p-value’)
}
res
where
\[
\alpha_p (\kappa) = (1 - p/2) \log \frac{\kappa}{2} + \log \Gamma \left( \frac{p}{2} \right) + \log I_{p/2-1} (\kappa). \tag{10.5}
\]
Under the null hypothesis $w \sim \chi^2_{p-1}$. The next R function offers a bootstrap calibration of the test statistic. To transform the data under the null hypothesis, we use the rotation function we saw before.
## log-likelihood under H0
qa0 <- optimize(lik, c(0, 100000), x = x, maximum = TRUE)
k0 <- qa0$maximum ## concentration parameter under H0
apk0 <- (1 - p/2) * log(k0/2) + lgamma(p/2) +
log( besselI(k0, p/2 - 1, expon.scaled = TRUE) ) + k0
apk1 <- (1 - p/2) * log(k1/2) + lgamma(p/2) +
log( besselI(k1, p/2 - 1, expon.scaled = TRUE) ) + k1
w <- 2 * n * (k1 * sqrt(sum(xbar^2)) - k0 * sum(mu * xbar) - apk1 + apk0)
if (B == 1) pvalue <- pchisq(w, p - 1, lower.tail = FALSE)
if (B > 1) {
A <- rotation(m1, mu)
y <- tcrossprod(x, A) ## bring the data under H0
## y has mean direction equal to mu
wb <- numeric(B)
for (i in 1:B) {
nu <- sample(1:n, n, replace = TRUE)
z <- y[nu, ]
k1 <- vmf(z)$k ## concentration parameter under H1
zbar <- Rfast::colmeans(z) ## z-bar
\[
f (\mathbf{x} | \mathbf{A}, \boldsymbol{\gamma}) = \frac{1}{c (\mathbf{A}, \boldsymbol{\gamma})} \exp \left( -\mathbf{x}^T \mathbf{A} \mathbf{x} + \boldsymbol{\gamma}^T \mathbf{x} \right), \tag{10.6}
\]
where, without loss of generality, $\mathbf{A}$ can be taken to be $\mathrm{diag} \left( \lambda_1, \ldots, \lambda_p \right)$, with $0 < \lambda_1 \leq \ldots \leq \lambda_p$, where $\lambda_i$ is the i-th eigenvalue of the matrix $\mathbf{A}$. The $\mathbf{A}$ matrix is the Bingham part. The vector $\boldsymbol{\gamma} = \left( \gamma_1, \ldots, \gamma_p \right)$ is the Fisher part.
Kume and Wood (2005) derived the saddlepoint approximations to the normalizing constant of the Fisher-Bingham distribution. The Fisher and the Bingham distribution can be considered as special cases of the aforementioned distribution. Their paper is a bit technical and not easy to understand at a glance. For this reason we will try to explain, briefly, the calculations required to derive the approximation. We will follow the same notation as in their paper, for consistency and for the reader's convenience.
Saddlepoint approximation requires a cumulant generating function as its starting point
(Butler, 2007). In this case that is given by
\[
K_{\theta} (t) = \sum_{i=1}^{p} \left\{ -\frac{1}{2} \log \left( 1 - t/\lambda_i \right) + \frac{1}{4} \frac{\gamma_i^2}{\lambda_i - t} - \frac{\gamma_i^2}{4 \lambda_i} \right\} \quad (t < \lambda_1). \tag{10.7}
\]
The first order saddlepoint density approximation of $f_{\theta} (\alpha)$ (the $f_{\theta}$ evaluated at a point $\alpha$) is
\[
\hat{f}_{\theta,1} (\alpha) = \left[ 2 \pi \hat{K}_{\theta}^{(2)} \left( \hat{t} \right) \right]^{-1/2} \exp \left( \hat{K}_{\theta} \left( \hat{t} \right) - \hat{t} \alpha \right), \tag{10.8}
\]
where $\hat{t}$ is the unique solution in $(-\infty, \lambda_1)$ to the saddlepoint equation $\hat{K}_{\theta}^{(1)} \left( \hat{t} \right) = \alpha$, and in our case $\alpha = 1$ (see the paper by Kume and Wood (2005) for more information why). In fact $\hat{t}$ has a bounded range (of a simple form), but we will not mention it here; $\hat{t}$ can be found accurately using numerical methods, e.g. a root solver (available in R).
The second and third order saddlepoint density approximations of $f_{\theta} (\alpha)$ are given by
\[
\hat{f}_{\theta,2} (1) = \hat{f}_{\theta,1} (1) (1 + T) \quad \text{and} \quad \hat{f}_{\theta,3} (1) = \hat{f}_{\theta,1} (1) \exp (T) \quad \text{respectively}, \tag{10.9}
\]
where $T = \frac{1}{8} \hat{\rho}_4 - \frac{5}{24} \hat{\rho}_3^2$, with $\hat{\rho}_j = \frac{K_{\theta}^{(j)} \left( \hat{t} \right)}{\left[ K_{\theta}^{(2)} \left( \hat{t} \right) \right]^{j/2}}$.
The Fisher-Bingham normalising constant is written as
\[
c (\boldsymbol{\lambda}, \boldsymbol{\gamma}) = 2 \pi^{p/2} \left( \prod_{i=1}^{p} \lambda_i^{-1/2} \right) f_{\theta} (1) \exp \left( \frac{1}{4} \sum_{i=1}^{p} \frac{\gamma_i^2}{\lambda_i} \right). \tag{10.10}
\]
The R function below calculates the saddlepoint approximations of the normalizing con-
stants of the Fisher, the Bingham and the Fisher-Bingham distribution. For the Bingham
part it only accepts the eigenvalues of the B matrix. All you need to do is give it what it
needs.
In Kume and Wood (2005) there is an important property which we should take into
account. On page 468 of their paper they state that ”A useful practical consequence of this
equivariance property is that, when using the approximation ĉk (λ, γ) we can dispense with the
restriction that the λi be strictly positive, even though, in the saddlepoint density approximation
(11), the λi do need to be positive”. But what is this equivariance property they are referring
to? This property states that
\[
c (\boldsymbol{\lambda}, \boldsymbol{\gamma}) = c \left( \boldsymbol{\lambda} + a \mathbf{1}_p, \boldsymbol{\gamma} \right) e^{a}.
\]
So, in the case where one or possibly more eigenvalues of the B matrix are negative, if we make them all positive by adding a scalar a, then the final saddlepoint approximation to the normalizing constant must be multiplied by the exponential of that scalar. This, I would say, is a property which helps things a lot. A final note is that the next R function calculates the logarithm of the normalizing constant.
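For example, with a zero Fisher part and eigenvalues (1, 2, 3), the property implies that the two calls below should essentially agree (the function returns logarithms, so the shift a = 10 is simply added back); this is only an illustrative check, not part of the original text.
fb.saddle( c(0, 0, 0), c(1, 2, 3) )
fb.saddle( c(0, 0, 0), c(1, 2, 3) + 10 ) + 10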
If you are a Matlab user, then you are directed to Simon Preston’s homepage. In his
section Files you can find Matlab codes to calculate the saddlepoint approximations of the
Fisher-Bingham distribution. These codes were designed for the normalizing constant of the
Fisher-Bingham distributions products of spheres and Stiefel manifolds, using Monte Carlo
methods as well (see Kume et al. (2013)). A main difference the reader must notice is that in
Kume et al. (2013) the Bingham part in the Fisher-Bingham density does not have a minus
sign (−) as in our case (see (10.6), there is a minus sign). Simon’s code uses the notation of
Kume et al. (2013). Furthermore, in Simon’s section Shape analysis the interested reader will
303
find Matlab codes for shape analysis.
0.5 * log( kfb(2, gam, lam, tau) ) - 0.5 * sum( log(lam - tau) ) -
tau + 0.25 * sum( gam^2/(lam - tau) )
## c1 <- sqrt(2) * pi^(0.5 * (p - 1) ) * kfb(2, gam, lam, tau)^(-0.5) *
## prod(lam - tau)^(-0.5) * exp( -tau + 0.25 * sum( gam^2/(lam - tau) ) )
c2 <- c1 + log1p(Ta)
c3 <- c1 + Ta
## the next multiplications brings the modification with the negative
## values in the lambdas back
if (mina <= 0) {
c1 <- c1 + aaa
c2 <- c2 + aaa
c3 <- c3 + aaa
}
logcon <- c(c1, c2, c3)
names(logcon) <- c("first order", "second order", "third order")
logcon
}
10.7.7 Normalizing constant of the Bingham and the Fisher-Bingham distributions us-
ing MATLAB
As we mentioned before, Simon Preston's homepage contains Matlab codes for calculating the normalizing constant of the Fisher-Bingham distribution. For those who rely more on Matlab than on R, for those who want to calculate the normalizing constant using Monte Carlo, for example, or who want the normalizing constant on products of spheres and Stiefel manifolds and do not know R, the answer is here. Kwang-Rae Kim from the University of Nottingham helped me create a front end with Matlab; that is, implement the Matlab functions in Matlab and get the answer using only R. The user needs to have Matlab v6 or higher installed on his/her computer.
At first we need to connect R with Matlab. For this reason we must download the R package R.matlab (Bengtsson, 2014). We then save the file FB.zip from Simon Preston's homepage onto our computer. The .zip file has a regular folder inside called FB norm const. Inside FB norm const there are two folders, spheres and stiefel. We are interested in the first folder (I do not know much about Stiefel manifolds). The reader who knows about them can do the same as what we describe below.
We take the folder spheres and save it somewhere on our computer (the desktop, say). You can also unzip the FB.zip file and do the same things.
We then load the library into R and do the following steps
2. Type Matlab$startServer()
Wait until the server is open. This will create three files in the folder spheres.
Next time you do the same work, delete them first. I do not think it affects next time,
but just in case.
3. Type matlab=Matlab()
4. Type isOpen=open(matlab)
We are almost there; we have a connection with Matlab. Open the folder spheres to see what is in there. We are interested in two Matlab functions, logNormConstSP and logNormConstMC. The first uses the saddlepoint approximation and the second uses Monte Carlo. I will show how to use only the first one (the syntax for Monte Carlo is the same apart from an extra parameter, n, the number of Monte Carlo samples), and only in the one-sphere case. For the case of products of spheres see the function inside; Simon explains the arguments.
The function has this name logC = logNormConstSP(d,a,B,approxType). The argument
d is the number of dimensions, the argument a is the vector γ in (10.6) and the argument B
is the matrix −A in (10.6). A key thing is that in Kume et al. (2013) the Bingham part in the
Fisher-Bingham density does not have a minus sign (−) as in our case (in (10.6) there is a
minus sign). Finally approxType takes the values 1, 2 or 3 corresponding to the first (10.8),
second and third order (10.9) saddlepoint approximations. The value 4 produces a vector
with all three orders. A second key thing we must highlight is that Simon calculates the
logarithm of the constant, so the final answer should be exponentiated.
Let us calculate, for example, the Bingham normalizing constant. This means that γ = 0 and B is a matrix. Say the eigenvalues of B are (1, 2, 3). This means that Simon's Matlab code needs the negative eigenvalues, or in general the negative of the matrix B we have. Let us see this example. Type in R
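The exact commands were omitted here; a sketch of the call, mirroring the wrapper function given later (and assuming d = 3 for the sphere), would be:
setVariable(matlab, d = 3)
setVariable(matlab, a = matrix(0, 3, 1) )
setVariable(matlab, B = -diag( c(1, 2, 3) ) )   ## the negative of our B
evaluate(matlab, "logC = logNormConstSP(d, a, B, 3) ; ")
res <- getVariable(matlab, "logC")
res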
$logC
[,1]
[1,] 0.6595873
attr(,"header")$description
[1] "MATLAB 5.0 MAT?file, Platform: PCWIN64, Created on: Wed Feb 19 11:36:59 2014
"
attr(,"header")$version
[1] "5"
attr(,"header")$endian
[1] "little"
The answer is the logarithm of the third order (10.9) saddlepoint approximation to the normalizing constant of the Bingham distribution (the vector γ is zero). The result is stored in res$logC. Compare this with the answer from fb.saddle(c(0,0,0),c(1,2,3)).
Below we summarize the steps in two R codes. At first the user must run these commands (copy and paste as they are) in order to make the connection between the two programs.
Then the function one needs to use every time for calculating the Fisher-Bingham normalizing constant (using saddlepoint approximation or Monte Carlo integration) is given below. The convenience of this function is that one does not need to know the Matlab syntax. Note that the input parameters are the same as in the function fb.saddle; that is, put the same matrix B or the eigenvalues. Inside the function, I put a minus sign (−) to agree with Simon's code. The parameter d is a number or a vector of length equal to the number of spheres we have (Kume et al. (2013) calculate the normalizing constant for products of spheres, not just one sphere). If it is a number then it contains the number of dimensions of the sphere. If it is a vector, then it contains the dimensions of the spheres. Note that all the spheres in this case have the same dimensions. The parameter a is the Fisher part of the Fisher-Bingham distribution and the matrix B is the Bingham part. Do not forget to change R's working directory to the folder spheres, as we said before.
setVariable(matlab, d = d)
setVariable(matlab, a = a)
setVariable(matlab, B = -B)
if (method == "SP") {
## this does saddlepoint approximation
evaluate(matlab, "logC = logNormConstSP(d, a, B, 3) ; ")
res <- getVariable(matlab, "logC")
result <- list(norm.const = res$logC)
}
if (method == "MC") {
## this does Monte Carlo integration
evaluate(matlab, "[logC, se_logC] = logNormConstMC(d, a, B, 1e + 05) ; ")
res <- getVariable(matlab, "logC")
se.const <- getVariable(matlab, "se_logC")
result <- list(norm.const = res$logC, se.norm.const = se.const$se.logC)
}
result
}
The Kent distribution was proposed by John Kent (Kent, 1982) as a sub-model of the Fisher-Bingham distribution on the sphere. So, I will focus on the sphere only here. Its density function is given by (Kent, 1982)
\[
f (\mathbf{x}) = c (\kappa, \beta)^{-1} \exp \left\{ \kappa \boldsymbol{\alpha}_1^T \mathbf{x} + \beta \left[ \left( \boldsymbol{\alpha}_2^T \mathbf{x} \right)^2 - \left( \boldsymbol{\alpha}_3^T \mathbf{x} \right)^2 \right] \right\}, \tag{10.11}
\]
where $\kappa$, $\beta$ and $\mathbf{A} = \left( \boldsymbol{\alpha}_1, \boldsymbol{\alpha}_2, \boldsymbol{\alpha}_3 \right)$ are parameters that have to be estimated. Kent (1982) mentions that $\kappa$ and $\beta$ represent the concentration and the ovalness of the distribution respectively, and these two parameters will be estimated via numerical maximization of the log-likelihood. The normalizing constant in (10.11) depends upon these two parameters only, but its exact calculation has been almost impossible up to now. For this reason we will approximate it using the saddlepoint approximation of Kume and Wood (2005) we saw before (see Section 10.7.6). We need to suppose though that $2\beta < \kappa$ in order for the distribution to have the correct behaviour. Note that if $\beta = 0$, then we have the von Mises-Fisher density. Finally, $\mathbf{A}$ is an orthogonal matrix where $\boldsymbol{\alpha}_1$ is the mean direction or pole, $\boldsymbol{\alpha}_2$ is the major axis and $\boldsymbol{\alpha}_3$ is the minor axis.
The Fisher-Bingham distribution is written as
\[
f (\mathbf{x}) \propto \exp \left( \kappa \mathbf{x}^T \boldsymbol{\mu} + \mathbf{x}^T \mathbf{A} \mathbf{x} \right) \quad \text{or as} \quad f (\mathbf{x}) \propto \exp \left( \kappa \mathbf{x}^T \boldsymbol{\mu} - \mathbf{x}^T \mathbf{A} \mathbf{x} \right).
\]
The first form is where (10.11) comes from, but the second form is used in Kent et al. (2013) and in Kume and Wood (2005). In the first case $\mathbf{A} = \mathrm{diag} (0, \beta, -\beta)$. We will use the second case, since the normalizing constant (Section 10.7.6) utilizes the second formula. In both cases though, the normalizing constant depends upon $\kappa$ and $\beta$ only. The normalizing constant we saw in Section 10.7.6 requires the $\boldsymbol{\gamma}$ vector and the $\boldsymbol{\lambda}$ vector. In the second case we need to use $\boldsymbol{\gamma} = (0, \kappa, 0)^T$ and $\boldsymbol{\lambda} = (0, -\beta, \beta)^T$ as input values in the function fb.saddle we saw in Section 10.7.6. In terms of Simon's MATLAB function (see Section 10.7.7) we would specify $\boldsymbol{\gamma} = (0, 0, \kappa)^T$ and $\boldsymbol{\lambda} = (\beta, -\beta, 0)^T$.
So, the log-likelihood of the Kent distribution from (10.11) is
\[
\ell = -n \log c (\kappa, \beta) + \kappa \sum_{i=1}^{n} \boldsymbol{\alpha}_1^T \mathbf{x}_i + \beta \left[ \sum_{i=1}^{n} \left( \boldsymbol{\alpha}_2^T \mathbf{x}_i \right)^2 - \sum_{i=1}^{n} \left( \boldsymbol{\alpha}_3^T \mathbf{x}_i \right)^2 \right]. \tag{10.12}
\]
We will now describe the estimation of the parameters of (10.11) as Kent (1982) mentions. For the orthogonal matrix $\mathbf{A}$ we will mention the moment estimation. We must choose an orthogonal matrix $\mathbf{H}$ to rotate the mean vector $\bar{\mathbf{x}} = n^{-1} \left( \sum_{i=1}^{n} x_{1i}, \sum_{i=1}^{n} x_{2i}, \sum_{i=1}^{n} x_{3i} \right)^T$ to the north polar axis $(1, 0, 0)^T$. So, $\mathbf{H}$ can be
\[
\mathbf{H} = \begin{pmatrix} \cos \theta & -\sin \theta & 0 \\ \sin \theta \cos \phi & \cos \theta \cos \phi & -\sin \phi \\ \sin \theta \sin \phi & \cos \theta \sin \phi & \cos \phi \end{pmatrix},
\]
where $\theta$ and $\phi$ are the polar co-ordinates of $\bar{\mathbf{x}}$. Let $\mathbf{B} = \mathbf{H}^T \mathbf{S} \mathbf{H}$, where $\mathbf{S} = n^{-1} \sum \mathbf{x}_i \mathbf{x}_i^T$. Then choose a rotation $\mathbf{K}$ about the north pole to diagonalize $\mathbf{B}_L$, where
\[
\mathbf{B}_L = \begin{pmatrix} b_{22} & b_{23} \\ b_{32} & b_{33} \end{pmatrix},
\quad \text{so that} \quad
\mathbf{K} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos \psi & -\sin \psi \\ 0 & \sin \psi & \cos \psi \end{pmatrix}.
\]
The moment estimate of A is given by à = HK. As for the parameters κ and β we will max-
imize (10.12) with respect to these two parameters. I repeat that we will use γ = (0, κ, 0) T
and λ = (0, − β, β) T as input values in the function fb.saddle we saw in Section 10.7.6. The
next R function calculates the A matrix, the κ and β and the log-likelihood and has been
tested with the data that appear in Kent (1982). Some elements in the A matrix are slightly
different (numerical errors possibly), but I do not think this is an issue.
In a recent communication I had with Professor Kent (Leeds university) he wrote this in
his e-mail: ”Note that 2| β|/κ is analogous in some ways to the correlation parameter for a bivariate
normal distribution. In particular, negative values are just as meaningful as positive values”.
## the next function will be used to estimate the kappa and beta
xg1 <- sum( x %*% G[, 1] )
xg2 <- sum( ( x %*% G[, 2] )^2 )
xg3 <- sum( ( x %*% G[, 3])^2 )
Kent (1982) gave the formula to calculate the normalising constant exactly
\[
c (\kappa, \beta) = 2\pi \sum_{j=0}^{\infty} \frac{\Gamma (j + 0.5)}{\Gamma (j + 1)} \, \beta^{2j} \left( \frac{\kappa}{2} \right)^{-2j-0.5} I_{2j+0.5} (\kappa), \tag{10.13}
\]
where $I_v (z)$ denotes the modified Bessel function of the first kind and order v calculated at z.
I did a few experiments and saw that the saddlepoint approximation of Kume and Wood (2005) gives very accurate results, very close to the true values. I am nevertheless using the saddlepoint approximation in the function kent.mle because it is faster than the exact calculation.
j <- 0:j
ka <- 2 * pi * gamma(j + 0.5) / gamma(j + 1)* b^( 2 * j ) *
( k / 2 )^( -2 * j - 0.5 ) * besselI(k, 2 * j + 0.5)
log( sum(ka) )
Kent (1982) proposed a test statistic to test whether a von Mises-Fisher distribution on the sphere is preferable to a Kent distribution. To be honest, I could not reproduce the test statistic; something went wrong in my derivation and I could not see where. For this reason I will describe the test as I found it in Rivest (1986).
\[
\hat{\mathbf{P}} = \mathbf{I}_3 - \frac{\left( \mathbf{e}_1 - \hat{\boldsymbol{\mu}} \right) \left( \mathbf{e}_1 - \hat{\boldsymbol{\mu}} \right)^T}{1 - \hat{\mu}_1},
\]
where $\mathbf{e}_1 = (1, 0, 0)^T$ and $\hat{\mu}_1$ is the first element of the sample mean direction. Note that $\hat{\mathbf{P}}$ is a symmetric matrix whose first column (or first row) is the sample mean direction $\hat{\boldsymbol{\mu}}$.
3. Calculate $\mathbf{z}_i = \hat{\mathbf{P}} \mathbf{x}_i$ and take $\mathbf{y}_i$, which consists of the last two elements of $\mathbf{z}_i$, i.e. $\mathbf{y}_i = (z_{2i}, z_{3i})$.
The R function presented below offers the possibility of non parametric bootstrap as well.
if (B == 1) {
pvalue <- pchisq(Ta, 2, lower.tail = FALSE)
res <- c(Ta, pvalue)
names(res) <- c(’test’, ’p-value’)
} else {
Tb <- numeric(B)
for (i in 1:B) {
nu <- sample(1:n, n, replace = TRUE)
z <- x[nu, ]
estim <- vmf(z)
k <- estim$k ## the estimated concentration parameter
## under the H0, that the Fisher distribution is true
mu <- estim$mu ## the estimated mean direction under H0
P <- i3 - tcrossprod(e1 - mu) / (1 - mu[1])
y <- tcrossprod(z, P)[, 2:3]
lam <- eigen( crossprod(y) / n, symmetric = TRUE )$values
rat <- besselI(k, 0.5, expon.scaled = TRUE)/besselI(k, 2.5, expon.scaled = TRUE)
Tb[i] <- n * (k/2)^2 * rat * (lam[1] - lam[2])^2
}
res
}
Wood (1994) provided a new algorithm for simulating from the von Mises-Fisher distribution. It is essentially a rejection sampling algorithm, which we meet again in Dhillon and Sra (2003). We wrote the R code presented below based on the paper by Dhillon and Sra (2003). The arguments of the algorithm are µ, k and n, the mean direction, the concentration parameter and the sample size. The algorithm given below generates vectors from the mean direction (0, . . . , 0, 1) and then, using the rotation matrix (10.1), we transform the vectors so that they have the desired mean direction. This algorithm works for arbitrary q in Sq.
4. $x_0 = \frac{1-b}{1+b}$
5. $m = \frac{p-1}{2}$
6. $c = k x_0 + (d-1) \log \left( 1 - x_0^2 \right)$
8. for i in 1 : n
• $t = -1000$
• $u = 1$
• while $\left( t - c < \log (u) \right)$
– Generate z from Beta (m, m) and u from U (0, 1)
– $w = \frac{1 - (1+b) z}{1 - (1-b) z}$
– $t = k w + (p-1) \log \left( 1 - x_0 w \right)$
9. Generate $\mathbf{v}_1$ from $N_{p-1} \left( \mathbf{0}, \mathbf{I}_{p-1} \right)$
10. $\mathbf{v} = \frac{\mathbf{v}_1}{\left\| \mathbf{v}_1 \right\|}$. This is a uniform $p-1$ dimensional unit vector
11. $S[i, ] = \left( \sqrt{1 - w^2} \, \mathbf{v}, \; w \right)$
12. Calculate the rotation matrix $\mathbf{A}$ using (10.1) in order to rotate the initial mean direction from ini to $\boldsymbol{\mu}$.
13. $\mathbf{X} = \mathbf{A} \mathbf{S}$. The X comes from a von Mises-Fisher distribution with concentration parameter k and mean direction $\boldsymbol{\mu}$.
The R code given below is a bit slower than the function found in Hornik and Grün (2014), but it still sees the job through and you can see what the algorithm does.
w <- numeric(n)
m <- 0.5 * d1
ca <- k * x0 + (d - 1) * log(1 - x0^2)
for (i in 1:n) {
ta <- -1000
u <- 1
while ( ta - ca < log(u) ) {
z <- rbeta(1, m, m)
u <- runif(1)
w[i] <- ( 1 - (1 + b) * z ) / ( 1 - (1 - b) * z )
ta <- k * w[i] + d1 * log(1 - x0 * w[i])
}
}
S <- cbind(v, w)
A <- rotation(ini, mu) ## calculate the rotation matrix
## in order to rotate the initial mean direction from ini to mu
x <- tcrossprod(S, A) ## the x has direction mu
} else { ## uniform distribution
## requires MASS if k = 0
x1 <- matrix( RcppZiggurat::zrnorm(n * d), ncol = d )
x <- x1 / sqrt( Rfast::rowsums(x1^2) )
}
x
}
Kent et al. (2013) proposed the angular central Gaussian (ACG) distribution (Tyler, 1987) as an envelope distribution in the rejection sampling algorithm for generating random values from a Bingham distribution. The Bingham distribution on the (hyper)sphere $S^{q-1}$ is written as
\[
f_{bing} (\mathbf{x}) = c_{bing} \, e^{-\mathbf{x}^T \mathbf{A} \mathbf{x}} = c_{bing} \, f_{bing}^* (\mathbf{x}),
\]
where $c_{bing}$ is the normalizing constant and $\mathbf{A}$ is a $q \times q$ symmetric matrix. The density of the angular central Gaussian distribution is
\[
f_{ACG} (\mathbf{x}) = c_{ACG} \, f_{ACG}^* (\mathbf{x}),
\]
where $c_{ACG} = \frac{\Gamma (q/2)}{2 \pi^{q/2}} \left| \boldsymbol{\Omega} \right|^{1/2}$ is the normalizing constant and $f_{ACG}^* (\mathbf{x}) = \left( \mathbf{x}^T \boldsymbol{\Omega} \mathbf{x} \right)^{-q/2}$. To simulate a random value from the ACG one has to generate a random value from a multivariate normal and then normalize it so that it becomes a unit vector. If $\mathbf{y} \sim N_q (\mathbf{0}, \boldsymbol{\Sigma})$, then $\mathbf{x} = \frac{\mathbf{y}}{\left\| \mathbf{y} \right\|}$ follows an $ACG (\boldsymbol{\Omega})$ with $\boldsymbol{\Omega} = \boldsymbol{\Sigma}^{-1}$.
Before we explain the algorithm of how to simulate from the Bingham distribution, we will mention a few tricks. First, we obtain the eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_q$ of the symmetric matrix $\mathbf{A}$. Then we subtract the smallest eigenvalue from all of them and thus we have $\lambda_1' \geq \lambda_2' \geq \ldots \geq \lambda_q' = 0$. Then we form the diagonal matrix $\boldsymbol{\Lambda}' = \mathrm{diag} \left\{ \lambda_1', \ldots, \lambda_q' \right\}$. As Fallaize and Kypraios (2014) mention, if $\mathbf{x}$ comes from a Bingham with matrix parameter $\mathbf{A}$, then $\mathbf{y} = \mathbf{x} \mathbf{V}$ comes from a Bingham with matrix parameter $\boldsymbol{\Lambda}$, where this matrix comes from the spectral decomposition $\mathbf{A} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^T$.
The next code simulates observations from a Bingham distribution with a diagonal matrix parameter, say $\boldsymbol{\Lambda}'$. The input eigenvalues are the $q-1$ non zero eigenvalues $\lambda_i'$ for $i = 1, \ldots, q-1$. So, if you right multiply the matrix containing the simulated values by $\mathbf{V}^T$, the transformed matrix contains the simulated values from a Bingham with matrix parameter $\mathbf{A}$.
Only the constant changes: in fact, if we subtract or add the same scalar to all eigenvalues, the constant is multiplied or divided, respectively, by the exponential of that scalar.
One more key thing we have to highlight is that this distribution is used for modelling axial data. This is because it has the so-called antipodal symmetry: the direction is not important, in other words the sign is irrelevant, in contrast to the von Mises or the von Mises-Fisher distribution. Thus, $f_{bing} (\mathbf{x}) = f_{bing} (-\mathbf{x})$.
The steps describing the rejection sampling in order to simulate from a Bingham distribution are a combination of Kent et al. (2013) and Fallaize and Kypraios (2014).
Christopher Fallaize and Theo Kypraios from the University of Nottingham have provided the following R code for simulating from a Bingham distribution. They have set b = 1; even though it is not the optimal solution, as they say, it works well in practice.
## lam are the q - 1 non zero eigenvalues
lam <- sort(lam, decreasing = TRUE) ## sort the eigenvalues in desceding order
nsamp <- 0
X <- NULL
lam.full <- c(lam, 0)
qa <- length(lam.full)
mu <- numeric(qa)
sigacginv <- 1 + 2 * lam.full
SigACG <- diag( 1 / ( 1 + 2 * lam.full ) )
Ntry <- 0
The next function is more general than the previous one: it accepts a non-diagonal symmetric matrix parameter A and it calls the previous function.
rbingham <- function(n, A) {
Ntry <- 0
## the x contains the simulated values
tcrossprod(x, V) ## simulated data
Kent et al. (2013) mentions that the Fisher-Bingham distribution (10.6) can be bounded by a Bingham density
\[
f_{FB}^* (\mathbf{x}) \leq e^{\kappa - \mathbf{x}^T \mathbf{A}^{(1)} \mathbf{x}} = e^{\kappa} \, e^{-\mathbf{x}^T \mathbf{A}^{(1)} \mathbf{x}}, \tag{10.15}
\]
where $\mathbf{A}^{(1)} = \mathbf{A} + (\kappa/2) \left( \mathbf{I}_q - \boldsymbol{\mu} \boldsymbol{\mu}^T \right)$. The story now is more or less known. Initially we use rejection sampling to generate from this Bingham distribution (see the functions f.rbing and rbingham in the previous section). Then, we use rejection sampling again to see which of the simulated values we will keep: those for which the inequality (10.15) holds true.
But, initially, we simulate from a Fisher-Bingham with mean direction equal to (0, 1, 0)
and then we rotate the data (rotation function) such that the mean is where we want it to be.
The next function does something not very clever but at least fast enough. It generates
5 times the requested sample (n) from a Bingham distribution and then sees how many of
them are accepted as coming from the Fisher-Bingham distribution. I assume the accepted
ones will be more than n and so then it randomly selects n of them. Two rejection samplings
take place and that is why I did this. Below is the old code.
eig <- eigen(A1)
lam <- eig$values
V <- eig$vectors
lam <- lam - lam[q]
lam <- lam[-q]
x1 <- matrix( 0, n, 3 )
i <- 1
while (i <= n) {
x <- f.rbing(1, lam)$X ## Chris and Theo’s code
x <- tcrossprod(x, V) ## simulated data
u <- log( runif(1) )
ffb <- k * x[, 2] - sum( x %*% A * x )
fb <- k - sum( x %*% A1 * x )
if ( u <= c(ffb - fb) ) {
x1[i, ] <- x
i <- i + 1
}
}
Can we make it faster? Yes we can. When calling f.rbing(1, lam) we can change it to simulate more than one value. Using a for loop to simulate values one by one is slow. That is why we simulate many values and check how many of them are accepted. We then simulate more values and check how many are accepted, until we sample the desired number of values.
A1 <- A + k/2 * ( diag(q) - m0 %*% t(m0) )
eig <- eigen(A1)
lam <- eig$values
V <- eig$vectors
lam <- lam - lam[q]
lam <- lam[-q]
x <- f.rbing(n, lam, fast = TRUE)$X ## Chris and Theo’s code
x <- tcrossprod(x, V) ## simulated data
u <- log( runif(n) )
ffb <- k * x[, 2] - Rfast::rowsums( x %*% A * x )
fb <- k - Rfast::rowsums( x %*% A1 * x )
x1 <- x[u <= c(ffb - fb), ]
n1 <- dim(x1)[1]
If we want to simulate from a Kent distribution then we have to use the rfb function we saw in Section 10.8.3. The point is to suitably fix the parameters µ and A of (10.14). So, for a concentration parameter κ and an ovalness parameter β, we would have to specify the A matrix (which basically carries the ovalness parameter) and call
rfb(n, k, m, A)
Try this with some values of µ, κ and β and then use the kent.mle function above to see the estimates of κ and β.
10.8.4 Simulation of random values from a von Mises-Fisher mixture model
In order to simulate values from a mixture model, we need the sample size, the mixing probabilities, and the mean vector and concentration parameter of each population. The rvmf function will prove useful.
We provide a simple function to produce contour plots of the von Mises-Fisher distribution. Georgios Pappas from the University of Nottingham made this possible. He explained the idea to me and all I had to do was write the code. The von Mises-Fisher distribution needs two arguments, a mean direction (µ) and a concentration parameter (κ). Similar to other distributions, the mean direction is not really important: the shape will not change if the mean direction changes. So we only need the concentration parameter. Since this distribution is rotationally symmetric about its mean, the contours will be circles. Rotational symmetry is the analogue of a multivariate normal with equal variance in all the variables and zero correlations. In other words, the covariance matrix is a scalar multiple of the identity matrix.
We rewrite the density as we saw it in (10.2), excluding the constant terms, for convenience purposes
\[
f_p (\mathbf{x}; \boldsymbol{\mu}, \kappa) \propto \exp \left( \kappa \boldsymbol{\mu}^T \mathbf{x} \right).
\]
We need a plane tangent to the sphere exactly at the mean direction. The inner product of a unit vector with the mean direction, which appears in the exponent of the density (10.2), corresponds to an angle θ. So, for points on the tangent plane, we calculate this angle every time and then calculate the density (which needs only κ now). If you did not understand this, ask a physicist; they do angles and know of manifolds in general.
Let us see this graphically now. See Figure 10.1 below. Suppose this is one slice of a
quarter of a sphere. We have a point on the sphere (A) and want to project it onto the
tangent plane. The plane is tangent to the mean direction which is the black vertical line, the
segment OB. What we want to do now, is flatten the sphere (or peel off if you prefer), so that
the point A touches the plane. The green line is the arc, OA, and the point A” on the plane
corresponds to A on the sphere. The important feature here is that the length of OA and the
length of OA” are the same. So we projected the point A on the plane in such a way that
it’s arc length from the mean direction remains the same on the plane. How much is this arc
length? The answer is equal to θ radians, where θ is the angle formed by the two radii, OB
and BA.
The other case is when we project the chord of the sphere (red line) onto the plane and
in this case the point A on the sphere corresponds to point A’ on the tangent plane. In this
case, the length of OA and OA’ are the same. I believe the colours will help you identify the
relation between the point on the circle and on the tangent plane.
The mean direction is not important, but the angle between a point on the sphere and
its mean direction is, and we only need the concentration parameter to define our contour
plots. This is similar to the univariate case, where only the distance between the points and the mean matters, not the mean itself, and the variance then determines the kurtosis of the distribution. So, here only the angle between the observations and the mean direction is important. Thus, in the plane we take lots of points and we calculate the angles from the mean direction every time. The concentration parameter is what affects what we see.
In this case, the von Mises-Fisher distribution, the contour plots will always be circles,
because this distribution is the analogue of an isotropic multivariate normal (no correlation
and all variances equal). The higher the concentration parameter κ is, the more gathered the
circles are, and so on. Let us highlight that we peeled off the sphere here (i.e. used the green
line in Figure 10.1).
Figure 10.1: A slice of a quarter of the sphere along with a chord and an arc. The red and green lines indicate the projection of the point on the sphere onto the tangent plane.
The Kent distribution, as we saw it in (10.11), has the following formula on the sphere
\[
f (\mathbf{x}) = c (\kappa, \beta)^{-1} \exp \left\{ \kappa \boldsymbol{\alpha}_1^T \mathbf{x} + \beta \left[ \left( \boldsymbol{\alpha}_2^T \mathbf{x} \right)^2 - \left( \boldsymbol{\alpha}_3^T \mathbf{x} \right)^2 \right] \right\}.
\]
The parameters κ and β are the two arguments necessary for the construction of the
contour plots, since as we said in the case of the von Mises-Fisher distribution, the mean
direction is not important, but the angle between it and the points is. As for the two other
terms in the exponential, they are also expressed in terms of angles (see also Kent 1982). Let
us only say that in this case we used the projection described using the red line in Figure
10.1.
We will mention two more things: first, this function requires the fb.saddle function (whenever the Kent distribution is involved actually); secondly, note that when $2\beta < \kappa$ the distribution is unimodal, as Kent (1982) mentions. If the opposite is true, then the distribution is bimodal and has some connections with the Wood distribution (Wood, 1982).
This function is different from the previous one. Suppose you have data, a matrix with two columns, the first being the latitude and the second the longitude (you can also have a matrix with three columns, which is transformed into the spherical coordinates internally). We apply the euclid function to transform the data into Euclidean coordinates, then fit a Kent distribution (kent.mle function) and estimate the desired parameters and axes. Then, for a grid of points (latitude and longitude), we calculate the density at these points and then plot the contours and the points. Note that this is not the plot of the Lambert projected data. We do the same thing in the contour plot of a von Mises-Fisher kernel and in the contour plot of mixtures of von Mises-Fisher distributions.
for (i in 1:n) {
for (j in 1:n) {
y <- euclid( c(x1[i], x2[j]) )
can <- -ckb + k * y %*% G[, 1] + b * (y %*% G[, 2])^2 -
b * (y %*% G[, 3])^2
if ( abs(exp( can) ) < Inf ) {
mat[i, j] <- exp(can)
} else mat[i, j] <- NA
}
}
10.9.4 Contour plots of a von Mises-Fisher kernel density estimate on the sphere
We have seen how to construct the kernel density estimate of spherical data using a von
Mises-Fisher kernel (Section 10.7.3). Given that we have a matrix with two columns, latitude
and longitude, the goal is to plot these two columns and also see some kernel contour plots.
The idea is simple, at first we transform the data into the euclidean coordinates (see euclid
function about this) and then we choose the bandwidth parameter h either using maximum
likelihood cross validation or by the rule of thumb suggested by Garcı́a-Portugués (2013).
Once we decide on the value of h, we need a grid of points at which we calculate the kernel
density estimate using the sample. Finally, the built-in R function contour shows the contour plots. So, unlike the two previous functions, where the contour plots appear with no data, the following R code plots the data and shows their kernel contour plots.
if (thumb == "none") {
h <- as.numeric( vmfkde.tune_2(x, low = 0.1, up = 1)[1] )
}
if (thumb == "rot") {
k <- vmf(x)$kappa
h <- ( (8 * sinh(k)^2) / (k * n * ( (1 + 4 * k^2) * sinh(2 * k) -
328
2 * k * cosh(2 * k)) ) )^(1/6)
}
n <- 100 ## n1 and n2 specify the number of points taken at each axis
x1 <- seq(min(u[, 1]) - 5, max(u[, 1]) + 5, length = n) ## latitude
x2 <- seq(min(u[, 2]) - 5, max(u[, 2]) + 5, length = n) ## longitude
cpk <- 1 / ( ( h^2)^0.5 *(2 * pi)^1.5 * besselI(1/h^2, 0.5) )
mat <- matrix(nrow = n, ncol = n)
for (i in 1:n) {
for (j in 1:n) {
y <- euclid( c(x1[i], x2[j]) )
a <- as.vector( tcrossprod(x, y / h^2) )
can <- mean( exp(a + log(cpk)) )
if (abs(can) < Inf) {
mat[i, j] <- can
} else mat[i, j] <- NA
}
}
contour(x1, x2, mat, nlevels = 10, col = 2, xlab = "Latitude",
ylab = "Longitude")
points(u[, 1], u[, 2])
}
The next function produces contour plots of the von Mises-Fisher mixture model for spherical
data. The data must be in two columns, latitude and longitude respectively, and we also need
an object carrying the results of the fitted mixture model.
tmu <- t(mu)
k <- mod$param[, 4]  ## concentration parameters
p <- mod$param[, 5]  ## mixing probabilities
g <- length(p)  ## how many clusters
lika <- con <- numeric(g)
for (l in 1:g) {
  ## logarithm of the normalising constant of each component
  con[l] <- 0.5 * log(k[l]) - 1.5 * log(2 * pi) -
    log( besselI(k[l], 0.5, expon.scaled = TRUE) ) - k[l]
}
for (i in 1:n1) {
  for (j in 1:n2) {
    # x <- c( cos(x1[i]) * cos(x2[j]), cos(x1[i]) * sin(x2[j]), sin(x2[j]) )
    x <- euclid( c(x1[i], x2[j]) )
    lika <- con + k * ( x %*% tmu )  ## log-density of each component
    can <- sum( p * exp(lika) )  ## mixture density at the grid point
    if ( abs(can) < Inf ) {
      mat[i, j] <- can
    } else  mat[i, j] <- NA
  }
}
contour(x1, x2, mat, nlevels = 8, col = 4, xlab = "Latitude",
  ylab = "Longitude")
points(u[, 1], u[, 2])
}
There are not many papers on discriminant analysis for directional data. We will use the von
Mises-Fisher distribution to perform this analysis (Morris and Laycock, 1974), similarly to the
multivariate (or univariate) normal in $\mathbb{R}^p$. The idea is simple. For each group we
estimate the mean vector and the concentration parameter and then the density of an observation
is calculated for each group. The group for which the density has the highest value is the group
to which the observation is allocated. We saw the form of the von Mises-Fisher density in (10.2).
To avoid any computational overflow stemming from the Bessel function we will use the
logarithm of the density, and that will be the discriminant score
$$\delta_i = \left( \frac{p}{2} - 1 \right) \log \kappa_i + \kappa_i \mathbf{z}^T \boldsymbol{\mu}_i - \frac{p}{2} \log\left(2\pi\right) - \log I_{p/2-1}\left(\kappa_i\right),$$
for $i = 1, \ldots, g$, where $g$ is the number of groups, $\kappa_i$ and $\boldsymbol{\mu}_i$ are the concentration parameter
and mean direction of the $i$-th group and $\mathbf{z}$ is an observation in $S^{p-1}$. At first we have to
see how well the method does. For this we have created the next R function to estimate the
error via cross validation.
for (i in 1:R) {
  mat <- matrix(nrow = frac, ncol = g)
  est <- numeric(frac)
  nu <- sample(1:n, frac)  ## indices of the test sample
  test <- x[nu, ]
  id <- ina[-nu]  ## groups of the training sample
  train <- x[-nu, ]
  for (j in 1:g) {
    da <- vmf(train[id == j, ])  ## estimates the parameters of the vMF
    mesi[j, ] <- da$mu  ## mean direction of each group
    k[j] <- da$kappa  ## concentration of each group
  }
  for (j in 1:g) {
    ## discriminant score of every test observation for group j
    mat[, j] <- (p/2 - 1) * log(k[j]) + k[j] * test %*% mesi[j, ] - 0.5 *
      p * log(2 * pi) - log( besselI(k[j], p/2 - 1, expon.scaled = TRUE) ) - k[j]
  }
  est <- Rfast::rowMaxs(mat)  ## group with the highest score
  per[i] <- sum( est == ina[nu] ) / frac  ## proportion of correct classification
}
nu <- nrow(xnew)
mat <- matrix(nrow = nu, ncol = g)
est <- numeric(nu)
for (j in 1:g) {
  da <- vmf(x[ina == j, ])  ## estimates the parameters of the vMF
  mesi[j, ] <- da$mu  ## mean direction
  k[j] <- da$kappa  ## concentration
}
for (j in 1:g) {
  ## discriminant score of every new observation for group j
  mat[, j] <- (p/2 - 1) * log(k[j]) + k[j] * xnew %*% mesi[j, ] - 0.5 * p *
    log(2 * pi) - log( besselI(k[j], p/2 - 1, expon.scaled = TRUE) ) - k[j]
}
Rfast::rowMaxs(mat)  ## allocate to the group with the highest score
}
We will use the angular distance we saw for compositional data (8.39) in the k-NN algorithm.
The angular distance between $\mathbf{x}$ and $\mathbf{y} \in S^{d-1}$ is defined as
$$D\left(\mathbf{x}, \mathbf{y}\right) = \cos^{-1}\left(\mathbf{x}^T \mathbf{y}\right).$$
The function below is used to allocate new observations to some known groups for a
given number of nearest neighbours.
n <- dim(x)[1]  ## sample size
ina <- as.numeric(ina)  ## makes sure ina is numeric
nc <- max(ina)  ## the number of groups
nu <- nrow(xnew)
apo <- tcrossprod(x, xnew)
apo <- acos(apo)  ## angular distances between the data and the new observations
g <- numeric(nu)
ta <- matrix(nrow = nu, ncol = nc)
if (type == "NS") {
  ## Non standard algorithm
  for (m in 1:nc) {
    dista <- apo[ina == m, ]
    dista <- Rfast::sort_mat(dista)  ## sort the distances within each column
    if ( mesos ) {
      ta[, m] <- Rfast::colmeans( dista[1:k, ] )  ## arithmetic mean of the k nearest
    } else  ta[, m] <- k / Rfast::colsums( 1 / dista[1:k, ] )  ## harmonic mean
  }
  g <- Rfast::rowMins(ta)  ## group with the smallest mean distance
} else {
  ## Standard algorithm
  for (l in 1:nu) {
    xa <- cbind(ina, apo[, l])
    qan <- xa[order(xa[, 2]), ]  ## order the groups by distance
    sa <- qan[1:k, 1]  ## groups of the k nearest neighbours
    tab <- table(sa)
    g[l] <- as.integer( names(tab)[ which.max(tab) ] )  ## most frequent group
  }
}
g
}
In order to select the number of nearest neighbours, we apply an m-fold cross validation
and estimate the selection bias using the TT estimate of bias.
## M is the number of folds, set to 10 by default
## A is the maximum number of neighbours to use
## ina indicates the groups, numerical variable
## type is either 'S' or 'NS'. Should the standard k-NN be used or not?
## If mesos is TRUE, then the arithmetic mean distance of the k nearest
## points will be used.
## If not, then the harmonic mean will be used. Both of these apply for
## the non-standard algorithm, that is when type = 'NS'.
if ( is.null(mat) ) {
  nu <- sample( 1:n, min(n, round(n / M) * M) )
  ## It may be the case that this new nu is not exactly the same
  ## as the one specified by the user. When folding nu into a matrix
  ## a warning message could appear, so it is suppressed.
  mat <- suppressWarnings( matrix(nu, ncol = M) )
} else  mat <- mat
M <- dim(mat)[2]
per <- matrix(nrow = M, ncol = A - 1)
rmat <- dim(mat)[1]
id <- as.vector( ina[ mat[, vim] ] )  ## groups of the test sample
ina2 <- as.vector( ina[ -mat[, vim] ] )  ## groups of the training sample
aba <- as.vector( mat[, vim] )
aba <- aba[aba > 0]
apo <- dis[-aba, aba]  ## distances between the training and the test sample
ta <- matrix(nrow = rmat, ncol = ng)
if (type == "NS") {
  ## Non standard algorithm
  for ( j in 1:(A - 1) ) {
    knn <- j + 1
    for (l in 1:ng) {
      dista <- apo[ina2 == l, ]
      dista <- Rfast::sort_mat(dista)
      if ( mesos ) {
        ta[, l] <- Rfast::colmeans( dista[1:knn, ] )
      } else {
        ta[, l] <- knn / Rfast::colsums( 1 / dista[1:knn, ] )
      }
    }
    g <- Rfast::rowMins(ta)
    per[vim, j] <- sum( g == id ) / rmat  ## accuracy of this fold for knn neighbours
  }
} else {
  ## Standard algorithm
  g <- numeric(rmat)
  for ( j in 1:(A - 1) ) {
    knn <- j + 1
    for (k in 1:rmat) {
      xa <- cbind(ina2, apo[, k])
      qan <- xa[order(xa[, 2]), ]
      sa <- qan[1:knn, 1]
      tab <- table(sa)
      g[k] <- as.integer( names(tab)[ which.max(tab) ] )
    }
    per[vim, j] <- sum( g == id ) / rmat  ## accuracy of this fold for knn neighbours
  }
}
}
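The fragment above only fills the matrix per with the estimated per-fold accuracies. A hedged
sketch of how the TT estimate of bias mentioned earlier could then be computed from per is the
following; the formula is my reading of the TT estimator, not code taken from the document.
## A hedged sketch of the TT bias estimate, assuming per is the M x (A - 1)
## matrix of per-fold accuracies filled in above (column j corresponds to
## j + 1 nearest neighbours).
perf <- colMeans(per)                             ## average accuracy for each number of neighbours
best <- which.max(perf)                           ## chosen column, i.e. k = best + 1 neighbours
bias <- mean( apply(per, 1, max) - per[, best] )  ## average optimism across the folds
perf[best] - bias                                 ## bias-corrected estimate of the accuracy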
$$h\left(\mathbf{x} | \boldsymbol{\Theta}\right) = \sum_{i=1}^{D} \pi_i f_i\left(\mathbf{x} | \boldsymbol{\theta}_i\right),$$
function and estimate its parameters.
The t-th step of the algorithm is briefly described below
$$p_{ij}^{t} = \frac{\pi_i^{t-1} f_i\left(\mathbf{x}_j | \boldsymbol{\theta}_i\right)}{\sum_{m=1}^{D} \pi_m^{t-1} f_m\left(\mathbf{x}_j | \boldsymbol{\theta}_m\right)}$$
In order to solve the equation and obtain κ, the reader is referred back to (10.3) and to the
two-step truncated Newton-Raphson solution given by Sra (2012).
Step 3. Repeat the E and M steps until the log-likelihood does not increase any more.
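A minimal sketch of the E step above is given next; the object names lika (the matrix of
log-densities $\log f_i(\mathbf{x}_j|\boldsymbol{\theta}_i)$) and prob (the current mixing
probabilities) are assumptions, chosen to resemble the code further below.
## A minimal sketch of the E step (object names assumed).
wlika <- t( t( exp(lika) ) * prob )  ## pi_i * f_i(x_j | theta_i)
rswlika <- Rfast::rowsums(wlika)     ## mixture density at each observation
pij <- wlika / rswlika               ## posterior membership probabilities p_ij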
We need some initial values to start with. For this reason, similarly to Hornik and Grün
(2014), we will start with a K-means clustering. Hornik and Grün (2014) suggest a spherical
K-means algorithm, but we use the classical K-means algorithm for Euclidean data. So, take
the predicted memberships from the output of the algorithm and calculate the $\hat{\pi}_i^0$. Then
proceed to the M step and calculate $\hat{\boldsymbol{\mu}}_i^0$ and $\hat{\kappa}_i^0$.
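A minimal sketch of this initialisation step is given below; it is assumed (not guaranteed)
that the objects w, mat and Rk correspond to those used in the code that follows.
## A minimal sketch of the K-means initialisation (assumed object names);
## x is the n x p matrix of unit vectors and g the number of clusters.
ini <- kmeans(x, g)                      ## classical K-means on the Euclidean coordinates
w <- tabulate(ini$cluster, g)            ## cluster sizes
prob <- w / nrow(x)                      ## initial mixing probabilities pi_i^0
mat <- matrix(nrow = g, ncol = ncol(x))  ## initial mean directions, one row per cluster
Rk <- numeric(g)
for (j in 1:g) {
  m1 <- Rfast::colsums( x[ini$cluster == j, ] )
  mat[j, ] <- m1 / sqrt( sum(m1^2) )     ## initial mean direction of cluster j
  Rk[j] <- sqrt( sum(m1^2) ) / w[j]      ## mean resultant length of cluster j
}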
l <- 1
mesa <- array( dim = c(g, p, 50) )
crit <- numeric(50)
cl <- matrix(nrow = n, ncol = 50)
if ( min(w) <= 3 ) {
  mess <- paste("Too many clusters to fit for this data. Try one less")
  res <- list(mess = mess, loglik = NA)
} else {
  for (j in 1:g) {
    R <- Rk[j]  ## mean resultant length of cluster j
    k <- numeric(4)
    i <- 1
    k[i] <- R * (p - R^2) / (1 - R^2)  ## starting value for kappa
    i <- 2
    apk <- Apk(p, k[i - 1])
    k[i] <- k[i - 1] - (apk - R) / ( 1 - apk^2 - (p - 1)/k[i - 1] * apk )
    while ( abs(k[i] - k[i - 1]) > 1e-07 ) {  ## Newton-Raphson iterations for kappa
      i <- i + 1
      apk <- Apk(p, k[i - 1])
      k[i] <- k[i - 1] - (apk - R) / ( 1 - apk^2 - (p - 1)/k[i - 1] * apk )
    }
    ka[j] <- k[i]  ## initial concentration parameters
    lika[, j] <- (p/2 - 1) * log(ka[j]) - 0.5 * p * log(2 * pi) -
      log( besselI(ka[j], p/2 - 1, expon.scaled = TRUE) ) - ka[j] +
      ka[j] * (x %*% mat[j, ])  ## log-density of each observation under cluster j
  }
  l <- 2
  ## Step 2
  pij <- wlika / rswlika  ## posterior membership probabilities at step 2
  w <- Rfast::colmeans(pij)  ## mixing probabilities at step 2
  for (j in 1:g) {
    m1 <- Rfast::colsums(pij[, j] * x)
    mat[j, ] <- m1 / sqrt( sum(m1^2) )  ## mean directions at step 2
    R <- sqrt( sum(m1^2) ) / sum( pij[, j] )  ## mean resultant lengths at step 2
    k <- numeric(4)
    i <- 1
    k[i] <- R * (p - R^2) / (1 - R^2)
    i <- 2
    apk <- Apk(p, k[i - 1])
    k[i] <- k[i - 1] - (apk - R) / ( 1 - apk^2 - (p - 1)/k[i - 1] * apk )
  ## Step 3 and beyond
  while ( lik[l] - lik[l - 1] > 1e-05 ) {  ## until the log-likelihood stops increasing
    l <- l + 1
    for (j in 1:g) {
      m1 <- Rfast::colsums(pij[, j] * x)
      mat[j, ] <- m1 / sqrt( sum(m1^2) )  ## mean directions at step l
      R <- sqrt( sum(m1^2) ) / sum( pij[, j] )  ## mean resultant lengths at step l
      k <- numeric(4)
      i <- 1
      k[i] <- R * (p - R^2) / (1 - R^2)
      i <- 2
      apk <- Apk(p, k[i - 1])
      k[i] <- k[i - 1] - (apk - R) / ( 1 - apk^2 - (p - 1)/k[i - 1] * apk )
rownames(param) <- paste("Cluster", 1:g, sep = " ")
res <- list(param = param, loglik = lik[l], pred = ta, runtime = runtime)
res
A good method to choose the number of components is via the BIC.
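As a hedged sketch of this idea (bic.choice and the parameter count below are mine and not
the document's bic.mixvmf function; mix.vmf is only assumed to return its log-likelihood in
the loglik component, as in the output list shown above), one could fit the mixture for an
increasing number of components and keep the one with the smallest BIC.
## A hedged sketch of choosing g via BIC.
bic.choice <- function(x, G = 5) {
  n <- dim(x)[1]
  p <- dim(x)[2]
  bic <- numeric(G)
  bic[1] <- NA  ## the single component case needs a plain vMF fit instead
  for (g in 2:G) {
    mod <- mix.vmf(x, g)
    ## free parameters: g * (p - 1) mean directions, g concentrations,
    ## g - 1 mixing probabilities
    npar <- g * (p - 1) + g + g - 1
    bic[g] <- -2 * mod$loglik + npar * log(n)
  }
  which.min(bic)  ## number of components with the smallest BIC
}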
}
where $\rho = 2 \sin(\theta/2)$. In our case the radius of the sphere is one, but if you multiply by 2 then the radius
becomes 2. So this projection corresponds to the red line in Figure 10.1.
Let us first explain the procedure. We must rotate the data so that their mean direction is the
north pole (for convenience) and then spread, open, expand the northern hemisphere so that it
becomes flat (i.e. project the points onto the plane tangent to the north pole). So, starting
from two sets of points (latitude and longitude), we move onto the sphere (Euclidean
coordinates), find their mean direction, rotate the data so that their mean direction is the
north pole, go back to latitude and longitude and then apply (10.16). For the next
two functions we need the functions euclid, rotation, vmf and euclid.inv.
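A hedged sketch of the steps that precede the fragment below is given next; the argument
order of rotation (first direction sent to the second) and the units returned by euclid.inv
(degrees) are assumptions, and u denotes the matrix of latitude and longitude in degrees.
y <- euclid(u)                    ## latitude and longitude to unit vectors
m <- vmf(y)$mu                    ## estimated mean direction
b <- c(0, 0, 1)                   ## the north pole
A <- rotation(m, b)               ## rotation matrix taking m to the north pole (assumed order)
y2 <- tcrossprod(y, A)            ## rotated data, mean direction now at the north pole
u2 <- euclid.inv(y2) * pi / 180   ## back to latitude and longitude, in radians (assumed degrees)
theta <- u2[, 1]                  ## the angle used in (10.16)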
phi <- u2[, 2]
rho <- 2 * sin(theta / 2) ## radius of the disk is sqrt(2)
z1 <- rho * cos(phi) ## x coordinate
z2 <- rho * sin(phi) ## y coordinate
cbind(z1, z2) ## the Lambert equal area projected data on the disk
}
The inverse of the Lambert projection is given by the next R function. For this one we need
the original mean direction, towards which we will rotate the data once they are brought back
onto the sphere. We reverse the projection to put the data back onto the sphere, then rotate
them from the north pole to the given mean direction and finally transform them into latitude
and longitude.
z <- as.matrix(z)
long <- ( atan(z[, 2]/z[, 1]) + pi * I(z[, 1] < 0) ) %% (2 * pi)
lat <- 2 * asin( 0.5 * sqrt( Rfast::rowsums(z^2) ) )
u <- cbind(lat, long) ## the data on the sphere in radians
u <- u * 180/pi ## from radians to degrees
y <- euclid(u) ## the data in euclidean coordinates
Log of changes from version to version
After massive public demand (one e-mail, basically) it was suggested that I add a log of the
changes in the different versions or of any changes I make. I started from version 3.9 as I do
not remember the previous changes.
98. 1/6/2021. Version 9.7. I corrected a bug in the choose.pc and made it significantly faster.
I corrected a mistake in the functions kern.reg and kernreg.tune.
97. 29/2/2020. Version 9.6. I made the functions pcr and pcr.tune faster.
96. 2/8/2019. Version 9.5. Minor changes in the code of the functions. Addition of the fast
permutation p-value for the correlation. I also updated the code in some functions,
taken from my R packages.
95. 26/3/2019. Version 9.4. Addition of a code to faster simulate from the Bingham and
related distributions.
94. 25/2/2019. Version 9.3. Small and minor corrections. Change of my e-mail address.
93. 9/4/2018. Version 9.2. Time optimisation of a few functions. Added some more func-
tions. Correction of a bug in fisher.da. The function mle.lnorm is now called mvlnorm.mle.
92. 20/12/2016. Version 9.1. I made some functions faster, namely the rmvnorm, rmvt,
rmvlaplace and diri.nr(x, type = 2). A typo in the function ppca is now corrected and this
function has been renamed into ppca.choose because the function ppca is now added. In
general, many improvements in terms of time have been carried out for many func-
tions.
91. 14/10/2016. Version 9. I made many functions faster, even up to 10 times faster, and
added an error stopping case in the function mix.vmf. If in the first step there are 3 or
fewer observations in a cluster, the process will stop. Addition of a function which pro-
duces a permutation based p-value for the hypothesis testing of correlations between
a vector and many variables. Correction of a mistake in the function rot.matrix.
90. 28/5/2016. Version 8.9. Addition of the median direction for spherical and hyper-
spherical data. Correction of a mistake in the function circ.summary. I think Mardia
and Jupp (2000) had a mistake or a misunderstanding of the lower and upper quantiles
of a distribution. I made the function rda.tune. Of course, it can still go faster, but this
will happen later. I have to change some other cross-validation functions as well. The
point is to avoid unnecessary calculations and the big thing is to avoid doing the same
calculations multiple times. During my PhD this function was so badly written that
instead of minutes (now) it was taking days. I vectorised the function mle.lnorm (now
is faster) and made many functions a bit faster. Time is measured in many functions
now. Correction of a mistake in the p-value calculation in the function sarabai. I fixed
a minor mistake in the MNV variation of the James test (function james). I added two
functions for faster calculation of the covariance and the correlation matrices. I made
the function sq.correl faster, doing less calculations. The matrix exponential is now a
general function for any full rank square matrix. The function sarabai is now about
6 times faster due to a more efficient use of linear algebra. I also made some more
functions a bit faster. Two more, faster functions for the spatial median regression
have been added.
89. 15/2/2016. Version 8.8. Addition of a newly suggested method for outlier detection
for high-dimensional data. If you use the predict.compknn or the compknn.tune functions
with the angular, the Hellinger or the taxicab distance it is now much faster. The two for
loops used in each distance are gone. I fixed some problems with the functions alfa.reg,
alfareg.tune and their dependent functions as well. I added a few things about the Wishart
distribution and the MLE of the Dirichlet-multinomial. The discriminant analysis
for multinomial data now offers the product of Poisson distributions and the
Dirichlet-multinomial distribution. Addition of the k-NN algorithm for circular and
(hyper-)spherical data. I fixed a bug in the functions knn.reg, kern.reg and ridge.reg. I
also changed the functions knnreg.tune, kernreg.tune, pcr.tune, glmpcr.tune and ridge.tune.
An m-fold cross validation is now implemented for all of them. I corrected a typo in
the function alfaknn.tune. Addition of the function comp.den which is a rather generic
function for estimating distribution parameters in compositional data. Addition of the
function fast.alfa as a very fast alternative to the function profile.
88. 11/12/2015. Version 8.7. Addition of the α-regression and the ESOV regression for
compositional data. The plot in the function multivreg is now optional. An argument
is added (TRUE or FALSE). This affects the function comp.reg, where no plot
will appear by default. A slight change in the functions mrda, pcr, pcr.tune, spml.reg
and spher.reg. mvreg.cat and some functions for regression with compositional data as
responses offer bagging (bootstrap aggregation). I added the α-regression, when the
response variable is compositional data. I changed the function for the calculation of
the spatial median using an iterative algorithm which is much much faster than when
using an optimiser such as nlm.
87. 18/10/2015. Version 8.6. Many changes have taken place and in particular debugging
of most of the functions. Addition of the parallel computing option in the knn.tune
function. The functions knn.tune, ridge.tune and mkde.tune allow now parallel compu-
tation. The compknn.tune is being used by compknntune.par which allows for parallel
computation. A slight presentation change in the compknn.tune. A function to cal-
culate the matrix of partial correlations has been added. Correction of a typo in the
ridge.multivreg function. Addition of an argument in the function cov.gen. Change of
the functions comp.compreg, compcompreg.tune and comp.reg. Robust regression
is no longer supported. I changed the output of the function multivreg and I think it is
better now. This affected the function comp.reg which was also changed. Correction
of a typo in the functions kl.compreg and ols.compreg. Going from optim to nlm I for-
got to change everything. Change of some names, from knn.tune to knnreg.tune, from
pred.knn to knn.reg, from kern.tune to kernreg.tune and from ridge.multivreg to ridge.reg.
A change in the function profile, I removed some unnecessary parentheses. The func-
tion vmkde.tune includes now an option for parallel computation. I made the function
rda.pred shorter by a couple of lines and a lot faster. When you have many observations
and many variables the time differences are clear. I removed a couple of lines from
the function pred.multinomda. The function expm had a silly mistake in the final line,
instead of expA there was expaA. Change of the title from Multivariate statistical func-
tions in R to Multivariate data analysis in R, a collection of functions. Some small changes
to the vmfda.pred and rda.pred functions. The function rda.tune offers parallel compu-
tation now. I added confidence intervals of the coefficients in the function multivreg.
A small change in the function ridge.tune. Some presentation and style changes in the
function ridge.reg. A small change of the output style in the function profile. Minor
changes in the functions mkde.tune, mkde and multivt. The function kern.reg had a bug
which is now fixed and I changed its format. I believe I made it more flexible now. The
function kernreg.tune has also changed a bit, one line was removed and another one
was slightly changed. Minor changes in the functions knnreg.tune and knn.reg. Reg-
ularised discriminant analysis and the k-NN algorithm for compositional data using
the α-transformation have been added. The α in the denominator of the α-distance
(8.20) is now with an absolute value. I never wrote it, even in my PhD thesis, be-
cause I knew it had to be positive. Correction of a typo in the function pcr.tune. Some
presentation changes in the functions compknn.tune and compknntune.par. Addition
of the principal components regression when the independent variables are composi-
tional data by employing the α-transformation. In the ridge regression the predictor
variables are now being standardised first and in the tuning of λ, the default inter-
vals are smaller now but with a smaller step as well. A small change in the output
of the functions ridge.reg, hotel2T2 and james has now taken place. Addition of the
function ridge.plot. Many functions have been made either a bit faster or super faster
than before. I removed some redundant lines, vectorised them, or used faster func-
tions. A small change in the output of the functions boot.correl, lamconf, profile and
rvonmises. I made the function mix.vmf a bit shorter. In addition, the output style is
changed. The function bic.mixvmf is now shorter. The boot.correl had a stupid mistake
with the p-value, but now it is corrected. Addition of the function glm.pcr which does
principal components regression for binary and count data. Addition of the functions
pcr.plot and glmpcr.plot for visualizing the principal components regression. Change of
the function ridge.reg. Bootstrap standard errors are now being provided only. Some
subsubsections are now included in the subsections. Change of the output style in
the functions sym.test, fishkent, fb.saddle (and of the functions that call this function),
vmf.da, vmfda.pred rayleigh, vmf.kde, spher.cor, circlin.cor, hotel2T2. Addition of the func-
tion corr.mat. Panagiotis Tzirakis helped me vectorize some of the contour plots func-
tions. A change in the function james. The estimate of the common mean in the boot-
strap case is the most appropriate one now. I changed the function alfa. The output is
now a list, so all the functions involving this had to change slightly. The benefit of this
is that the function profile is much faster now, such as twice as fast. In addition, the
function adist changed its name into alfadist since there is an already built-in function
in R called adist. I added a vectorised version of the function boot.correl and a vec-
torised version of bp.gof. I vectorised the kern.reg only for the case when the polynomial degree
equals zero. Nevertheless it is faster now. I vectorised the knn.reg as well. Addition
of the permutation based hypothesis testing for zero correlation (permcor). Correction
of the function cov.gen (very bad mistake) and change its name into rwishart. Addition
of the functions james.hotel and maovjames.hotel showing numerically the relationship
between the James tests and Hotelling’s one sample T2 test. I robustified the initial
values of the spatmed.reg function (spatial median regression). I added a second and
faster function (without parallel) to tune the bandwidth parameter in the multivariate
kernel density estimation. I will keep both of them, but change them at some point to
make them even faster. Both use 2 for loops which slows down things. A slight change
in the correl function. I changed the function rinvdir to rdir. There is an option to
specify whether Dirichlet or inverted Dirichlet values will be generated. The function
bckfold.da is now shorter. Finally I changed Fisher's linear discriminant analysis. In ad-
dition, a function to estimate its performance is now added. Addition of the function
for multivariate linear regression allowing for mixed predictors. Correction of some
typos in the function spml.reg. Addition of the function vmfkde.tune2 as a faster alterna-
tive to vmfkde.tune. A change in the vmkde.tune took place. A small improvement of the
functions rvmf, Arotation, vmf.kerncontour and mixvmf.contour. A re-ordering of some
subsections took place in Section 10. Addition of the Newton-Raphson algorithm for
estimating the parameters of a Dirichlet distribution. A small change in the function
corr.mat. A small change in the circlin.cor, many variables can be considered now.
86. 14/7/2015. Version 8.5. Addition of the classification for compositional data using
the α-transformation Section. At the moment, only the power transformation for dis-
criminant analysis using the k-NN algorithm has been added. Removal of the ro-
bust multivariate regression section (rob.multivreg function). I will do this again later.
Change of the order of some subsections in the compositional data Section. Correction
of a silly mistake in the diri.reg function. I was returning the logarithm of the φ param-
eter. I fixed it now. The output of the diri.reg and diri.reg2 functions is now corrected.
Addition of some tips for faster computations in the beginning. Addition of the E-M
algorithm explanation in the circular regression Section.
85. 25/6/2015. Version 8.4. Addition of the α-transformation, and its associated distance
and mean vector, for compositional data. The profile log-likelihood of α as a way
of choosing its values was added as well. Some structural and presentation related
changes in the Section of compositional data have taken place. I changed a bit the
dirireg function. In order for the φ parameter to be strictly positive I added an exponen-
tial term inside instead of the logical function. I took the sq.correl function regression
Section and put it in the correlation Section. Addition of the partial correlation coeffi-
cient. Addition of two brackets in a line of multinom.da function. In the description of
the algorithm for simulating data from a von Mises-Fisher distribution I had a not so
clear description of a while and for functions. Giorgos Borboudakis from the Founda-
tion of Research and Technology (FORTH) and member of the MXM group pointed it
out to me and I think I have made it a bit clearer now.
84. 27/5/2015. Version 8.3. Model based clustering for compositional data is now added,
fitting the model, contour plots and random values generation. Correction of a silly
mistake in the helm function. I have a different version and I did not spot that there was
a mistake here. If you find any more mistakes, please e-mail them to me. Correction
of a silly mistake in the rmixvmf function.
83. 18/5/2015. Version 8.2. Addition of the discrimination for multinomial data. Keep
formatting of more functions. Correction of some functions, some lines were outside
the paper size. Update of the rfb function and correction of a typo in the text following.
It was fb.sim, now it is rfb. I changed the diri.reg, diri.reg2, ols.compreg, kl.compreg func-
tions. I changed the optim function to nlm. This optimiser is faster. I changed the name
of the variables in the spher.reg function. From u and v we now have x and y respec-
tively. Change of the vmf function. At first, the tol is a bit less strict: from $10^{-8}$ it is now
set to $10^{-7}$. In addition, I have just become aware of the overflow of the besselI function
when the concentration parameter (κ) is more than 100,000. So, if the approximation
is more than 100,000, the function will return that approximation.
82. 3/5/2015. Version 8.1. Addition of the Moore-Penrose pseudo-inverse matrix. I up-
dated the two Sections where I mention the contour plots of the skew normal distribu-
tion, for Euclidean and compositional data. Update of the diri.est function. I changed
a bit the type ”prec”. I corrected a silly mistake in the mkde function.
81. 24/4/2015. Version 8.0. Addition of the contour plots for the Kent distribution fitted to
some data. Correction of a typo in the mixvmf.contour function: a "," was forgotten in
the penultimate line. I changed the fb.sim function. Now it is a bit shorter and correct.
The rotation matrix to get the preferred mean direction was missing and its new name
is rfb. Update of the kent.mle function. The estimation is more accurate now. Change
of the order in the (hyper)-spherical data Section. I made the Arotation shorter by one
line.
80. 23/4/2015. Version 7.9. Correction of a typo in Equation (10.4). The sum goes up
to n, not p. Addition of the model based clustering for mixtures of von Mises-Fisher
distributions. Functions for choosing the number of components (bic.mixvmf ), random
values simulation (rmixvmf ) and contour plots (mixvmf.contour) are added. Correction
of the previous correction, the square root in the metric for covariance matrices was not
necessary, since the square distance was mentioned in the formula of the metric. Fol-
lowing Marco Maier’s advice I changed the rep(0,R) to numeric(R), the apply(x,2,mean)
to colMeans(x) and the t(x)%*%x to crossprod(x). In addition, the quadratic form of the
normal and other expressions elsewhere have been now replaced with the squared
Mahalanobis distance. I formatted some more functions and updated some more as
well. Update of the comp.compreg and compcompreg.tune functions.
79. 21/4/2015. Version 7.8. Correction of a typo in the metric for covariance matrices
Section. The square root was missing from the formula. The function had it.
78. 16/4/2015. Version 7.8. Formatting of some more functions. Addition of an influential
diagnostic in the simple Dirichlet regression. A small change in the kl.compreg function.
I increased the tolerance value from $10^{-5}$ to $10^{-4}$. Correction of a typo in the comp.reg
and spatmed.reg functions. A small change in the output of the diri.reg and diri.reg2
functions. Correction of a silly mistake in the plot of the pcr.tune function. Correction
of a mistake in the lnorm.contours, invdir.contours and kern.contours functions. From
the transition to the new version of the functions, these mistakes occurred. Correction
of a silly mistake in the rob.multivreg function. Addition of two functions to perform
regression analysis for compositional data when the independent variables are also
compositional data. The idea is to use principal component regression.
77. 9/4/2015. Version 7.7. Formatting of some more functions. Update of the vmf.kerncontour
function, an unnecessary line was removed.
76. 4/4/2015. Version 7.7. Format of some functions. The target is that all will change
in time. This is due to Marco Maier who suggested me to do this. He introduced
the R package formatR to me, which formats my functions one by one, so the whole
process will take some time. Addition of a metric for covariance matrices. Addition
of the Hotelling’s T 2 test for repeated measures ANOVA. Change of the diri.contour,
norm.contour, t.contour and skewnorm.contour functions. Now, the option to make the
data appear exists. Change of the output style in the f, hotel2T2 and james functions.
Correction of a typo in the het.circaov and circ.cor2 functions. Correction of a typo in
the comp.spatmed function. It worked for me before, that is why I had not spotted it
before.
75. 25/3/2015. Version 7.6. Addition of the inverted Dirichlet distribution, MLE of its
parameters (invdir.est), random values generation (rinvdir) and contours plots in the
case of two dimensional data (invdir.contours). Change of the presentation style of the
Section about Distributions.
74. 19/3/2015. Version 7.5. Addition of the bivariate Poisson distribution Section. MLE of
the parameters and a goodness of fit test are included. Contour plots of this distribu-
tion are also provided. MLE of the parameters of a multivariate log-normal and con-
tour plots of the bivariate distribution are also provided. Correction of a minor typo in
the multivariate Laplace distribution. The R function is not changed. A correction of a
typo in the fb.saddle function. The smallest eigenvalue in Fallaize and Kypraios (2014)
can be equal to zero, but in Kume and Wood (2005) it has to be greater than zero. Since
this function is based upon the second paper I had to correct it. Some rewording took
place there also.
73. 12/3/2015. Version 7.4. Addition of the von Mises-Fisher kernel contour plots for
spherical data. Change of the output style in the kfold.da, bckfold.da, lamconf, ternary
and vmf.da functions.
72. 1/3/2015. Version 7.3. Addition of the kernel density estimation for circular, spher-
ical and hyper-spherical data. Some word changes in the projected bivariate normal
regression of angular data. Change of the spher.reg function. If the 3 × 3 matrix is not
a rotation matrix, i.e. its determinant is not +1, a modification is made to make it 1.
Correction of a typo in the knn.tune, kern.tune and ridge.tune functions. Addition of two
lines in the APL-trimmed mean Section.
71. 23/2/2015. Version 7.2. Addition of the spatial sign covariance matrix and some re-
wording in the spatial median Section. Change of the names rand.mvnorm and rand.mvt
to rmvnorm and rmvt respectively. The cov.gen was also slightly changed, because the
rand.mvnorm was used. Correction of the kern.reg function in the Laplacian kernel. Ad-
dition of a multivariate Laplace distribution. Random values generation and moments
estimation of its parameters. A new Section, called Distributions, was created.
70. 20/2/2015. Version 7.1. Correction of a typo in the functions norm.contour, t.contour
and skewnorm.contour. Instead of type the argument in the function was iso. Update
of the mkde function. Two rules of thumb are now added. Update of the kern.contour
and apl.mean functions as well. Correction of a typo in the best.aplmean function and
some small changes in the wording of the Section about kernel regression. Addition
of the contour plots of a kernel density estimate for compositional data when there
are three components (function comp.kerncontour). Update of the spatmed.reg function
to include standard errors of the beta parameters. Correction of the kern.reg function.
The Laplacian kernel is now correct. Some rewording in that Section took place as
well.
69. 17/2/2015. Version 7.0. Addition of the multivariate kernel density estimation and
its contour plot for a 2-dimensional dataset. The robust statistical analyses subsection
became a Section termed Robust statistics. The spatial and spatial median regression
are now included in that Section. The APL-trimmed mean is added to that Section as
well.
68. 16/2/2015. Version 6.9. Change of the names dirireg2 and diri.reg2 into dirireg and
diri.reg respectively. Increase of the maximum iterations in the functions where opti-
mization is needed. Addition of the Dirichlet regression where the precision parameter
φ is linked with the same covariates as the compositional data. Correction of a typo
inside the spat.med function.
67. 7/2/2015. Version 6.8. Addition of the multinomial logit regression for compositional
data.
66. 5/2/2015. Version 6.7. Change of a typo in the cov.gen function and in general update
of the function and of the wording in that Section.
65. 2/2/2015. Version 6.7. Addition of another test for testing the equality of concen-
tration parameters of 2 or more samples in the circular and spherical data cases only.
Correction of a typographical mistake in het.circaov function. Expansion of the comp.reg
function to allow for more types of regression. Addition of the rvonmises function for
simulating random values from the von Mises distribution using the rvmf function.
64. 29/1/2015. Version 6.6. The log-likelihood ratio test for the hypothesis of the concen-
tration parameters is deleted. The appropriate ones as described in Mardia and Jupp
(2000) will be added in the next versions.
63. 26/1/2015. Version 6.5. Addition of the meandir.test function for testing hypothesis
about the mean direction of a single sample.
62. 25/1/2015. Version 6.5. Addition of the analysis of variance Section for two or more
circular samples. Hypothesis testing for the equality of the concentration parameters
are included as well. ANOVA for hyper-spherical data is also added but no hypothesis
testing for the equality of concentration parameters. The mean resultant length is given
as an output in the vmf function. The option to plot circular data is now added in the
circ.summary function. Some bits about the von Mises distributions are also added.
The density of the von Mises is removed from the circular regression. A modified test
statistic of the Rayleigh test of uniformity is also added. A presentational change has
taken place.
61. 20/1/2015. Version 6.4. Inclusion of the harmonic mean in the k-NN regression as an
option and correction of some typographical errors.
60. 25/12/2014. Version 6.4. The sq.correl function is now added. This gives a multi-
variate analogue of the coefficient of determination in the univariate regression. A
typographical mistake in the multivariate regression is now corrected, p is the number
of independent variables.
59. 5/12/2014. Version 6.3. The multivt2 function for the estimation of the parameters of
the multivariate t distribution is now added.
58. 3/12/2014. Version 6.2. The multivt function for the estimation of the parameters of
the multivariate t distribution is now updated.
57. 26/11/2014. Version 6.2. A high dimensional two sample mean vector hypothesis test-
ing procedure is now added, function sarabai. In addition, the cov.gen is a bit changed,
corrected I would say.
56. 7/11/2014. Version 6.1. Estimation of the Dirichlet parameters takes place in one
function now, called diri.est. I combined the functions diri.mle, diri.phi and diri.ent into
one function. The user has to specify which estimation procedure they want.
55. 1/11/2014. Version 6.1 The multivariate standardization functions became one func-
tion, now called multiv.stand. The first principal component can now be added in the
ternary diagram (function ternary) should the user wish to see it. The Kullback-
Leibler divergence and the Bhattacharyya distance, between two Dirichlet distribu-
tions became one function now.
54. 31/10/2014. Version 6.1. Addition of the James test for testing the equality of more
than 2 mean vectors without assuming equality of the covariance matrices (MANOVA
without homoscedasticity). Minor changes in the functions multivreg, rob.multivreg and
comp.reg. The argument for the betas in the list became beta instead of Beta.
53. 24/10/2014. Version 6.0. Multivariate ridge regression has been added. A way for
generating covariance matrices was also added and the two functions in the Dirichlet
regression were updated. Some minor typos were corrected.
52. 13/10/2014. Version 5.9. Addition of the spatial median and of the spatial median
regression. Addition of the spatial median for compositional data as well.
51. 8/9/2014. Version 5.8. After a break we return with corrections in the functions lambert
and lambert.inv. The mistake was not very serious, in the sense that the plot will not
change much, the relevant distances will change only. But even so, it was not the
correct transformation.
50. 28/7/2014. Version 5.8. Changes in the euclid and euclid.inv functions. The transfor-
mations inside the functions were not in accordance with what is described in the text.
Some typos in the spherical-spherical regression description are now corrected.
49. 25/7/2014. Version 5.8. Typographical changes in the circular summary and in the
projected bivariate normal sections. The codes are OK, but the descriptions had typos.
48. 2/7/2014. Version 5.8. A structural change and a correction in the diri.reg function and
name change only of multivnorm to rand.mvnorm. Increase of the highest number
of degrees of freedom parameter in the multivt function and correction of a silly typo-
graphical mistake in the rand.mvnorm function. Addition of the rand.mvt for simulating
random values from a multivariate t distribution. Also a small change in the order of
some Sections. For some reason the rand.mvnorm would put the data in a matrix with
4 columns. So the result would always be a 4 dimensional normal. I corrected it now.
47. 29/6/2014. Version 5.7. A change in the rda.pred function. I made it faster by rearrang-
ing some lines internally. The function is the same. I also added the scores to appear
as outputs.
46. 26/6/2014. Version 5.7. Some morphological changes and addition of the Dirichlet
regression for compositional data. Addition of the forward search algorithm and the
contour plots of the von Mises-Fisher and Kent distributions. Georgios Pappas’ help
with the contours made them possible to appear in this document.
45. 25/6/2014. Version 5.6. Addition of the model selection process in discriminant anal-
ysis.
44. 23/6/2014. Version 5.5. A slight change in the ternary function, addition of a graph-
ical option. Changes in the Dirichlet estimation, I made them proper functions now.
Change in the multivreg function. There was a problem if there was one independent
variable with no name. I fixed a problem with the rob.multivreg function also. A minor
mistake fixed in the functions vmf.da and vmfda.pred which did not affect the outcome.
A constant term was wrong. The spher.reg function has become a bit broader now.
Compositional regression is now added.
43. 16/6/2014. version 5.4. Fixation of a silly mistake in the rbingham function. The mis-
take was in the second line of the code.
42. 13/6/2014. Version 5.4. Addition of the variance of the concentration parameter κ in
the vmf function.
40. 13/6/2014. Version 5.4. I fixed some mistakes in the functions circ.cor1, circ.cor2, cir-
clin.cor, spher.cor. The problem was that I was not drawing bootstrap re-samples under
the null hypothesis. So I removed the bootstrap. The same was true for the rayleigh
function. But in this function, I can generate samples under the null hypothesis. For
this purpose, parametric bootstrap is now implemented. In addition, the function
circ.summary changed and follows the directions of Mardia and Jupp (2000). A confi-
dence interval for the mean angle is also included now.
39. 11/6/2014. Version 5.4. Theo Kypraios spotted a mistake in the rbingham function
which has now been corrected.
38. 5/6/2014. Version 5.4. Addition of the test of Fisher versus Kent distribution on the
sphere. Some presentation changes occurred in the MLE of the von Mises-Fisher dis-
tribution section.
37. 4/6/2014: Version 5.3. Addition of the Rayleigh test of uniformity. Slight changes in
the kent.mle function regarding the presentation of the results.
36. 12/5/2014: Version 5.2. Some words added about estimating the concentration pa-
rameter in the von Mises-Fisher distribution.
35. 9/5/2014: Version 5.2. Editing of the Section about the simulation from a Bingham
distribution. More information is added to make it clearer and a new function is used
to simulate from a Bingham with any symmetric matrix parameter. A reordering of
some sections took place and also the addition of a function to simulate from a Fisher-
Bingham distribution and the Kent distribution on the sphere.
34. 8/5/2014: Version 5.1. Editing of the explanation of the function FB.saddle. I believe I
made some points more clear.
33. 7/5/2014: Version 5.1. Correction of a space mistake in the vmfda.pred function. A
line was not visible in the .pdf file. Correction of a mistake in the vmf function. The
log-likelihood was wrong.
32. 3/5/2014: Version 5.1 Addition of the parameter estimation in the Kent distribution
plus corrections of some typographical mistakes.
31. 10/4/2014: Version 5.0. Addition of the calculation of the log-likelihood value in the
von Mises-Fisher distribution and correction of typographical errors.
30. 2/4/2014: Version 5.0. Addition of the (hyper)spherical-(hyper)spherical correlation
and of the discriminant analysis for directional data using the von Mises-Fisher distri-
bution. Whenever the set.seed option appeared we made some modifications also. That
is, in the functions knn.tune, kern.tune, pcr.tune and rda.tune. Addition of the seed option
in the functions kfold.da and bckfold.da. The function fb.saddle is slightly changed. Now
the logarithm of the Fisher-Bingham normalizing constant is calculated. This change
happened to avoid computational overflow when the constant takes high values.
29. 31/3/2014: Version 4.9 Some minor changes in the functions knn.tune and kern.tune.
28. 29/3/2014: Version 4.9. Addition of Lambert's equal area projection of the sphere
onto a tangent plane. Change in the regularised discriminant analysis function. Cross
validation for tuning of its parameters is now available.
27. 26/3/2014: Version 4.8. Fix of a silly mistake in the functions knn.tune and pred.knn.
26. 24/3/2014: Version 4.8. A minor correction in the function multivreg and also a minor
change related to its presentation wording. Addition of the function rob.multivreg which
performs robust multivariate regression. Some presentation changes throughout the
document also.
25. 23/3/2014: Version 4.7. Minor change in the k-NN regression. Now it accepts either
Euclidean or Manhattan distance. Morphological change in the function correl and
change of some words in the relevant section.
24. 21/3/2014: Version 4.7. Fix of a stupid mistake in the function vmf. The mean direction
was wrongly calculated. Interchange between the sum and the square root.
23. 21/3/2014: Version 4.7. Removal of the function for Fisher type regression for angular
response variables.
22. 20/3/2014: Version 4.7. Addition of the option to set seed in the functions knn.tune,
kern.tune and pcr.tune (previously known as pcr.fold). This allows to compare the MSPE
between these three different methods.
21. 20/3/2014: Version 4.7. Change in the functions kfold.da and bckfold.da. Correction of
the confidence limits if they happen to go outside 0 or 1. In the bckfold.da I made sure
that the same test samples are always used for the values of the power parameter λ.
In this way the estimated percentage of correct classification is comparable in a fair
way. Change of the title also. A similar change took place for the function knn.tune, so
that the MSPE for every value of the bandwidth parameter h is based on the same test
samples. This change was also made in the function pcr.fold as well. Actually in the
pcr.fold this was already happening but now the user can obtain the test samples used.
The k-NN and kernel regressions accept univariate dependent variables now.
20. 18/3/2014: Version 4.6. Correction of a foolish mistake in the functions
euclid and euclid.inv. They did not handle correctly vectors and data which were not
of matrix class.
19. 17/3/2014: Version 4.6. Fix of a problem with negative eigenvalues in the Fisher-
Bingham normalizing constant.
18. 13/3/2014: Version 4.6. Addition of a second type correlation coefficient for pairs
of angular variables. The new function is circ.cor2. The old function is now called
circ.cor1 and a couple of typographical mistakes inside it are now corrected. A change
in the functions vm.reg and spml.reg. The calculation of the pseudo-R2 changed. A
change in the function circ.summary also. Minor typographical changes and removal
of a few lines in the function den.contours which do not affect the function at all.
17. 12/3/2014: Version 4.5. Fixation of a possible problem with the column names in the
multivariate regression (function multivreg). Small changes in the function itself as
well.
16. 12/3/2014: Version 4.5. Fixation of a typographical error in the description of the algo-
rithm for simulating random values from a von Mises-Fisher distribution and chang-
ing the functions euclid and euclid.inv to include the case of vectors, not only matrices.
15. 10/3/2014: Version 4.5. Addition of the circular-linear correlation coefficient. Addi-
tion of the bootstrap calculation of the p-value in the circular correlation. Fixation of a
typographical error in the function circ.summary.
14. 8/3/2014: Version 4.4 Addition of the Box-Cox transformation for discriminant analy-
sis. Expansion of the multivariate regression function multivreg. Some morphological
changes also.
13. 7/3/2014: Version 4.3. Addition of the L1 metric kernel in the kernel regression and
change of the names of the two kernel regression functions. Addition of some words
as well.
12. 6/3/2014: Version 4.2. Addition of one line for the column names in the functions
euclid and euclid.inv. Morphological changes in the Section of discrimination and minor
changes in the function kfold.da. Removal of the command library(MASS) from multivt
and den.contours.
11. 4/3/2014: Version 4.2. Addition of a function to generate from a multivariate normal
distribution. A change in the Nadaraya-Watson case of the kernel regression func-
tion. A change in the variance of the coefficients in the principal component regression
function. Addition of some words in the standardization section and in the hypothesis
testing for a zero correlation coefficient.
10. 1/3/2014: Version 4.1. Fixation of an error in the function poly.tune.
7. 17/2/2014: Version 4.0. Correction in the poly.reg function (kernel regression). Some
changes also in the introduction.
6. 16/2/2014: Version 4.0. Correction in the function pcr.fold (Cross validation for prin-
cipal component regression). Instead of BIC I use now MSPE and a correction on the
centering of the dependent variable.
3. 13/2/2014: Version 4.0. Change of the poly.tune function. The cross-validation for the
choice of the common bandwidth h is implemented by dividing the sample into test and
training sets many times. Improved cross validation. A change in the function poly.reg
also.
1. 11/2/2014: Version 3.9. Change of the Bingham random value simulation function
with the function given by Christopher Fallaize and Theo Kypraios.
References
Abramowitz, M. and Stegun, I. (1970). Handbook of mathematical functions. New York: Dover
Publishing Inc.
Agostinelli, C. and Lund, U. (2011). R package circular: Circular Statistics. R package version
0.4-3.
Agresti, A. (2002). Categorical data analysis, 2nd edition. New Jersey: John Wiley & Sons.
Aguilera, A. M., Escabias, M., and Valderrama, M. J. (2006). Using principal components for
estimating logistic regression with high-dimensional multicollinear data. Computational
Statistics & Data Analysis, 50(8):1905–1924.
Aitchison, J. (2003). The Statistical Analysis of Compositional Data. New Jersey: (Reprinted by)
The Blackburn Press.
Aliferis, C. F., Statnikov, A., Tsamardinos, I., Mani, S., and Koutsoukos, X. D. (2010). Lo-
cal causal and markov blanket induction for causal discovery and feature selection for
classification part i: Algorithms and empirical evaluation. The Journal of Machine Learning
Research, 11:171–234.
Amaral, G. J. A., Dryden, I. L., and Wood, A. T. A. (2007). Pivotal bootstrap methods for
k-sample problems in directional statistics and shape analysis. Journal of the American
Statistical Association, 102(478):695–707.
Atkinson, A. C., Riani, M., and Cerioli, A. (2004). Exploring multivariate data with the forward
search. Springer.
Azzalini, A. (2005). The skew-normal distribution and related multivariate families*. Scan-
dinavian Journal of Statistics, 32(2):159–188.
Azzalini, A. (2011). R package sn: The skew-normal and skew-t distributions (version 0.4-17).
Università di Padova, Italia.
Azzalini, A. and Valle, A. D. (1996). The multivariate skew-normal distribution. Biometrika,
83(4):715–726.
Baxter, M. J., Beardah, C. C., Cool, H. E. M., and Jackson, C. M. (2005). Compositional data
analysis of some alkaline glasses. Mathematical Geology, 37(2):183–196.
Bengtsson, H. (2014). R.matlab: Read and write of MAT files together with R-to-MATLAB con-
nectivity. R package version 2.2.3.
Boyles, R. A. (1997). Using the chi-square statistic to monitor compositional process data.
Journal of Applied Statistics, 24(5):589–602.
Breusch, T. S., Robertson, J. C., and Welsh, A. H. (1997). The emperor’s new clothes: a
critique of the multivariate t regression model. Statistica Neerlandica, 51(3):269–286.
Brown, P. J. and Zidek, J. V. (1980). Adaptive multivariate ridge regression. The Annals of
Statistics, 8(1):64–74.
Browne, R. P., ElSherbiny, A., and McNicholas, P. D. (2015). mixture: Mixture Models for
Clustering and Classification. R package version 1.4.
Casella, G. and Berger, R. L. (2002). Statistical inference. Duxbury Pacific Grove, CA.
Chen, S. X., Qin, Y.-L., et al. (2010). A two-sample test for high-dimensional data with appli-
cations to gene-set testing. The Annals of Statistics, 38(2):808–835.
Davison, A. C. and Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge
university press.
Dhillon, I. S. and Sra, S. (2003). Modeling data using directional distributions. Technical
report, Technical Report TR-03-06, Department of Computer Sciences, The University of
Texas at Austin.
Efron, B. and Tibshirani, R. J. (1993). An introduction to the bootstrap. Chapman & Hall/CRC.
Egozcue, J. J., Pawlowsky-Glahn, V., Mateu-Figueras, G., and Barceló-Vidal, C. (2003). Iso-
metric logratio transformations for compositional data analysis. Mathematical Geology,
35(3):279–300.
Eltoft, T., Kim, T., and Lee, T.-W. (2006). On the multivariate laplace distribution. Signal
Processing Letters, IEEE, 13(5):300–303.
Emerson, S. (2009). Small sample performance and calibration of the Empirical Likelihood method.
PhD thesis, Stanford university.
Endres, D. M. and Schindelin, J. E. (2003). A new metric for probability distributions. Infor-
mation Theory, IEEE Transactions on, 49(7):1858–1860.
Fallaize, C. J. and Kypraios, T. (2014). Exact bayesian inference for the bingham distribution.
arXiv preprint arXiv:1401.2894.
Fieller, E. C., Hartley, H. O., and Pearson, E. S. (1957). Tests for rank correlation coefficients.
i. Biometrika, 44(3/4):470–481.
Fieller, E. C. and Pearson, E. S. (1957). Tests for rank correlation coefficients: Ii. Biometrika,
48(1/2):29–40.
Fisher, N. (1985). Spherical medians. Journal of the Royal Statistical Society. Series B (Method-
ological), 47(2):342–348.
Fisher, N. I. and Lee, A. J. (1992). Regression models for an angular response. Biometrics,
pages 665–677.
Fisher, N. I., Lewis, T., and Embleton, B. J. J. (1987). Statistical analysis of spherical data. Cam-
bridge University Press.
Förstner, W. and Moonen, B. (2003). A metric for covariance matrices. In Geodesy-The Chal-
lenge of the 3rd Millennium, pages 299–309. Springer.
Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis and den-
sity estimation. Journal of the American Statistical Association, 97:611–631.
Fraley, C., Raftery, A. E., Murphy, T. B., and Scrucca, L. (2012). mclust Version 4 for R: Normal
Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation.
Garcı́a-Portugués, E. (2013). Exact risk improvement of bandwidth selectors for kernel den-
sity estimation with directional data. Electronic Journal of Statistics, 7:1655–1685.
Gini, C. and Galvani, L. (1929). Di talune estensioni dei concetti di media ai caratteri quali-
tativi. Metron, 8(1-2):3–209.
Goulet, V., Dutang, C., Maechler, M., Firth, D., Shapira, M., and Stadelmann, M. (2013).
expm: Matrix exponential. R package version 0.99-0.
Gregory, K. (2014). highD2pop: Two-Sample Tests for Equality of Means in High Dimension. R
package version 1.0.
Guidoum, A. C. (2015). kedd: Kernel estimator and bandwidth selection for density and its deriva-
tives. R package version 1.0.2.
Habbema, J. D. F., Hermans, J., and Van den Broek, K. (1974). A stepwise discrimination
analysis program using density estimation. In Compstat 1974: Proceedings in Computational
Statistics, Vienna.
Hadi, A. S. and Ling, R. F. (1998). Some cautionary notes on the use of principal components
regression. The American Statistician, 52(1):15–19.
Haldane, J. B. S. (1948). Note on the median of a multivariate distribution. Biometrika, 35(3-
4):414–417.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The elements of statistical learning: data
mining, inference, and prediction. Springer, Berlin.
Hijazi, R. H. (2006). Residuals and diagnostics in dirichlet regression. ASA Proceedings of the
General Methodology Section, pages 1190–1196.
Hornik, K. and Grün, B. (2014). movMF: An R package for fitting mixtures of von mises-
fisher distributions. Journal of Statistical Software, 58(10):1–31.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when
the ratios of the population variances are unknown. Biometrika, 41(1/2):19–43.
Jupp, P. E. (2001). Modifications of the rayleigh and bingham tests for uniformity of direc-
tions. Journal of Multivariate Analysis, 77(2):1–20.
Kärkkäinen, T. and Äyrämö, S. (2005). On computation of spatial median for robust data
mining. Evolutionary and Deterministic Methods for Design, Optimization and Control with
Applications to Industrial and Societal Problems, EUROGEN, Munich.
Karlis, D. (2003). An em algorithm for multivariate poisson distribution and related models.
Journal of Applied Statistics, 30(1):63–77.
Karlis, D. and Ntzoufras, I. (2003). Analysis of sports data by using bivariate poisson models.
Journal of the Royal Statistical Society: Series D (The Statistician), 52(3):381–393.
Kawamura, K. (1984). Direct calculation of maximum likelihood estimator for the bivariate
poisson distribution. Kodai mathematical journal, 7(2):211–221.
Kent, J. T. (1982). The fisher-bingham distribution on the sphere. Journal of the Royal Statistical
Society. Series B (Methodological), pages 71–80.
Kent, J. T., Ganeiber, A. M., and Mardia, K. V. (2013). A new method to simulate the bingham
and related distributions in directional data analysis with applications. arXiv preprint
arXiv:1310.8110.
Kim, J. and Scott, C. D. (2012). Robust kernel density estimation. The Journal of Machine
Learning Research, 13(1):2529–2565.
Krishnamoorthy, K. and Xia, Y. (2006). On selecting tests for equality of two normal mean
vectors. Multivariate Behavioral Research, 41(4):533–548.
Krishnamoorthy, K. and Yu, J. (2004). Modified Nel and Van der Merwe test for the multi-
variate Behrens-Fisher problem. Statistics & Probability Letters, 66(2):161–169.
Kullback, S. (1997). Information theory and statistics. New York: Dover Publications.
Kume, A., Preston, S., and Wood, A. T. (2013). Saddlepoint approximations for the normalizing constant of Fisher–Bingham distributions on products of spheres and Stiefel manifolds. Biometrika, 100(4):971–984.
Kume, A. and Wood, A. T. A. (2005). Saddlepoint approximations for the Bingham and Fisher–Bingham normalising constants. Biometrika, 92(2):465–476.
Ro, K., Zou, C., Wang, Z., and Yin, G. (2015). Outlier detection for high-dimensional data. Biometrika, 102(3):589–599.
Lagani, V., Kortas, G., and Tsamardinos, I. (2013). Biomarker signature identification in “omics” data with multi-class outcome. Computational and Structural Biotechnology Journal, 6(7):1–7.
Lange, K. L., Little, R. J., and Taylor, J. M. (1989). Robust statistical modeling using the t
distribution. Journal of the American Statistical Association, 84(408):881–896.
Le, H. and Small, C. G. (1999). Multidimensional scaling of simplex shapes. Pattern Recogni-
tion, 32(9):1601–1613.
Loukas, S. and Kemp, C. (1986). The index of dispersion test for the bivariate Poisson distribution. Biometrics, pages 941–948.
Lund, U. (1999). Least circular distance regression for directional data. Journal of Applied
Statistics, 26(6):723–733.
Lund, U. and Agostinelli, C. (2012). CircStats: Circular Statistics, from "Topics in Circular Statistics" (2001). R package version 0.2-4.
Mardia, K. V. and Jupp, P. E. (2000). Directional statistics. Chichester: John Wiley & Sons.
Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979). Multivariate Analysis. London: Academic
Press.
Mardia, K. V. (1972). Statistics of directional data. London: Academic Press.
Mavridis, D. and Moustaki, I. (2008). Detecting outliers in factor analysis using the forward search algorithm. Multivariate Behavioral Research, 43(3):453–475.
Minka, T. (2000). Estimating a Dirichlet distribution. Technical report, MIT.
Minka, T. P. (2001). Automatic choice of dimensionality for PCA. In Leen, T., Dietterich, T.,
and Tresp, V., editors, Advances in Neural Information Processing Systems 13, pages 598–604.
MIT Press.
Moler, C. and Van Loan, C. (2003). Nineteen dubious ways to compute the exponential of a
matrix, twenty-five years later. SIAM review, 45(1):3–49.
Möttönen, J., Nordhausen, K., Oja, H., et al. (2010). Asymptotic theory of the spatial median.
In Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis:
A Festschrift in honor of Professor Jana Jurečková, pages 182–193. Institute of Mathematical
Statistics.
Nadarajah, S. and Kotz, S. (2008). Estimation methods for the multivariate t distribution.
Acta Applicandae Mathematicae, 102(1):99–118.
Nelder, J. and Mead, R. (1965). A simplex method for function minimization. Computer Journal, 7(4):308–313.
Ng, K. W., Tian, G. L., and Tang, M. L. (2011). Dirichlet and Related Distributions: Theory,
Methods and Applications, volume 889. Chichester: John Wiley & Sons.
Nychka, D., Furrer, R., and Sain, S. (2015). fields: Tools for Spatial Data. R package version
8.2-1.
Oliveira, M., Crujeiras, R. M., and Rodrı́guez-Casal, A. (2013). NPCirc: Nonparametric Circu-
lar Methods. R package version 2.0.0.
Opgen-Rhein, R. and Strimmer, K. (2006). Inferring gene dependency networks from ge-
nomic longitudinal data: a functional data approach. Revstat, 4(1):53–65.
Österreicher, F. and Vajda, I. (2003). A new class of metric divergences on probability spaces
and its applicability in statistics. Annals of the Institute of Statistical Mathematics, 55(3):639–
653.
Papadakis, M., Tsagris, M., Dimitriadis, M., Fafalios, S., Tsamardinos, I., Fasiolo, M., Bor-
boudakis, G., Burkardt, J., Zou, C., and Lakiotaki, K. (2019). Rfast: Fast R Functions. R
package version 1.9.4.
Pawlowsky-Glahn, V., Egozcue, J. J., and Tolosana-Delgado, R. (2007). Lecture notes on compositional data analysis.
Pewsey, A., Neuhäuser, M., and Ruxton, G. D. (2013). Circular Statistics in R. Oxford Univer-
sity Press.
Presnell, B., Morrison, S. P., and Littell, R. C. (1998). Projected multivariate linear models for
directional data. Journal of the American Statistical Association, 93(443):1068–1077.
Rajan, J. and Rayner, P. (1997). Model order selection for the singular value decomposition and the discrete Karhunen–Loève transform using a Bayesian approach. IEE Proceedings - Vision, Image and Signal Processing, 144(2):116–123.
Rauber, T. W., Braun, T., and Berns, K. (2008). Probabilistic distance measures of the Dirichlet and beta distributions. Pattern Recognition, 41:637–645.
Rayleigh, L. (1919). On the problem of random vibrations, and of random flights in one,
two, or three dimensions. The London, Edinburgh, and Dublin Philosophical Magazine and
Journal of Science, 37(220):321–347.
Rayner, J. C., Thas, O., and Best, D. J. (2009). Smooth tests of goodness of fit: using R. John
Wiley & Sons.
Rivest, L.-P. (1986). Modified Kent's statistics for testing goodness of fit for the Fisher distribution in small concentrated samples. Statistics & Probability Letters, 4(1):1–4.
Robert, P. W. (1976). On the choice of smoothing parameters for Parzen estimators of probability density functions. IEEE Transactions on Computers.
Scealy, J. and Welsh, A. (2014). Fitting Kent models to compositional data with small concentration. Statistics and Computing, 24(2):165–179.
Scealy, J. L. and Welsh, A. H. (2011b). Regression for compositional data by using distribu-
tions defined on the hypersphere. Journal of the Royal Statistical Society. Series B, 73(3):351–
375.
Schaefer, J., Opgen-Rhein, R., and Strimmer, K. (2007). corpcor: Efficient estimation of covariance and (partial) correlation. R package version 1.4.7.
Schnute, J. T. and Haigh, R. (2007). Compositional analysis of catch curve data, with an application to Sebastes maliger. ICES Journal of Marine Science, 64:218–233.
Shabalin, A. A. (2012). Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics, 28(10):1353–1358.
Sharp, W. (2006). The graph median – a stable alternative measure of central tendency for compositional data sets. Mathematical Geology, 38(2):221–229.
Silverman, B. W. (1986). Density estimation for statistics and data analysis, volume 26. New
York: CRC Press.
Sra, S. (2012). A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of I_s(x). Computational Statistics, 27(1):177–190.
Statnikov, A., Aliferis, C. F., Tsamardinos, I., Hardin, D., and Levy, S. (2005). A comprehen-
sive evaluation of multicategory classification methods for microarray gene expression
cancer diagnosis. Bioinformatics, 21(5):631–643.
Stephens, M. A. (1972). Multi-sample tests for the von Mises distribution. Technical Report 190, Department of Statistics, Stanford University.
Stephens, M. A. (1982). Use of the von Mises distribution to analyse continuous proportions. Biometrika, 69(1):197–203.
Taylor, C. C. (2008). Automatic bandwidth selection for circular density estimation. Compu-
tational Statistics & Data Analysis, 52(7):3493–3500.
Tibshirani, R. J. and Tibshirani, R. (2009). A bias correction for the minimum error rate in
cross-validation. The Annals of Applied Statistics, 3(1):822–829.
Todorov, V. and Filzmoser, P. (2010). Robust statistic for the one-way MANOVA. Computational
Statistics & Data Analysis, 54(1):37–48.
Tsagris, M. (2014). The k-nn algorithm for compositional data: a revised approach with and
without zero values present. Journal of Data Science, 12(3):519–534.
Tsagris, M. (2015a). A novel, divergence based, regression for compositional data. In Pro-
ceedings of the 28th Panhellenic Statistics Conference, Athens, Greece.
Tsagris, M. (2015b). Regression analysis with compositional data containing zero values.
Chilean Journal of Statistics, 6(2):47–57.
Tsagris, M., Athineou, G., and Sajib, A. (2016a). Directional: A collection of R functions for
directional data analysis. R package version 2.1.
Tsagris, M., Elmatzoglou, I., and Frangos, C. C. (2012). The assessment of performance
of correlation estimates in discrete bivariate distributions using bootstrap methodology.
Communications in Statistics-Theory and Methods, 41(1):138–152.
Tsagris, M. and Athineou, G. (2016). Compositional: A collection of R functions for compositional data analysis. R package version 1.5.
Tsagris, M. and Papadakis, M. (2018). Taking R to its limits: 70+ tips. PeerJ PrePrints, page e26605v1.
Tsagris, M., Preston, S., and Wood, A. T. (2016b). Improved classification for compositional data using the α-transformation. Journal of Classification, To appear.
Tsagris, M. T., Preston, S., and Wood, A. T. A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain.
Tyler, D. E. (1987). Statistical analysis for the angular central Gaussian distribution on the
sphere. Biometrika, 74(3):579–589.
Van Den Boogaart, K. G. and Tolosana-Delgado, R. (2013). Analyzing Compositional Data with
R. Springer.
Varadhan, R. and Gilbert, P. (2009). BB: An R package for solving a large system of nonlinear
equations and for optimizing a high-dimensional nonlinear objective function. Journal of
Statistical Software, 32(4):1–26.
Watson, G. S. (1964). Smooth regression analysis. Sankhyā: The Indian Journal of Statistics,
Series A, 26(4):359–372.
Watson, G. S. (1983a). Large sample theory of the Langevin distribution. Journal of Statistical
Planning and Inference, 8(1):245–256.
Watson, G. S. and Nguyen, H. (1985). A Confidence Region in a Ternary Diagram from Point
Counts. Mathematical Geology, 17(2):209–213.
Wood, A. (1982). A bimodal distribution on the sphere. Applied Statistics, 31(1):52–58.
Woronow, A. (1997). The elusive benefits of logratios. In Proceedings of the 3rd Annual Con-
ference of the International Association for Mathematical Geology, Barcelona, Spain.
Yee, T. W. (2010). The VGAM package for categorical data analysis. Journal of Statistical
Software, 32(10):1–34.