Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

CH Density Estimation

Download as pdf or txt
Download as pdf or txt
You are on page 1of 15

A Handbook of Statistical Analyses

Using R
Brian S. Everitt and Torsten Hothorn
CHAPTER 7
Density Estimation: Erupting Geysers
and Star Clusters
7.1 Introduction
7.2 Density Estimation
The three kernel functions are implemented in R as shown in lines 13 of
Figure7.1. For some grid x, the kernel functions are plotted using the R
statements in lines 511 (Figure7.1).
The kernel estimator

f is a sum of bumps placed at the observations.
The kernel function determines the shape of the bumps while the window
width h determines their width. Figure7.2 (redrawn from a similar plot in
Silverman, 1986) shows the individual bumps n
1
h
1
K((xx
i
)/h), as well as
the estimate

f obtained by adding them up for an articial set of data points
R> x <- c(0, 1, 1.1, 1.5, 1.9, 2.8, 2.9, 3.5)
R> n <- length(x)
For a grid
R> xgrid <- seq(from = min(x) - 1, to = max(x) + 1, by = 0.01)
on the real line, we can compute the contribution of each measurement in x,
with h = 0.4, by the Gaussian kernel (dened in Figure7.1, line 3) as follows;
R> h <- 0.4
R> bumps <- sapply(x, function(a) gauss((xgrid - a)/h)/(n * h))
A plot of the individual bumps and their sum, the kernel density estimate

f,
is shown in Figure7.2.
7.3 Analysis Using R
7.3.1 A Parametric Density Estimate for the Old Faithful Data
R> logL <- function(param, x) {
+ d1 <- dnorm(x, mean = param[2], sd = param[3])
+ d2 <- dnorm(x, mean = param[4], sd = param[5])
+ -sum(log(param[1] * d1 + (1 - param[1]) * d2))
+ }
R> startparam <- c(p = 0.5, mu1 = 50, sd1 = 3, mu2 = 80, sd2 = 3)
R> opp <- optim(startparam, logL, x = faithful$waiting,
3
4 DENSITY ESTIMATION
1 R> rec <- function(x) (abs(x) < 1) * 0.5
2 R> tri <- function(x) (abs(x) < 1) * (1 - abs(x))
3 R> gauss <- function(x) 1/sqrt(2*pi) * exp(-(x^2)/2)
4 R> x <- seq(from = -3, to = 3, by = 0.001)
5 R> plot(x, rec(x), type = "l", ylim = c(0,1), lty = 1,
6 + ylab = expression(K(x)))
7 R> lines(x, tri(x), lty = 2)
8 R> lines(x, gauss(x), lty = 3)
9 R> legend(-3, 0.8, legend = c("Rectangular", "Triangular",
10 + "Gaussian"), lty = 1:3, title = "kernel functions",
11 + bty = "n")
Figure 7.1 Three commonly used kernel functions.
ANALYSIS USING R 5
1 R> plot(xgrid, rowSums(bumps), ylab = expression(hat(f)(x)),
2 + type = "l", xlab = "x", lwd = 2)
3 R> rug(x, lwd = 2)
4 R> out <- apply(bumps, 2, function(b) lines(xgrid, b))
1 0 1 2 3 4
0
.
0
0
0
.
0
5
0
.
1
0
0
.
1
5
0
.
2
0
0
.
2
5
0
.
3
0
0
.
3
5
x
f ^
(
x
)
Figure 7.2 Kernel estimate showing the contributions of Gaussian kernels evalu-
ated for the individual observations with bandwidth h = 0.4.
+ method = "L-BFGS-B",
+ lower = c(0.01, rep(1, 4)),
+ upper = c(0.99, rep(200, 4)))
R> opp
$par
p mu1 sd1 mu2 sd2
0.3608912 54.6121396 5.8723774 80.0934102 5.8672823
$value
[1] 1034.002
6 DENSITY ESTIMATION
R> epa <- function(x, y)
+ ((x^2 + y^2) < 1) * 2/pi * (1 - x^2 - y^2)
R> x <- seq(from = -1.1, to = 1.1, by = 0.05)
R> epavals <- sapply(x, function(a) epa(a, x))
R> persp(x = x, y = x, z = epavals, xlab = "x", ylab = "y",
+ zlab = expression(K(x, y)), theta = -35, axes = TRUE,
+ box = TRUE)
x
y
K
(
x
,

y
)
Figure 7.3 Epanechnikov kernel for a grid between (1.1, 1.1) and (1.1, 1.1).
ANALYSIS USING R 7
1 R> data("faithful", package = "datasets")
2 R> x <- faithful$waiting
3 R> layout(matrix(1:3, ncol = 3))
4 R> hist(x, xlab = "Waiting times (in min.)", ylab = "Frequency",
5 + probability = TRUE, main = "Gaussian kernel",
6 + border = "gray")
7 R> lines(density(x, width = 12), lwd = 2)
8 R> rug(x)
9 R> hist(x, xlab = "Waiting times (in min.)", ylab = "Frequency",
10 + probability = TRUE, main = "Rectangular kernel",
11 + border = "gray")
12 R> lines(density(x, width = 12, window = "rectangular"), lwd = 2)
13 R> rug(x)
14 R> hist(x, xlab = "Waiting times (in min.)", ylab = "Frequency",
15 + probability = TRUE, main = "Triangular kernel",
16 + border = "gray")
17 R> lines(density(x, width = 12, window = "triangular"), lwd = 2)
18 R> rug(x)
Gaussian kernel
Waiting times (in min.)
F
r
e
q
u
e
n
c
y
40 60 80 100
0
.
0
0
0
.
0
1
0
.
0
2
0
.
0
3
0
.
0
4
Rectangular kernel
Waiting times (in min.)
F
r
e
q
u
e
n
c
y
40 60 80 100
0
.
0
0
0
.
0
1
0
.
0
2
0
.
0
3
0
.
0
4
Triangular kernel
Waiting times (in min.)
F
r
e
q
u
e
n
c
y
40 60 80 100
0
.
0
0
0
.
0
1
0
.
0
2
0
.
0
3
0
.
0
4
Figure 7.4 Density estimates of the geyser eruption data imposed on a histogram
of the data.
8 DENSITY ESTIMATION
R> library("KernSmooth")
R> data("CYGOB1", package = "HSAUR")
R> CYGOB1d <- bkde2D(CYGOB1, bandwidth = sapply(CYGOB1, dpik))
R> contour(x = CYGOB1d$x1, y = CYGOB1d$x2, z = CYGOB1d$fhat,
+ xlab = "log surface temperature",
+ ylab = "log light intensity")
log surface temperature
l
o
g

l
i
g
h
t

i
n
t
e
n
s
i
t
y
0.2
0
.2

0
.2

0.2
0.4
0.4

0
.
6

0.6
0.8
1
1.2
1
.4


1
.
6


1
.
8


2


2
.
2

3.4 3.6 3.8 4.0 4.2 4.4 4.6
3
.
5
4
.
0
4
.
5
5
.
0
5
.
5
6
.
0
6
.
5
Figure 7.5 A contour plot of the bivariate density estimate of the CYGOB1 data,
i.e., a two-dimensional graphical display for a three-dimensional
problem.
ANALYSIS USING R 9
R> persp(x = CYGOB1d$x1, y = CYGOB1d$x2, z = CYGOB1d$fhat,
+ xlab = "log surface temperature",
+ ylab = "log light intensity",
+ zlab = "estimated density",
+ theta = -35, axes = TRUE, box = TRUE)
lo
g

s
u
r
f
a
c
e

t
e
m
p
e
r
a
t
u
r
e
l
o
g

l
i
g
h
t

i
n
t
e
n
s
i
t
y
e
s
t
i
m
a
t
e
d

d
e
n
s
i
t
y
Figure 7.6 The bivariate density estimate of the CYGOB1 data, here shown in a
three-dimensional fashion using the persp function.
$counts
function gradient
55 55
$convergence
[1] 0
Of course, optimising the appropriate likelihood by hand is not very con-
venient. In fact, (at least) two packages oer high-level functionality for esti-
10 DENSITY ESTIMATION
mating mixture models. The rst one is package mclust (Fraley etal., 2006)
implementing the methodology described in Fraley and Raftery (2002). Here,
a Bayesian information criterion (BIC) is applied to choose the form of the
mixture model:
R> library("mclust")
R> mc <- Mclust(faithful$waiting)
R> mc
'Mclust' model object:
best model: univariate, equal variance (E) with 2 components
and the estimated means are
R> mc$parameters$mean
1 2
54.62491 80.09741
with estimated standard deviation (found to be equal within both groups)
R> sqrt(mc$parameters$variance$sigmasq)
[1] 5.868075
The proportion is p = 0.36. The second package is called exmix whose func-
tionality is described by Leisch (2004). A mixture of two normals can be tted
using
R> library("flexmix")
R> fl <- flexmix(waiting ~ 1, data = faithful, k = 2)
with p = 0.36 and estimated parameters
R> parameters(fl, component = 1)
Comp.1
coef.(Intercept) 54.628701
sigma 5.895234
R> parameters(fl, component = 2)
Comp.2
coef.(Intercept) 80.098582
sigma 5.871749
We can get standard errors for the ve parameter estimates by using a
bootstrap approach (see Efron and Tibshirani, 1993). The original data are
slightly perturbed by drawing n out of n observations with replacement and
those articial replications of the original data are called bootstrap samples.
Now, we can t the mixture for each bootstrap sample and assess the vari-
ability of the estimates, for example using condence intervals. Some suitable
R code based on the Mclust function follows. First, we dene a function that,
for a bootstrap sample indx, ts a two-component mixture model and returns
p and the estimated means (note that we need to make sure that we always
get an estimate of p, not 1 p):
ANALYSIS USING R 11
R> opar <- as.list(opp$par)
R> rx <- seq(from = 40, to = 110, by = 0.1)
R> d1 <- dnorm(rx, mean = opar$mu1, sd = opar$sd1)
R> d2 <- dnorm(rx, mean = opar$mu2, sd = opar$sd2)
R> f <- opar$p * d1 + (1 - opar$p) * d2
R> hist(x, probability = TRUE, xlab = "Waiting times (in min.)",
+ border = "gray", xlim = range(rx), ylim = c(0, 0.06),
+ main = "")
R> lines(rx, f, lwd = 2)
R> lines(rx, dnorm(rx, mean = mean(x), sd = sd(x)), lty = 2,
+ lwd = 2)
R> legend(50, 0.06, lty = 1:2, bty = "n",
+ legend = c("Fitted two-component mixture density",
+ "Fitted single normal density"))
Waiting times (in min.)
D
e
n
s
i
t
y
40 50 60 70 80 90 100 110
0
.
0
0
0
.
0
1
0
.
0
2
0
.
0
3
0
.
0
4
0
.
0
5
0
.
0
6
Fitted twocomponent mixture density
Fitted single normal density
Figure 7.7 Fitted normal density and two-component normal mixture for geyser
eruption data.
12 DENSITY ESTIMATION
R> library("boot")
R> fit <- function(x, indx) {
+ a <- Mclust(x[indx], minG = 2, maxG = 2)$parameters
+ if (a$pro[1] < 0.5)
+ return(c(p = a$pro[1], mu1 = a$mean[1],
+ mu2 = a$mean[2]))
+ return(c(p = 1 - a$pro[1], mu1 = a$mean[2],
+ mu2 = a$mean[1]))
+ }
The function fit can now be fed into the boot function (Canty and Ripley,
2006) for bootstrapping (here 1000 bootstrap samples are drawn)
R> bootpara <- boot(faithful$waiting, fit, R = 1000)
We assess the variability of our estimates p by means of adjusted bootstrap
percentile (BCa) condence intervals, which for p can be obtained from
R> boot.ci(bootpara, type = "bca", index = 1)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = bootpara, type = "bca", index = 1)
Intervals :
Level BCa
95% ( 0.3041, 0.4233 )
Calculations and Intervals on Original Scale
We see that there is a reasonable variability in the mixture model, however,
the means in the two components are rather stable, as can be seen from
R> boot.ci(bootpara, type = "bca", index = 2)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = bootpara, type = "bca", index = 2)
Intervals :
Level BCa
95% (53.42, 56.07 )
Calculations and Intervals on Original Scale
for
1
and for
2
from
R> boot.ci(bootpara, type = "bca", index = 3)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
ANALYSIS USING R 13
boot.ci(boot.out = bootpara, type = "bca", index = 3)
Intervals :
Level BCa
95% (79.05, 81.01 )
Calculations and Intervals on Original Scale
Finally, we show a graphical representation of both the bootstrap distribu-
tion of the mean estimates and the corresponding condence intervals. For
convenience, we dene a function for plotting, namely
R> bootplot <- function(b, index, main = "") {
+ dens <- density(b$t[,index])
+ ci <- boot.ci(b, type = "bca", index = index)$bca[4:5]
+ est <- b$t0[index]
+ plot(dens, main = main)
+ y <- max(dens$y) / 10
+ segments(ci[1], y, ci[2], y, lty = 2)
+ points(ci[1], y, pch = "(")
+ points(ci[2], y, pch = ")")
+ points(est, y, pch = 19)
+ }
The element t of an object created by boot contains the bootstrap replica-
tions of our estimates, i.e., the values computed by fit for each of the 1000
bootstrap samples of the geyser data. First, we plot a simple density esti-
mate and then construct a line representing the condence interval. We apply
this function to the bootstrap distributions of our estimates
1
and
2
in
Figure7.8.
14 DENSITY ESTIMATION
R> layout(matrix(1:2, ncol = 2))
R> bootplot(bootpara, 2, main = expression(mu[1]))
R> bootplot(bootpara, 3, main = expression(mu[2]))
52 54 56
0
.
0
0
.
2
0
.
4
0
.
6

1
N = 1000 Bandwidth = 0.1489
D
e
n
s
i
t
y
( )
78 79 80 81 82
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8

2
N = 1000 Bandwidth = 0.111
D
e
n
s
i
t
y
( )
Figure 7.8 Bootstrap distribution and condence intervals for the mean estimates
of a two-component mixture for the geyser data.
Bibliography
Canty, A. and Ripley, B.D. (2006), boot: Bootstrap R (S-PLUS) Functions
(Canty), URL http://CRAN.R-project.org, R package version 1.2-29.
Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap,
London, UK: Chapman & Hall/CRC.
Fraley, C. and Raftery, A.E. (2002), Model-based clustering, discriminant
analysis, and density estimation, Journal of the American Statistical As-
sociation, 97, 611631.
Fraley, C., Raftery, A.E., and Wehrens, R. (2006), mclust: Model-based Clus-
ter Analysis, URL http://www.stat.washington.edu/mclust, R package
version 3.1-1.
Leisch, F. (2004), FlexMix: A general framework for nite mixture models
and latent class regression in R, Journal of Statistical Software, 11, URL
http://www.jstatsoft.org/v11/i08/.
Silverman, B. (1986), Density Estimation, London, UK: Chapman &
Hall/CRC.

You might also like