Statistical tests
Introduction
Statistics is a branch of science, more precisely a branch of mathematics, that is concerned with the collection, analysis, interpretation, and presentation of data. A very similar term, statistic, denotes a single measure: a number that describes, in a concise way, some feature of a data sample. A statistic is usually a number resulting from the application of a statistical procedure. Examples of statistics include the sample mean and the standard deviation of a sample.
0.1 Mean
The mean is a location parameter measuring the central tendency of the data. It is defined as the sum of all the values divided by the number of observations:
$$\mu = \frac{\sum x}{N} \qquad (1)$$
Population mean is denoted as µ while sample mean as X̄. Here, N is the number of observations and Σx is the sum of all observed values.
The mean can be misleading when a distribution is skewed (non-symmetrical) and can be greatly influenced by outliers. To remedy the latter problem, other types of mean are used, such as the weighted mean, or means that limit the influence of extreme observations, such as the trimmed or Winsorized mean. We, however, will not discuss these here.
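In R, the mean is computed with mean(); a tiny, made-up example showing how a single outlier pulls it:
x <- c(4.1, 4.3, 3.9, 4.0, 4.2)
mean(x)          # 4.1
mean(c(x, 40))   # a single extreme value shifts the mean to about 10.1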
0.2 Median
The median is another measure of central tendency, i.e. a location parameter of a distribution. Unlike the mean, it does not take all observations into account: the median is simply the middle value in the ordered data series. If there is an odd number of data points, the median is easy to find. If there is an even number of data points, the median is defined as the mean of the two data points in the middle.
By comparing the mean with the median, we can tell something about how skewed the distribution is. For a perfectly symmetrical distribution, the median and the mean are equal. For a left-skewed distribution, the mean is pulled towards the long left tail, so the median is greater than the mean. For a right-skewed distribution, the mean is pulled towards the long right tail, so the median is less than the mean.
0.3 Variance
Population variance is denoted as σ² and sample variance as s². Variance measures the dispersion of observations around the mean: the more spread out the observations are, the higher the variance. Population variance is defined as:
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} \qquad (2)$$
while sample variance is:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \qquad (3)$$
Why are we dividing by (n − 1) instead of dividing by n? We want our sample variance s² to be as accurate an estimate of the population variance σ² as possible. Imagine that we are estimating the variance in weight of sheep in a herd (population) of 2000 individuals using a sample of 200 animals. It is very likely that our sample will miss those few light or heavy individuals which would have, otherwise, influenced our estimate of the variance quite a lot (since they give a very large (x − x̄)² term). Therefore we correct for this by using n − 1 instead of n. Observe that for very small sample sizes the subtraction of 1 has a large effect on the variance, while for large sample sizes it has a very small effect:
# Helper: uncorrected variance (divides by n instead of n - 1)
pop.var <- function(x) sum((x - mean(x))^2) / length(x)

# For N random samples of a given size, return the median corrected
# and uncorrected variance estimates
compare.var <- function(pop, N, sample.size) {
  s <- array(dim = N)      # sample variance (with n - 1 correction)
  sigma <- array(dim = N)  # uncorrected ("population") variance
  for (i in 1:N) {
    my.sample <- sample(pop, size = sample.size, replace = FALSE)
    s[i] <- var(my.sample)
    sigma[i] <- pop.var(my.sample)
  }
  result <- c(median(s), median(sigma))
  return(result)
}
# Actual simulation: pop (the population values, e.g. 2000 simulated
# sheep weights) and N (the number of replicate samples per sample
# size) are assumed to be defined earlier.
sample.sizes <- 5:100
result <- c()
for (sample.size in sample.sizes) {
  tmp <- compare.var(pop, N, sample.size)
  new <- cbind(sample.size, s = tmp[1], sigma = tmp[2])
  result <- rbind(result, new)
}
result <- as.data.frame(result)
You can perhaps see in Figure 1 that for small sample sizes, the effect of using n − 1 instead of n is much larger!
Note that, since we are squaring the numerator, variance cannot take negative values! Often, the square root of variance, called the standard deviation, is used instead of the variance itself. Standard deviation (sd, SD, std.dev.) is denoted by σ for a population and by s for a sample.
If two or more traits are measured on different scales, their variances cannot be directly compared because of the scale effect. Therefore, variables to be compared are often standardized so that they have mean x̄ = 0 and standard deviation s = 1 (for normally distributed data this gives the standard normal distribution N(0, 1)). Standardization maps the actual values into the corresponding z-scores, which tell how many standard deviations away from the mean a given observation is: z = (x − x̄)/s.
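In R, standardization can be done by hand or with scale(); a minimal sketch:
x <- c(10, 12, 9, 14, 15)
z <- (x - mean(x)) / sd(x)   # z-scores computed by hand
z
as.vector(scale(x))          # scale() gives the same result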
Figure 1: The effect of using n − 1 in the denominator of the variance estimate. Black dots – no correction, red dots – corrected. True population variance – the green dotted line.
0.4 Covariance
When looking at two variables, say x and y, it is often interesting to know
how similar the deviations from the mean of one variable are to the deviations
of the other variable. This is measured by covariance:
$$\mathrm{Cov}_{x,y} = \frac{\sum (x - \bar{x}) \cdot (y - \bar{y})}{n - 1} \qquad (4)$$
Observe that variance is the covariance of a variable with itself – substitute (y − ȳ) with (x − x̄) in (4) to get the variance. Similarly to variance, covariance is scale dependent! Positive covariance means that y increases with x; in the opposite situation, we have negative covariance. When the two variables are independent (orthogonal to each other), their covariance equals zero, although zero covariance does not by itself guarantee independence.
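A quick R check of the claim that the covariance of a variable with itself is its variance (the data here are arbitrary):
set.seed(1)
x <- rnorm(50)
cov(x, x)   # identical to...
var(x)      # ...the sample variance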
0.5 Correlation
As mentioned above, covariance is scale dependent. It can, however, easily be re-scaled to be bound between −1 and 1. This operation is analogous to standardization. Such re-scaling yields correlation:
$$\mathrm{Cor}_{x,y} = \frac{\mathrm{Cov}_{x,y}}{\sqrt{s^2_x \cdot s^2_y}} \qquad (5)$$
To see the effect of scale on covariance, look at the following example:
cov(w, x)
## [1] 3.825229
cor(w, x)
## [1] 1
cov(y, z)
## [1] 342.2959
cor(y, z)
## [1] 1
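The definitions of w, x, y and z are not shown above; a sketch of how such variables could be constructed (all names and values here are hypothetical) is:
set.seed(7)
x <- rnorm(100, mean = 10, sd = 2)
w <- 1.5 * x + 3         # w is a linear function of x (different scale)
z <- rnorm(100, mean = 10, sd = 2)
y <- 100 * z             # y is z expressed in much larger units
cov(w, x); cor(w, x)     # covariance depends on the scale, correlation does not
cov(y, z); cor(y, z)     # much larger covariance, correlation still 1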
Statistical tests
Here, we will discuss some of the most common and useful statistical tests. First, there is a very important distinction between two types of statistical tests:
• parametric tests – they assume certain parameters of the population and sample distributions; usually these distributions have to be normal and are described by parameters such as µ and σ.
• non-parametric tests – they make no (or much weaker) assumptions about the underlying distribution; examples used below include the Wilcoxon and Kruskal-Wallis tests.
One-sample tests
First, let us consider tests that deal with a single sample. The very first step is to check the normality of the sample distribution using, e.g., the Shapiro-Wilk test:
data <- rnorm(n = 100, mean = 0, sd = 1)
shapiro.test(data)
##
## Shapiro-Wilk normality test
##
## data: data
## W = 0.99178, p-value = 0.8049

# data2 is a second sample (its definition is not shown in this
# excerpt); judging by the test below, it departs from normality.
shapiro.test(data2)
##
## Shapiro-Wilk normality test
##
## data: data2
## W = 0.95623, p-value = 0.002186
Now the p-value is less than 0.05, which gives us a reason to reject the null hypothesis: we can suspect that the distribution of data2 departs from normality. Finally, let us compare the two samples using a quantile-quantile (Q-Q) plot:
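The plotting code for Figure 2 is not shown in this excerpt; a minimal sketch using qqnorm() could look like this:
par(mfrow = c(1, 2))                          # two panels side by side
qqnorm(data, main = "data");  qqline(data)    # left panel: approximately normal
qqnorm(data2, main = "data2"); qqline(data2)  # right panel: departs from normality
par(mfrow = c(1, 1))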
As the next step, one may wish to check whether the sample comes from a population characterized by a given mean µ. This can be accomplished with a simple Student's t-test (by the way, the t-statistic and the t-test were invented by William Sealy Gosset, an employee of the Guinness brewery, who devised the test to monitor the quality of stout!):
t.test(data, mu = 1.1)
Figure 2: Q-Q plots for data (left) and data2 (right). Data points on the left panel clearly follow the straight line, which is a sign of normality. This is not the case for the data points on the right panel. The Shapiro-Wilk tests confirmed this observation.
As we can see, the very small p-value lets us reject our null hypothesis H0: the sample comes from a population with mean µ = 1.1. We can also see that the 95% confidence interval is between −0.294 and 0.144, i.e. the mean of the population from which the sample comes is somewhere in this interval.
We can also ask another type of question: does our sample come from a population with a given mean when the population variance is known? To answer this question, we can use the Z-test:
# Z-based confidence interval for the mean: sigma (the known population
# standard deviation) and conf (the critical value, e.g. qnorm(0.975)
# for a 95% interval) are assumed to be defined earlier.
std.err <- sigma / sqrt(length(data))
conf.interval <- mean(data) + c(-std.err * conf, std.err * conf)
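The code above only builds the confidence interval; a sketch of the corresponding Z statistic and two-sided p-value (the hypothesized mean mu0 below is an assumption) could be:
mu0 <- 0                                               # hypothesized population mean (assumed)
z <- (mean(data) - mu0) / (sigma / sqrt(length(data)))
p.value <- 2 * pnorm(-abs(z))                          # two-sided p-value from N(0, 1)
c(z = z, p.value = p.value)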
For the non-normal sample data2, we use the non-parametric counterpart of the one-sample t-test, the Wilcoxon signed rank test:
wilcox.test(data2, mu = 3)
##
## Wilcoxon signed rank test with continuity correction
##
## data: data2
## V = 2453, p-value = 0.8058
## alternative hypothesis: true location is not equal to 3
Two-sample tests
In this section we will consider two samples. We want to know whether they come from the same population. We begin by testing whether our two samples have homogeneous variances. This can be done using Snedecor's F-test:
var.test(data, data3)
##
## F test to compare two variances
##
## data: data and data3
## F = 0.09183, num df = 99, denom df = 99, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.06178734 0.13648148
## sample estimates:
## ratio of variances
## 0.09183043
var.test(data, data4)
##
## F test to compare two variances
##
## data: data and data4
## F = 0.984, num df = 99, denom df = 99, p-value = 0.9362
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.6620735 1.4624480
## sample estimates:
## ratio of variances
## 0.983996
Since the F-test shows that the variances of data and data4 are homogeneous, we can compare their means with a two-sample t-test:
t.test(data, data4)
##
## Welch Two Sample t-test
##
## data: data and data4
## t = 0.43805, df = 197.99, p-value = 0.6618
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2125453 0.3339357
## sample estimates:
## mean of x mean of y
## 0.12984179 0.06914657
Apparently, there is no reason to think (at α = 0.05) that the two samples come from different populations. For data and data3, we cannot apply the t-test since their variances were not homogeneous. We need its non-parametric counterpart, the Mann-Whitney U test (implemented in wilcox.test):
wilcox.test(data, data3)
##
## Wilcoxon rank sum test with continuity correction
##
## data: data and data3
## W = 4935, p-value = 0.8748
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(data, data2)
##
## Wilcoxon rank sum test with continuity correction
##
## data: data and data2
## W = 44, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
When more than two samples are involved (here, the grouped data in data.new), variance homogeneity can be checked with Bartlett's test:
bartlett.test(data ~ group, data = data.new)
##
## Bartlett test of homogeneity of variances
##
## data: data by group
## Bartlett's K-squared = 288.58, df = 2, p-value < 2.2e-16
library(car)  # provides leveneTest(), an alternative check of variance homogeneity (output not shown here)
We can see that both tests give a similar answer – there is no variance homogeneity among these three samples. This implies the use of a non-parametric test to check whether all three samples come from populations with the same location. We will use the Kruskal-Wallis test:
kruskal.test(data ~ group, data.new)
##
## Kruskal-Wallis rank sum test
##
## data: data by group
## Kruskal-Wallis chi-squared = 131.96, df = 2, p-value < 2.2e-16
In this case, apparently not all the samples come from populations with the same mean.
Usefulness of χ² tests
Below, we will have a closer look at a class of very useful tests based on the χ² statistic. In general, χ² tests are used when we consider counts or proportions:
$$\chi^2 = \sum \frac{(N_{exp} - N_{obs})^2}{N_{exp}} \qquad (6)$$
The χ² distribution is also relatively simple to derive by using simulations in R. Let us assume a population with a known ratio of two classes, e.g. a population of poll respondents who answered “yes” or “no” to a particular question. We know the ratio of these categories in our population and start drawing samples of a given size from it. For each sample, we ask how well the “yes” to “no” ratio in the sample matches the “yes” to “no” ratio in the population. In the simulation below, we encode “yes” and “no” as TRUE and FALSE, respectively.
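The simulation code itself is not included in this excerpt; a minimal sketch of how the distribution in Figure 3 could be obtained (the population size, sample size and number of replicates below are assumptions) is:
set.seed(123)
pop <- sample(c(TRUE, FALSE), size = 10000, replace = TRUE)  # "yes"/"no" population, roughly 1:1
p.pop <- mean(pop)                       # actual "yes" proportion in the population
n <- 100                                 # sample size
n.rep <- 10000                           # number of simulated samples
chi2 <- numeric(n.rep)
for (i in 1:n.rep) {
  s <- sample(pop, size = n)             # draw a sample without replacement
  obs <- c(sum(s), sum(!s))              # observed "yes"/"no" counts
  expected <- c(p.pop, 1 - p.pop) * n    # counts expected from the population ratio
  chi2[i] <- sum((expected - obs)^2 / expected)   # chi-squared statistic, cf. Eq. (6)
}
hist(chi2, breaks = 50, freq = FALSE, main = "Simulated chi-squared statistic")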
Figure 3: Simulated χ² distribution for k = 2.
all the factors like other bacteria that may grow optimally at a different temperature, lack of replicates, etc. We will use a χ² test:
prop.test(30, n = 100, p = 0.25)
##
## 1-sample proportions test with continuity correction
##
## data: 30 out of 100, null probability 0.25
## X-squared = 1.08, df = 1, p-value = 0.2987
## alternative hypothesis: true p is not equal to 0.25
## 95 percent confidence interval:
## 0.2145426 0.4010604
## sample estimates:
## p
## 0.3
prop.test(55, n = 100)
##
## 1-sample proportions test with continuity correction
##
## data: 55 out of 100, null probability 0.5
## X-squared = 0.81, df = 1, p-value = 0.3681
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4475426 0.6485719
## sample estimates:
## p
## 0.55
Well, it gives them the right to say that the candidate will get from 45% to 65% of the votes…
• Site 1 – 55 out of 100 declared support.
# votes (support counts per site) and votes.tot (sample sizes per site)
# are assumed to be defined above (not shown in this excerpt).
prop.test(votes, votes.tot, correct = FALSE)
##
## 4-sample test for equality of proportions without continuity
## correction
##
## data: votes out of votes.tot
## X-squared = 4.7848, df = 3, p-value = 0.1883
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4
## 0.5500000 0.5000000 0.4550000 0.5172414
Imagine now that we roll a die repeatedly and record how often each face comes up:
• one - 7 times,
• two - 14 times,
• three - 9 times,
• four - 11 times,
• five - 15 times,
• six - 5 times.
Now, we are wondering whether the die is fair…
results <- c(7, 14, 9, 11, 15, 5)   # observed counts for faces one to six
probs <- rep(1/6, 6)                # expected probabilities for a fair die
chisq.test(x = results, p = probs)
##
## Chi-squared test for given probabilities
##
## data: results
## X-squared = 7.5574, df = 5, p-value = 0.1824
• no change
# data is now a contingency table (treatment group by outcome),
# built above (its construction is not shown in this excerpt).
chisq.test(data)
##
## Pearson's Chi-squared test
##
## data: data
## X-squared = 17.941, df = 2, p-value = 0.0001271
As we can see, we can reject H0: the outcome is independent of the group. This means there is a difference between the “treated” and the placebo group.
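The contingency table behind the output above is not shown in this excerpt. A hypothetical sketch of how such a table (treatment group by outcome) could be built and tested – all counts below are made up – is:
tab <- matrix(c(30, 45, 25,
                50, 40, 10),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("treated", "placebo"),
                              outcome = c("improved", "no change", "worse")))
chisq.test(tab)   # test of independence between group and outcome (df = 2)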
Restoring normality
Sometimes you can restore normality of your variable x by using one of the
transformations specified in this document.
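For instance, a log transformation often helps with right-skewed data; a minimal sketch (the variable here is simulated and purely illustrative):
set.seed(1)
x <- rlnorm(200, meanlog = 0, sdlog = 1)   # right-skewed (log-normal) variable
shapiro.test(x)        # normality is typically rejected
shapiro.test(log(x))   # after the log transform, x is normal by construction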
Box-Cox transformations
If the relationship between your response variable y and the predictors is non-linear, you can try to restore linearity using a transformation from the Box-Cox family. The Box-Cox class of transformations is:
$$y_i' = \begin{cases} (y_i^{\lambda} - 1)/\lambda & \text{if } \lambda \neq 0 \\ \log(y_i) & \text{if } \lambda = 0 \end{cases}$$
We can also create our own dataset and test Box-Cox on it:
library(MASS)  # provides boxcox()
# Create dataset
x <- rnorm(100, mean = 10, sd = 2)
# We will have a y ~ x^3 relation
y <- 2 * x^3
dat <- data.frame(x, y)
# Plot data points
plot(dat$x, dat$y, pch = 19, cex = 0.5)
# Estimate the best lambda (for y ~ x^3 we expect lambda close to 1/3)
bc <- boxcox(lm(y ~ x, data = dat), lambda = seq(0.2, 0.5, by = 0.1))
lambda <- bc$x[which.max(bc$y)]
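Once lambda has been estimated, the transformation from the Box-Cox formula above can be applied by hand; a short sketch:
# Apply the estimated Box-Cox transformation and re-plot: for y = 2 * x^3
# the optimal lambda is close to 1/3, so the transformed response should
# be roughly linear in x.
y.bc <- (dat$y^lambda - 1) / lambda
plot(dat$x, y.bc, pch = 19, cex = 0.5)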