Igual-Seguí 2017 Chapter StatisticalInference
4.1 Introduction
There is not only one way to address the problem of statistical inference. In fact,
there are two main approaches to statistical inference: the frequentist and Bayesian
approaches. Their differences are subtle but fundamental:
• In the case of the frequentist approach, the main assumption is that there is a
population, which can be represented by several parameters, from which we can
obtain numerous random samples. Population parameters are fixed but they are
not accessible to the observer. The only way to derive information about these
parameters is to take a sample of the population, to compute the parameters of the
sample, and to use statistical inference techniques to make probable propositions
regarding population parameters.
• The Bayesian approach is based on a consideration that data are fixed, not the result
of a repeatable sampling process, but parameters describing data can be described
probabilistically. To this end, Bayesian inference methods focus on producing
parameter distributions that represent all the knowledge we can extract from the
sample and from prior information about the problem.
As we have said, the ultimate objective of statistical inference, if we adopt the
frequentist approach, is to produce probable propositions concerning population
parameters from the analysis of a sample. The most important classes of propositions
are as follows:
• Propositions about point estimates. A point estimate is a particular value that best
approximates some parameter of interest. For example, the sample mean and the
sample variance are point estimates of the population mean and variance.
• Propositions about confidence intervals or set estimates. A confidence interval is
a range of values that best represents some parameter of interest.
• Propositions about the acceptance or rejection of a hypothesis.
Estimates produced by descriptive statistics are not equal to the truth but they
improve as more data become available. So, it makes sense to use them as central
elements of our propositions and to measure their variability with respect to the
sample size.
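The following minimal sketch illustrates this behavior with synthetic data rather than with the chapter's dataset. The population is assumed to be Poisson with a hypothetical mean of 25, and the helper name mean_abs_error is ours. It shows the average error of the sample mean shrinking as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_MEAN = 25.0  # hypothetical population mean (assumption, not from the dataset)

def mean_abs_error(n, trials=2000):
    """Average absolute error of the sample mean over `trials` samples of size n."""
    samples = rng.poisson(TRUE_MEAN, size=(trials, n))
    return np.abs(samples.mean(axis=1) - TRUE_MEAN).mean()

# The error decreases roughly as 1 / sqrt(n).
errors = {n: mean_abs_error(n) for n in (10, 100, 1000)}
for n, err in errors.items():
    print(n, round(err, 3))
```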
4.3 Measuring the Variability in Estimates
1 http://opendata.bcn.cat/.
2 Suppose that we draw all possible samples of a given size from a given population. Suppose further
that we compute the mean for each sample. The probability distribution of this statistic is called the
mean sampling distribution.
54 4 Statistical Inference
Fig. 4.1 Empirical distribution of the sample mean. In red, the mean value of this distribution
In [2]: # population (accidents is the pandas Series defined in an earlier cell)
        df = accidents.to_frame()
        N_test = 10000
        elements = 200
        # mean array of samples
        means = [0] * N_test
        # sample generation: draw row labels with replacement
        for i in range(N_test):
            rows = np.random.choice(df.index.values, elements)
            sampled_df = df.loc[rows]  # .loc replaces the removed .ix indexer
            means[i] = sampled_df.mean()
In general, given a point estimate from a sample of size n, we define its sampling
distribution as the distribution of the point estimate based on samples of size n
from its population. This definition is valid for point estimates of other population
parameters, such as the population median or population standard deviation, but we
will focus on the analysis of the sample mean.
The sampling distribution of an estimate plays an important role in understanding
the real meaning of propositions concerning point estimates. It is very useful to think
of a particular point estimate as being drawn from such a distribution.
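As a rough illustration of the general definition, the sketch below builds the empirical sampling distribution of a different point estimate, the median. The population is a synthetic stand-in (an assumption, not the chapter's data) and the helper name sampling_distribution is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for a population (assumption): 100,000 exponential values.
population = rng.exponential(scale=10.0, size=100_000)

def sampling_distribution(statistic, n, n_samples=5000):
    """Values of `statistic` over n_samples random samples of size n."""
    return np.array([
        statistic(rng.choice(population, size=n))
        for _ in range(n_samples)
    ])

medians = sampling_distribution(np.median, n=200)
print("Population median:", np.median(population))
print("Mean of the sample medians:", medians.mean())
print("Spread of the sample medians:", medians.std())
```

Any single sample median can then be read as one draw from the distribution stored in medians.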
print('Direct estimation of SE from one sample of 200 elements:',
      est_sigma_mean[0])
print('Estimation of the SE by simulating 10000 samples of 200 elements:',
      np.array(means).std())
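The comparison printed above can be reproduced on synthetic data. This is a sketch under stated assumptions: the population is a made-up Poisson sample, and the direct estimate uses the usual formula SE = s / sqrt(n), which we assume is what the direct estimation refers to.

```python
import numpy as np

rng = np.random.default_rng(2)
population = rng.poisson(25, size=50_000)  # synthetic population (assumption)
n = 200

# Direct estimation from a single sample: SE = s / sqrt(n)
one_sample = rng.choice(population, size=n)
se_direct = one_sample.std(ddof=1) / np.sqrt(n)

# Estimation by simulation: spread of the means of many samples
sample_means = [rng.choice(population, size=n).mean() for _ in range(5000)]
se_simulated = np.std(sample_means)

print("Direct SE:", se_direct)
print("Simulated SE:", se_simulated)
```

The two values should be close, even though the first one uses a single sample of 200 elements.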
Fig. 4.2 Mean sampling distribution by bootstrapping. In red, the mean value of this distribution
In [4]: def meanBootstrap(X, numberb):
            # plain array, so that X[j] is positional even for a pandas Series
            X = np.asarray(X)
            x = [0] * numberb
            for i in range(numberb):
                # resample X with replacement, keeping the original size
                sample = [X[j]
                          for j
                          in np.random.randint(len(X), size=len(X))]
                x[i] = np.mean(sample)
            return x

        m = meanBootstrap(accidents, 10000)
        print("Mean estimate:", np.mean(m))
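The same resampling idea can be written without explicit Python loops. The sketch below is an alternative formulation, not the chapter's code: the function name mean_bootstrap_vectorized and the data values are made up for illustration.

```python
import numpy as np

def mean_bootstrap_vectorized(X, numberb, seed=None):
    """Bootstrap means: resample X with replacement numberb times."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    # Each row holds the positions of one resample of size len(X).
    idx = rng.integers(0, len(X), size=(numberb, len(X)))
    return X[idx].mean(axis=1)

data = np.array([12, 30, 25, 18, 41, 22, 27, 35, 19, 24])  # made-up values
boot_means = mean_bootstrap_vectorized(data, 10000, seed=0)
print("Mean estimate:", boot_means.mean())
print("Bootstrap SE:", boot_means.std())
```

Drawing all index positions at once trades memory (numberb × len(X) integers) for speed, which is reasonable for samples of a few hundred elements.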
A point estimate Θ, such as the sample mean, provides a single plausible value for
a parameter. However, as we have seen, a point estimate is rarely perfect; usually
there is some error in the estimate. That is why we have suggested using the standard
error as a measure of its variability.
A next logical step is to provide, instead of a single value, a plausible range of
values for the parameter. A plausible range of values for the population parameter is
called a confidence interval. We will base the definition of a confidence interval on
two ideas:
1. Our point estimate is the most plausible value of the parameter, so it makes sense
to build the confidence interval around the point estimate.
2. The plausibility of a range of values can be defined from the sampling distribution
of the estimate.
For the case of the mean, the Central Limit Theorem states that its sampling
distribution is approximately normal when the sample is large enough:
Theorem 4.1 Given a population with a finite mean μ and a finite non-zero variance
σ², the sampling distribution of the mean approaches a normal distribution with a
mean of μ and a variance of σ²/n as n, the sample size, increases.
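The theorem can be checked numerically. The sketch below assumes an exponential population (clearly non-normal, with μ = 2 and σ² = 4, values chosen only for illustration) and tracks the variance and the skewness of the sample mean as n grows.

```python
import numpy as np

rng = np.random.default_rng(3)
scale = 2.0                      # exponential population: mu = 2, sigma^2 = 4
mu, sigma2 = scale, scale ** 2

results = {}
for n in (4, 40, 400):
    means = rng.exponential(scale, size=(10000, n)).mean(axis=1)
    centered = means - means.mean()
    skewness = (centered ** 3).mean() / means.std() ** 3  # 0 for a normal
    # (empirical mean, empirical variance, theoretical variance, skewness)
    results[n] = (means.mean(), means.var(), sigma2 / n, skewness)
    print(n, results[n])
```

The empirical variance should track σ²/n, and the skewness should move towards 0, the value of a normal distribution.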
In this case, and in order to define an interval, we can make use of a well-known
result from probability that applies to normal distributions: roughly 95% of the time
our estimate will be within 1.96 standard errors of the true mean of the distribution.
If the interval spreads out 1.96 standard errors from a normally distributed point
estimate, intuitively we can say that we are roughly 95% confident that we have
captured the true parameter.
CI = [Θ − 1.96 × SE, Θ + 1.96 × SE]
In [5]: import math

        m = accidents.mean()
        se = accidents.std() / math.sqrt(len(accidents))
        ci = [m - se * 1.96, m + se * 1.96]
        print("Confidence interval:", ci)
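The constant 1.96 is the 97.5th percentile of the standard normal distribution. The sketch below recovers it with Python's standard library and generalizes the interval to other confidence levels; the helper name normal_ci and the numbers passed to it are hypothetical.

```python
from statistics import NormalDist

# 97.5th percentile of the standard normal: leaves 2.5% in the upper tail
z = NormalDist().inv_cdf(0.975)
coverage = NormalDist().cdf(1.96) - NormalDist().cdf(-1.96)
print(round(z, 2))         # 1.96
print(round(coverage, 3))  # 0.95

def normal_ci(estimate, se, confidence=0.95):
    """Interval estimate ± z × SE for a normally distributed point estimate."""
    z_value = NormalDist().inv_cdf(0.5 + confidence / 2)
    return [estimate - z_value * se, estimate + z_value * se]

ci_example = normal_ci(25.0, 0.5)          # made-up estimate and SE
ci_99 = normal_ci(25.0, 0.5, confidence=0.99)
print(ci_example, ci_99)
```

A higher confidence level yields a wider interval, since more of the sampling distribution has to be covered.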
This is how we would compute a 95% confidence interval of the sample mean
using bootstrapping:
1. Draw s resamples with replacement from the original sample, each of the same size
as the original sample, and compute the sample statistic (here, the mean) for each one.
2. Calculate the mean of your s values of the sample statistic. This process gives
you a “bootstrapped” estimate of the sample statistic.
3. Calculate the standard deviation of your s values of the sample statistic. This
process gives you a “bootstrapped” estimate of the SE of the sample statistic.
4. Obtain the 2.5th and 97.5th percentiles of your s values of the sample statistic.
ci = [np.percentile(m, 2.5), np.percentile(m, 97.5)]
print("Confidence interval:", ci)
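The bootstrap procedure above can be wrapped in one function that works for any statistic. This is a sketch, not the chapter's implementation: the name bootstrap_ci, its parameters, and the synthetic sample are assumptions made for illustration.

```python
import numpy as np

def bootstrap_ci(X, statistic=np.mean, s=10000, confidence=0.95, seed=None):
    """Bootstrapped estimate, SE and percentile interval of `statistic`."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    # s values of the statistic, one per resample drawn with replacement
    values = np.array([statistic(rng.choice(X, size=len(X))) for _ in range(s)])
    tail = 100 * (1 - confidence) / 2        # 2.5 for a 95% interval
    lower, upper = np.percentile(values, [tail, 100 - tail])
    return values.mean(), values.std(), (lower, upper)

# Made-up daily counts, used only to exercise the function.
sample = np.random.default_rng(4).poisson(25, size=200)
estimate, se, (lower, upper) = bootstrap_ci(sample, s=5000, seed=4)
print("Bootstrapped mean:", estimate)
print("Bootstrapped SE:", se)
print("95% percentile interval:", [lower, upper])
```

Passing statistic=np.median (or any other function of the sample) gives the corresponding interval without any change to the procedure.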
In 95% of the cases, when I draw a sample and compute its 95% confidence interval, the true
mean of the population will fall within the interval defined by the bounds Θ ± 1.96 × SE.
We cannot say either that the interval computed from our specific sample contains
the true parameter or that this interval has a 95% chance of containing the true
parameter. That interpretation would not be correct under the assumptions of
traditional statistics, because the parameter is fixed and only the interval varies
from sample to sample.
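This interpretation can be verified by simulation when the true mean is known. The sketch below assumes a synthetic Poisson population with a hypothetical mean of 25, builds many intervals from independent samples, and counts how often they contain the true mean.

```python
import numpy as np

rng = np.random.default_rng(5)
TRUE_MEAN, n, n_intervals = 25.0, 100, 10000   # hypothetical values

# One row per sample; one confidence interval per row.
samples = rng.poisson(TRUE_MEAN, size=(n_intervals, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)
covered = (means - 1.96 * ses <= TRUE_MEAN) & (TRUE_MEAN <= means + 1.96 * ses)
coverage = covered.mean()
print("Fraction of intervals containing the true mean:", coverage)
```

The 95% refers to the long-run fraction of intervals that capture the parameter, not to any single interval.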
4.4 Hypothesis Testing
• H0: The mean number of daily traffic accidents is the same in 2010 and 2013
(there is only one population, one true mean, and 2010 and 2013 are just different
samples from the same population).
• HA: The mean number of daily traffic accidents in 2010 and 2013 is different
(2010 and 2013 are two samples from two different populations).
Fig. 4.3 This graph shows 100 sample means (green points) and their corresponding confidence
intervals, computed from 100 different samples of 100 elements from our dataset. It can be observed
that a few of them (those in red) do not contain the mean of the population (black horizontal line)
We call H0 the null hypothesis and it represents a skeptical point of view: the
effect we have observed is due to chance (due to the bias of the specific sample). HA
is the alternative hypothesis and it represents the other point of view: the effect
is real.
The general rule of frequentist hypothesis testing is that we will not discard H0 (and
hence we will not consider HA) unless the observed effect is implausible under H0.
We can use the concept represented by confidence intervals to measure the
plausibility of a hypothesis.
We can illustrate the evaluation of the hypothesis setup by comparing the mean
rate of traffic accidents in Barcelona during 2010 and 2013:
In [7]: import pandas as pd

        data = pd.read_csv("files/ch04/ACCIDENTS_GU_BCN_2010.csv",
                           encoding='latin-1')
        # Create a new column which is the date
        data['Date'] = (data['Dia de mes'].apply(lambda x: str(x))
                        + '-' +
                        data['Mes de any'].apply(lambda x: str(x)))
        data2 = data['Date']
        counts2010 = data['Date'].value_counts()
        print('2010: Mean', counts2010.mean())

        # Load the 2013 data (file name assumed by analogy with the 2010 file)
        data = pd.read_csv("files/ch04/ACCIDENTS_GU_BCN_2013.csv",
                           encoding='latin-1')
        # Create a new column which is the date
        data['Date'] = (data['Dia de mes'].apply(lambda x: str(x))
                        + '-' +
                        data['Mes de any'].apply(lambda x: str(x)))
        data2 = data['Date']
        counts2013 = data['Date'].value_counts()
        print('2013: Mean', counts2013.mean())
This estimate suggests that in 2013 the mean rate of traffic accidents in Barcelona
was higher than it was in 2010. But is this effect statistically significant?
Based on our sample, the 95% confidence interval for the mean rate of traffic
accidents in Barcelona during 2013 can be calculated as follows:
In [8]: n = len(counts2013)
        mean = counts2013.mean()
        s = counts2013.std()
        ci = [mean - s * 1.96 / np.sqrt(n), mean + s * 1.96 / np.sqrt(n)]
        print('2010 accident rate estimate:', counts2010.mean())
        print('2013 accident rate estimate:', counts2013.mean())
        print('CI for 2013:', ci)
If we use a 95% confidence interval to test a problem where the null hypothesis is true, we
will make an error whenever the point estimate is at least 1.96 standard errors away from the
population parameter. This happens about 5% of the time (2.5% in each tail).
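The 5% error rate, split into 2.5% per tail, can be checked with a small simulation in which the null hypothesis is true by construction. The normal population and all its parameters below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
mu0, sigma, n, trials = 0.0, 1.0, 50, 20000   # hypothetical values; H0 is true

samples = rng.normal(mu0, sigma, size=(trials, n))
# Distance between each point estimate and the true mean, in standard errors
z = (samples.mean(axis=1) - mu0) / (sigma / np.sqrt(n))
upper = (z >= 1.96).mean()
lower = (z <= -1.96).mean()
print("Upper tail:", upper, "Lower tail:", lower, "Total:", upper + lower)
```

Each of these cases is a wrong rejection of a true null hypothesis, which is the error the text refers to.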
• The first step is to quantify the size of the apparent effect by choosing a test statistic.
In our case, the apparent effect is a difference in accident rates, so a natural choice
for the test statistic is the difference in means between the two periods.
• The second step is to define a null hypothesis, which is a model of the system
based on the assumption that the apparent effect is not real. In our case, the null
hypothesis is that there is no difference between the two periods.
• The third step is to compute a p-value, which is the probability of seeing the
apparent effect if the null hypothesis is true. In our case, we would compute the
difference in means, then compute the probability of seeing a difference as big, or
bigger, under the null hypothesis.
• The last step is to interpret the result. If the p-value is low, the effect is said to be
statistically significant, which means that it is unlikely to have occurred by chance.
In this case we infer that the effect is more likely to appear in the larger population.
1. Pool the two distributions.
2. Generate two samples of size n from the pool and compute the difference in their
means; repeat this a large number of times.
3. Count how many of these differences are larger than the observed one.
In [10]:
# pooling distributions
x = counts2010
y = counts2013
pool = np.concatenate([x, y])
np.random.shuffle(pool)
# sample generation
import random
N = 10000  # number of samples
diff = [0] * N  # a list, since a range object cannot be assigned to
for i in range(N):
    p1 = [random.choice(pool) for _ in range(n)]
    p2 = [random.choice(pool) for _ in range(n)]
    diff[i] = np.mean(p1) - np.mean(p2)
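The remaining step, counting the simulated differences that exceed the observed one, can be sketched as follows. This is a vectorized variant of the scheme above run on made-up data; the function name pooled_resampling_p_value and the synthetic counts are assumptions, and the count is one-sided, as in the listed steps.

```python
import numpy as np

def pooled_resampling_p_value(x, y, n_resamples=10000, seed=None):
    """Fraction of simulated differences in means larger than the observed one."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    observed = abs(x.mean() - y.mean())
    pool = np.concatenate([x, y])              # pool both samples
    # pairs of samples drawn with replacement from the pool
    p1 = rng.choice(pool, size=(n_resamples, len(x)))
    p2 = rng.choice(pool, size=(n_resamples, len(y)))
    diffs = p1.mean(axis=1) - p2.mean(axis=1)
    # one-sided count of differences larger than the observed one
    return float((diffs > observed).mean())

rng = np.random.default_rng(7)
year_a = rng.poisson(22, size=365)   # made-up daily counts
year_b = rng.poisson(27, size=365)   # made-up daily counts, higher rate
p_diff = pooled_resampling_p_value(year_a, year_b, seed=7)
p_same = pooled_resampling_p_value(year_a, year_a, seed=7)
print("p-value, different rates:", p_diff)
print("p-value, same sample twice:", p_same)
```

A small value means that a difference as large as the observed one is rarely produced when both samples come from the same pooled population.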
We do not yet have an answer for this question! We have defined a null hypothesis
H0 (the effect is not real) and we have computed the probability of the observed
effect under the null hypothesis, which is P(E|H0), where E is an effect as big as
or bigger than the apparent effect. This probability is the p-value.
We have stated that, from the frequentist point of view, we cannot consider HA
unless P(E|H0) is less than an arbitrary value. But the real answer to this question
must be based on comparing P(H0|E) to P(HA|E), not on P(E|H0)! One possible
solution to this problem is to use Bayesian reasoning, an alternative to the
frequentist approach.
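The gap between P(E|H0) and P(H0|E) can be made concrete with Bayes' theorem. In the sketch below every number is hypothetical, including the prior and P(E|HA), which the frequentist analysis never specifies; the function name posterior_h0 is ours.

```python
def posterior_h0(p_e_given_h0, p_e_given_ha, prior_h0=0.5):
    """P(H0|E) via Bayes' theorem, with H0 and HA as the only hypotheses."""
    prior_ha = 1.0 - prior_h0
    evidence = p_e_given_h0 * prior_h0 + p_e_given_ha * prior_ha
    return p_e_given_h0 * prior_h0 / evidence

# Same P(E|H0) = 0.05 in both cases; only the prior on H0 changes.
even_prior = posterior_h0(0.05, 0.5, prior_h0=0.5)
skeptical_prior = posterior_h0(0.05, 0.5, prior_h0=0.9)
print(round(even_prior, 3))       # 0.091
print(round(skeptical_prior, 3))  # 0.474
```

With the same P(E|H0), the posterior probability of H0 changes from about 9% to about 47% depending on the prior, which is why P(E|H0) alone cannot settle the comparison between P(H0|E) and P(HA|E).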
No matter how much data you have, you will still depend on intuition to decide
how to interpret, explain, and use them. Data cannot speak for themselves. Data
scientists are interpreters, offering one interpretation of the useful narrative
derived from the data, if there is one at all.
4.6 Conclusions
In this chapter we have seen how we can approach the problem of making probable
propositions regarding population parameters.
We have learned that in some cases, there are theoretical results that allow us to
compute a measure of the variability of our estimates. We have called this approach
the “traditional approach”. Within this framework, we have seen that the sampling
distribution of our estimate of the parameter of interest is the most important concept
when understanding the real meaning of propositions concerning parameters.
We have also learned that the traditional approach is not the only alternative. The
“computationally intensive approach”, based on the bootstrap method, is a relatively
new approach that, based on intensive computer simulations, is capable of computing
a measure of the variability of our estimates by applying a resampling method to
our data sample. Bootstrapping can be used for computing variability of almost any
function of our data, with its only downside being the need for greater computational
resources.
We have seen that propositions about parameters can be classified into three
classes: propositions about point estimates, propositions about set estimates, and
propositions about the acceptance or the rejection of a hypothesis. All these classes
are related, but today set estimates and hypothesis testing are the preferred ones.
Finally, we have shown that the production of probable propositions is not error
free, even in the presence of big data. For this reason, data scientists cannot forget
that after any inference task, they must make decisions regarding the final
interpretation of the data.
Acknowledgements This chapter was co-written by Jordi Vitrià and Sergio Escalera.