
Christian Heumann, Michael Schomaker and Shalabh

Introduction to Statistics and Data Analysis
With Exercises, Solutions and Applications in R
Christian Heumann
Department of Statistics, Ludwig-Maximilians-Universität München, München,
Germany

Michael Schomaker
Centre for Infectious Disease Epidemiology and Research, University of Cape
Town, Cape Town, South Africa

Shalabh
Department of Mathematics and Statistics, Indian Institute of Technology
Kanpur, Kanpur, India

ISBN 978-3-319-46160-1 e-ISBN 978-3-319-46162-5


DOI 10.1007/978-3-319-46162-5

Library of Congress Control Number: 2016955516

© Springer International Publishing Switzerland 2016

This work is subject to copyright. All rights are reserved by the Publisher,
whether the whole or part of the material is concerned, specifically the rights of
translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other physical way, and transmission or
information storage and retrieval, electronic adaptation, computer software, or
by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service


marks, etc. in this publication does not imply, even in the absence of a specific
statement, that such names are exempt from the relevant protective laws and
regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and
information in this book are believed to be true and accurate at the date of
publication. Neither the publisher nor the authors or the editors give a warranty,
express or implied, with respect to the material contained herein or for any errors
or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
The success of the open-source statistical software “R” has made a significant
impact on the teaching and research of statistics in the last decade. Analysing
data is now easier and more affordable than ever, but choosing the most
appropriate statistical methods remains a challenge for many users. To
understand and interpret software output, it is necessary to engage with the
fundamentals of statistics.
However, many readers do not feel comfortable with complicated
mathematics. In this book, we attempt to find a healthy balance between
explaining statistical concepts comprehensively and showing their application
and interpretation using R.
This book will benefit beginners and self-learners from various backgrounds
as we complement each chapter with various exercises and detailed and
comprehensible solutions. The results involving mathematics and rigorous
proofs are separated from the main text, where possible, and are kept in an
appendix for interested readers. Our textbook covers material that is generally
taught in introductory-level statistics courses to students from various
backgrounds, including sociology, biology, economics, psychology, medicine,
and others. Most often, we introduce the statistical concepts using examples and
illustrate the calculations both manually and using R.
However, while we provide a gentle introduction to R (in the appendix), this
is not a software book. Our emphasis lies on explaining statistical concepts
correctly and comprehensively, using exercises and software to delve deeper into
the subject matter and learn about the conceptual challenges that the methods
present.
This book’s homepage, http://chris.userweb.mwn.de/book/, contains
additional material, most notably the software codes needed to answer the
software exercises, and data sets. In the remainder of this book, we will use grey
boxes to introduce the relevant R commands. In many cases, the code can be directly
pasted into R to reproduce the results and graphs presented in the book; in
others, the code is abbreviated to improve readability and clarity, and the
detailed code can be found online.
Many years of teaching experience, from undergraduate to postgraduate
level, went into this book. The authors hope that the reader will enjoy reading it
and find it a useful reference for learning. We welcome critical feedback to
improve future editions of this book. Comments can be sent to
christian.heumann@stat.uni-muenchen.de, shalab@iitk.ac.in, and
michael.schomaker@uct.ac.za, who
contributed equally to this book.
We thank Melanie Schomaker for producing some of the figures and giving
graphical advice, Alice Blanck from Springer for her continuous help and
support, and Lyn Imeson for her dedicated commitment which improved the
earlier versions of this book. We are grateful to our families who have supported
us during the preparation of this book.
Christian Heumann
Michael Schomaker
Shalabh
München, Germany, Cape Town, South Africa, Kanpur, India
November 2016
Contents
Part I Descriptive Statistics

1 Introduction and Framework

1.1 Population, Sample, and Observations

1.2 Variables

1.2.1 Qualitative and Quantitative Variables

1.2.2 Discrete and Continuous Variables

1.2.3 Scales

1.2.4 Grouped Data

1.3 Data Collection

1.4 Creating a Data Set

1.4.1 Statistical Software

1.5 Key Points and Further Issues

1.6 Exercises

2 Frequency Measures and Graphical Representation of Data

2.1 Absolute and Relative Frequencies

2.2 Empirical Cumulative Distribution Function

2.2.1 ECDF for Ordinal Variables

2.2.2 ECDF for Continuous Variables

2.3 Graphical Representation of a Variable


2.3.1 Bar Chart

2.3.2 Pie Chart

2.3.3 Histogram

2.4 Kernel Density Plots

2.5 Key Points and Further Issues

2.6 Exercises

3 Measures of Central Tendency and Dispersion

3.1 Measures of Central Tendency

3.1.1 Arithmetic Mean

3.1.2 Median and Quantiles

3.1.3 Quantile–Quantile Plots (QQ-Plots)

3.1.4 Mode

3.1.5 Geometric Mean

3.1.6 Harmonic Mean

3.2 Measures of Dispersion

3.2.1 Range and Interquartile Range

3.2.2 Absolute Deviation, Variance, and Standard Deviation

3.2.3 Coefficient of Variation

3.3 Box Plots

3.4 Measures of Concentration

3.4.1 Lorenz Curve


3.4.2 Gini Coefficient

3.5 Key Points and Further Issues

3.6 Exercises

4 Association of Two Variables

4.1 Summarizing the Distribution of Two Discrete Variables

4.1.1 Contingency Tables for Discrete Data

4.1.2 Joint, Marginal, and Conditional Frequency Distributions

4.1.3 Graphical Representation of Two Nominal or Ordinal Variables

4.2 Measures of Association for Two Discrete Variables

4.2.1 Pearson’s χ² Statistic

4.2.2 Cramer’s V Statistic

4.2.3 Contingency Coefficient

4.2.4 Relative Risks and Odds Ratios

4.3 Association Between Ordinal and Continuous Variables

4.3.1 Graphical Representation of Two Continuous Variables

4.3.2 Correlation Coefficient

4.3.3 Spearman’s Rank Correlation Coefficient

4.3.4 Measures Using Discordant and Concordant Pairs

4.4 Visualization of Variables from Different Scales

4.5 Key Points and Further Issues


4.6 Exercises

Part II Probability Calculus

5 Combinatorics

5.1 Introduction

5.2 Permutations

5.2.1 Permutations without Replacement

5.2.2 Permutations with Replacement

5.3 Combinations

5.3.1 Combinations without Replacement and without Consideration of the Order

5.3.2 Combinations without Replacement and with Consideration of the Order

5.3.3 Combinations with Replacement and without Consideration of the Order

5.3.4 Combinations with Replacement and with Consideration of the Order

5.4 Key Points and Further Issues

5.5 Exercises

6 Elements of Probability Theory

6.1 Basic Concepts and Set Theory

6.2 Relative Frequency and Laplace Probability

6.3 The Axiomatic Definition of Probability

6.3.1 Corollaries Following from Kolmogorov’s Axioms


6.3.2 Calculation Rules for Probabilities

6.4 Conditional Probability

6.4.1 Bayes’ Theorem

6.5 Independence

6.6 Key Points and Further Issues

6.7 Exercises

7 Random Variables

7.1 Random Variables

7.2 Cumulative Distribution Function (CDF)

7.2.1 CDF of Continuous Random Variables

7.2.2 CDF of Discrete Random Variables

7.3 Expectation and Variance of a Random Variable

7.3.1 Expectation

7.3.2 Variance

7.3.3 Quantiles of a Distribution

7.3.4 Standardization

7.4 Tschebyschev’s Inequality

7.5 Bivariate Random Variables

7.6 Calculation Rules for Expectation and Variance

7.6.1 Expectation and Variance of the Arithmetic Mean

7.7 Covariance and Correlation


7.7.1 Covariance

7.7.2 Correlation Coefficient

7.8 Key Points and Further Issues

7.9 Exercises

8 Probability Distributions

8.1 Standard Discrete Distributions

8.1.1 Discrete Uniform Distribution

8.1.2 Degenerate Distribution

8.1.3 Bernoulli Distribution

8.1.4 Binomial Distribution

8.1.5 Poisson Distribution

8.1.6 Multinomial Distribution

8.1.7 Geometric Distribution

8.1.8 Hypergeometric Distribution

8.2 Standard Continuous Distributions

8.2.1 Continuous Uniform Distribution

8.2.2 Normal Distribution

8.2.3 Exponential Distribution

8.3 Sampling Distributions

8.3.1 χ²-Distribution

8.3.2 t-Distribution
8.3.3 F-Distribution

8.4 Key Points and Further Issues

8.5 Exercises

Part III Inductive Statistics

9 Inference

9.1 Introduction

9.2 Properties of Point Estimators

9.2.1 Unbiasedness and Efficiency

9.2.2 Consistency of Estimators

9.2.3 Sufficiency of Estimators

9.3 Point Estimation

9.3.1 Maximum Likelihood Estimation

9.3.2 Method of Moments

9.4 Interval Estimation

9.4.1 Introduction

9.4.2 Confidence Interval for the Mean of a Normal Distribution

9.4.3 Confidence Interval for a Binomial Probability

9.4.4 Confidence Interval for the Odds Ratio

9.5 Sample Size Determinations

9.6 Key Points and Further Issues

9.7 Exercises
10 Hypothesis Testing

10.1 Introduction

10.2 Basic Definitions

10.2.1 One- and Two-Sample Problems

10.2.2 Hypotheses

10.2.3 One- and Two-Sided Tests

10.2.4 Type I and Type II Error

10.2.5 How to Conduct a Statistical Test

10.2.6 Test Decisions Using the p-Value

10.2.7 Test Decisions Using Confidence Intervals

10.3 Parametric Tests for Location Parameters

10.3.1 Test for the Mean When the Variance is Known (One-Sample Gauss Test)

10.3.2 Test for the Mean When the Variance is Unknown (One-Sample t-Test)

10.3.3 Comparing the Means of Two Independent Samples

10.3.4 Test for Comparing the Means of Two Dependent Samples (Paired t-Test)

10.4 Parametric Tests for Probabilities

10.4.1 One-Sample Binomial Test for the Probability

10.4.2 Two-Sample Binomial Test

10.5 Tests for Scale Parameters


10.6 Wilcoxon–Mann–Whitney (WMW) U-Test

10.7 χ²-Goodness-of-Fit Test

10.8 χ²-Independence Test and Other χ²-Tests

10.9 Key Points and Further Issues

10.10 Exercises

11 Linear Regression

11.1 The Linear Model

11.2 Method of Least Squares

11.2.1 Properties of the Linear Regression Line

11.3 Goodness of Fit

11.4 Linear Regression with a Binary Covariate

11.5 Linear Regression with a Transformed Covariate

11.6 Linear Regression with Multiple Covariates

11.6.1 Matrix Notation

11.6.2 Categorical Covariates

11.6.3 Transformations

11.7 The Inductive View of Linear Regression

11.7.1 Properties of Least Squares and Maximum Likelihood Estimators

11.7.2 The ANOVA Table

11.7.3 Interactions
11.8 Comparing Different Models

11.9 Checking Model Assumptions

11.10 Association Versus Causation

11.11 Key Points and Further Issues

11.12 Exercises

Appendix A: Introduction to R

Appendix B: Solutions to Exercises

Appendix C: Technical Appendix

Appendix D: Visual Summaries

References

Index
About the Authors
Christian Heumann
is a professor at the Ludwig-Maximilians-Universität München, Germany, where
he teaches students in Bachelor and Master programs offered by the Department
of Statistics, as well as undergraduate students in the Bachelor of Science
programs in business administration and economics. His research interests
include statistical modeling, computational statistics and all aspects of missing
data.

Michael Schomaker
is a Senior Researcher and Biostatistician at the Centre for Infectious Disease
Epidemiology & Research (CIDER), University of Cape Town, South Africa. He
received his doctoral degree from the University of Munich. He has taught
undergraduate students for many years and has written contributions for various
introductory textbooks. His research focuses on missing data, causal inference,
model averaging and HIV/AIDS.

Shalabh
is a Professor at the Indian Institute of Technology Kanpur, India. He received
his Ph.D. from the University of Lucknow (India) and completed his post-
doctoral work at the University of Pittsburgh (USA) and University of Munich
(Germany). He has over twenty years of experience in teaching and research. His
main research areas are linear models, regression analysis, econometrics,
measurement error models, missing data models and sampling theory.
Part I
Descriptive Statistics
© Springer International Publishing Switzerland 2016
Christian Heumann, Michael Schomaker and Shalabh, Introduction to Statistics and Data Analysis ,
DOI 10.1007/978-3-319-46162-5_1

1. Introduction and Framework


Christian Heumann1 , Michael Schomaker2 and Shalabh3
(1) Department of Statistics, Ludwig-Maximilians-Universität München,
München, Germany
(2) Centre for Infectious Disease Epidemiology and Research, University of
Cape Town, Cape Town, South Africa
(3) Department of Mathematics and Statistics, Indian Institute of Technology
Kanpur, Kanpur, India

Christian Heumann
Email: Christian.heumann@stat.uni-muenchen.de

Statistics is a collection of methods which help us to describe, summarize,
interpret, and analyse data. Drawing conclusions from data is vital in research,
administration, and business. Researchers are interested in understanding
whether a medical intervention helps in reducing the burden of a disease, how
personality relates to decision-making, whether a new fertilizer increases the
yield of crops, how a political system affects trade policy, who is going to vote
for a political party in the next election, what are the long-term changes in the
population of a fish species, and many more questions. Governments and
organizations may be interested in the life expectancy of a population, the risk
factors for infant mortality, geographical differences in energy usage, migration
patterns, or reasons for unemployment. In business, identifying people who may
be interested in a certain product, optimizing prices, and evaluating the
satisfaction of customers are possible areas of interest.
No matter what the question of interest is, it is important to collect data in a
way which allows its analysis. The representation of collected data in a data set
or data matrix allows the application of a variety of statistical methods. In the
first part of the book, we are going to introduce methods which help us in
describing data, and the second and third parts of the book focus on inferential
statistics, which means drawing conclusions from data. In this chapter, we are
going to introduce the framework of statistics which is needed to properly
collect, administer, evaluate, and analyse data.

1.1 Population, Sample, and Observations


Let us first introduce some terminology and related notations used in this book.
The units on which we measure data—such as persons, cars, animals, or plants
—are called observations. These units/observations are represented by the
Greek symbol ω. The collection of all units is called population and is
represented by Ω. When we refer to ω ∈ Ω, we mean a single unit out of all
units, e.g. one person out of all persons of interest. If we consider a selection of
observations ω₁, ω₂, ..., ωₙ, then these observations are called sample. A sample
is always a subset of the population, {ω₁, ω₂, ..., ωₙ} ⊆ Ω.

Example 1.1.1

If we are interested in the social conditions under which Indian people live,
then we would define all inhabitants of India as Ω and each of its
inhabitants as ω. If we want to collect data from a few inhabitants, then
those would represent a sample from the total population.
Investigating the economic power of Africa’s platinum industry would
require treating each platinum-related company as ω, whereas all platinum-
related companies would be collected in Ω. A few companies
comprise a sample of all companies.
We may be interested in collecting information about those participating in
a statistics course. All participants in the course constitute the population
Ω, and each participant refers to a unit or observation ω.

Remark 1.1.1
Sometimes, the concept of a population is not applicable or difficult to imagine.
As an example, imagine that we measure the temperature in New Delhi every
hour. A sample would then be the time series of temperatures in a specific time
window, for example from January to March 2016. A population in the sense of
observational units does not exist here. But now assume that we measure
temperatures in several different cities; then, all the cities form the population,
and a sample is any subset of the cities.

1.2 Variables
If we have specified the population of interest for a specific research question,
we can think of what is of interest about our observations. A particular feature of
these observations can be collected in a statistical variable X. Any information
we are interested in may be captured in such a variable. For example, if our
observations refer to human beings, X may describe marital status, gender, age,
or anything else which may relate to a person. Of course, we can be interested in
many different features, each of them collected in a different variable
X₁, X₂, ..., Xₚ. Each observation ω takes a particular value x for X. If X refers to
gender, each observation, i.e. each person, has a particular value x which refers
to either “male” or “female”.
The formal definition of a variable is

X : Ω → S,  ω ↦ x   (1.1)

This definition states that a variable X takes a value x for each observation
ω ∈ Ω, whereby the number of possible values is contained in the set S.

Example 1.2.1

If X refers to gender, possible x-values are contained in S = {male, female}.


Each observation is either male or female, and this information is
summarized in X.
Let X be the country of origin for a car. Possible values to be taken by an
observation ω (i.e. a car) are the different countries of origin, collected in a set S.
A variable X which refers to age may take any value between 1 and 125.
Each person is assigned a value x which represents the age of this person.
1.2.1 Qualitative and Quantitative Variables
Qualitative variables are the variables which take values x that cannot be ordered
in a logical or natural way. For example,
the colour of the eye,
the name of a political party, and
the type of transport used to travel to work
are all qualitative variables. Neither is there any reason to list blue eyes before
brown eyes (or vice versa) nor does it make sense to list buses before trains (or
vice versa).
Quantitative variables represent measurable quantities. The values which
these variables can take can be ordered in a logical and natural way. Examples of
quantitative variables are
size of shoes,
price for houses,
number of semesters studied, and
weight of a person.

Remark 1.2.1
It is common to assign numbers to qualitative variables for practical purposes in
data analyses (see Sect. 1.4 for more detail). For instance, if we consider the
variable “gender”, then each observation can take either the “value” male or
female. We may decide to assign 1 to female and 0 to male and use these
numbers instead of the original categories. However, this is arbitrary, and we
could have also chosen “1” for male and “0” for female, or “2” for male and
“10” for female. There is no logical and natural order on how to arrange male
and female, and thus, the variable gender remains a qualitative variable, even
after using numbers for coding the values that X can take.

1.2.2 Discrete and Continuous Variables


Discrete variables are variables which can only take a finite number of values.
All qualitative variables are discrete, such as the colour of the eye or the region
of a country. But also quantitative variables can be discrete: the size of shoes or
the number of semesters studied would be discrete because the number of values
these variables can take is limited.
Variables which can take an infinite number of values are called continuous
variables . Examples are the time it takes to travel to university, the length of an
antelope, and the distance between two planets. Sometimes, it is said that
continuous variables are variables which are “measured rather than counted”.
This is a rather informal definition which helps to understand the difference
between discrete and continuous variables. The crucial point is that continuous
variables can, in theory, take an infinite number of values; for instance, the
height of a person may be recorded as 172 cm. However, the actual height on the
measuring tape might be 172.3 cm which was rounded off to 172 cm. If one had
a better measuring instrument, we may have obtained 172.342 cm. But the real
height of this person is a number with indefinitely many decimal places such as
172.342975328... cm. No matter what we eventually report or obtain, a variable
which can take an infinite number of values is defined to be a continuous
variable.

1.2.3 Scales
The thoughts and considerations from above indicate that different variables
contain different amounts of information. A useful classification of these
considerations is given by the concept of the scale of a variable. This concept
will help us in the remainder of this book to identify which methods are the
appropriate ones to use in a particular setting.

Nominal scale. The values of a nominal variable cannot be ordered.


Examples are the gender of a person (male–female) or the status of an
application (pending–not pending).
Ordinal scale. The values of an ordinal variable can be ordered. However,
the differences between these values cannot be interpreted in a
meaningful way. For example, the possible values of education level
(none–primary education–secondary education–university degree) can
be ordered meaningfully, but the differences between these values cannot
be interpreted. Likewise, the satisfaction with a product (unsatisfied–
satisfied–very satisfied) is an ordinal variable because the values this
variable can take can be ordered, but the differences between
“unsatisfied–satisfied” and “satisfied–very satisfied” cannot be
compared in a numerical way.
Continuous scale. The values of a continuous variable can be ordered.
Furthermore, the differences between these values can be interpreted in a
meaningful way. For instance, the height of a person refers to a
continuous variable because the values can be ordered (170 cm, 171 cm,
172 cm, ...), and differences between these values can be compared (the
difference between 170 and 171 cm is the same as the difference
between 171 and 172 cm). Sometimes, the continuous scale is divided
further into subscales. While in the remainder of the book we typically
do not need these classifications, it is still useful to reflect on them:

Interval scale. Only differences between values, but not ratios, can be
interpreted. An example for this scale is temperature (measured in °C):
the difference between, say, 4 °C and −2 °C is 6 °C and can be interpreted,
but the ratio 4/(−2) = −2 does not mean that −2 °C is twice as cold as 4 °C,
because 0 °C is an arbitrary zero point.
Ratio scale. Both differences and ratios can be interpreted. An example is
speed: 60 km/h is 40 km/h more than 20 km/h. Moreover, 60 km/h is
three times faster than 20 km/h because the ratio between them is 3.
Absolute scale. The absolute scale is the same as the ratio scale, with the
exception that the values are measured in “natural” units. An example is
“number of semesters studied”, where no artificial unit (such as a currency
or a unit of length) is needed: the values are simply 1, 2, 3, ....

1.2.4 Grouped Data


Sometimes, data may be available only in a summarized form: instead of the
original value, one may only know the category or group the value belongs to.
For example,
it is often convenient in a survey to ask for the income (per year) by means
of groups: [€0–€20,000), [€20,000–€30,000), ..., >€100,000;
if there are many political parties in an election, those with a low number of
voters are often summarized in a new category “Other Parties”;
instead of capturing the number of claims made by an insurance company
customer, the variable “claimed” may denote whether or not the customer
claimed at all (yes–no).
If data is available in grouped form, we call the respective variable capturing
this information a grouped variable . Sometimes, these variables are also
known as categorical variables . This is, however, not a complete definition
because categorical variables refer to any type of variable which takes a finite,
possibly small, number of values. Thus, any discrete and/or nominal and/or
ordinal and/or qualitative variable may be regarded as a categorical variable.
Any grouped or categorical variable which can only take two values is called a
binary variable .
To gain a better understanding on how the definitions from the above
sections relate to each other see Fig. 1.1. Qualitative data is always discrete, but
quantitative data can be both discrete (e.g. size of shoes or a grouped variable)
and continuous (e.g. temperature). Nominal variables are always qualitative and
discrete (e.g. colour of the eye), whereas continuous variables are always
quantitative (e.g. temperature). Categorical variables can be both qualitative (e.g.
colour of the eye) and quantitative (satisfaction level on a scale from 1 to 5).
Categorical variables are never continuous.

Fig. 1.1 Summary of variable classifications

1.3 Data Collection


When collecting data, we may ask ourselves how to facilitate this in detail and
how much data needs to be collected. The latter question will be partly answered
in Sect. 9.5; but in general, we can think of collecting data either on all subjects
of interest, such as in a national census, or on a representative sample of the
population. Most commonly, we gather data on a sample (described in the Part I
of this book) and then draw conclusions about the population of interest
(discussed in the Part III of this book). A sample might either be chosen by us or
obtained through third parties (hospitals, government agencies), or created
during an experiment. This depends on the context as described below.
Survey. A survey typically (but not always) collects data by asking questions
(in person or by phone) or providing questionnaires to study participants (as a
printout or online). For example, an opinion poll before a national election
provides evidence about the future government: potential voters are asked by
phone which party they are going to vote for in the next election; on the day of
the election, this information can be updated by asking the same question to a
sample of voters who have just delivered their vote at the polling station (so-
called exit poll). A behavioural research survey may ask members of a
community about their knowledge and attitudes towards drug use. For this
purpose, the study coordinators can send people with a questionnaire to this
community and interview members of randomly selected households.
Ideally, a survey is conducted in a way which makes the chosen sample
representative of the population of interest. If a marketing company interviews
people in a pedestrian zone to find their views about a new chocolate bar, then
these people may not be representative of those who will potentially be
interested in this product. Similarly, if students are asked to fill in an online
survey to evaluate a lecture, it may turn out that those who participate are on
average less satisfied than those who do not. Survey sampling is a complex topic
on its own. The interested reader may consult Groves et al. (2009) or Kauermann
and Küchenhoff (2011).
Experiment. Experimental data is obtained in “controlled” settings. This can
mean many things, but essentially it is data which is generated by the researcher
with full control over one or many variables of interest. For instance, suppose
there are two competing toothpastes, both of which promise to reduce pain for
people with sensitive teeth. If the researcher decided to randomly assign
toothpaste A to half of the study participants, and toothpaste B to the other half,
then this is an experiment because it is only the researcher who decides which
toothpaste is to be used by any of the participants. It is not decided by the
participant. The data of the variable toothpaste is controlled by the experimenter.
Consider another example where the production time of a product can
potentially be reduced by combining two processes. The management could
decide to implement the new process in three production facilities, but leave it as
it is in the other facilities. The production process for the different units
(facilities) is therefore under control of the management. However, if each
facility could decide for itself whether it wanted a change or not, it would not
be an experiment because factors not directly controlled by the management,
such as the leadership style of the facility manager, would determine which
process is chosen.
Observational Data. Observational data is data which is collected routinely,
without a researcher designing a survey or conducting an experiment. Suppose a
blood sample is drawn from each patient with a particular acute infection when
they arrive at a hospital. This data may be stored in the hospital’s folders and
later accessed by a researcher who is interested in studying this infection. Or
suppose a government institution monitors where people live and move to. This
data can later be used to explore migration patterns.
Primary and Secondary Data. Primary data is data we collect ourselves,
i.e. via a survey or experiment. Secondary data, in contrast, is collected by
someone else. For example, data from a national census, publicly available
databases, previous research studies, government reports, historical data, and
data from the internet, among others, are secondary data.

1.4 Creating a Data Set


To make use of statistical analyses, the data needs to be prepared in a particular
way: it is stored in a data matrix (data set) with p columns and n
rows (see Fig. 1.2). Each row corresponds to an observation/unit and each
column to a variable X. This means that, for example, the entry in the fourth row
and second column (x₄₂) describes the value of the fourth observation on the
second variable. The examples below will illustrate the concept of a data set in
more detail.

Fig. 1.2 Data set or data matrix


Example 1.4.1
Suppose five students take examinations in music, mathematics, biology, and
geography. Their marks, measured on a scale between 0 and 100 (where 100 is
the best mark), can be written down as illustrated in Fig. 1.3. Note that each row
refers to a student and each column to a variable. We consider a larger data set in
the next example.

Fig. 1.3 Data set of marks of five students

Fig. 1.4 Pizza data set

Example 1.4.2
Consider the data set described in Appendix A.4. A pizza delivery service
captures information related to each delivery, for example the delivery time, the
temperature of the pizza, the name of the driver, the date of the delivery, the
name of the branch, and many more. To capture the data of all deliveries during
one month, we create a data matrix. Each row refers to a particular delivery,
therefore representing the observations of the data. Each column refers to a
variable. In Fig. 1.4, the variables delivery time (in minutes), temperature (in °C),
and name of branch are listed.
The first row tells us about the features of the first pizza delivery: the
delivery time was 35.1 min, the pizza arrived at the temperature recorded in the
second column, and the pizza was delivered from the branch in the East of the
city. In total, there were n = 1266 deliveries. For nominal variables, such as branch, we may decide
to produce a coding list, as illustrated in Table 1.1: instead of referring to the
branches as “East”, “West”, and “Centre”, we may simply call them 1, 2, and 3.
As we will see in Chap. 11, this has benefits for some analysis methods, though
this is not needed in general.
Table 1.1 Coding list for branch
Variable Values Code
Branch East 1
West 2
Centre 3
Missing 4

If some values are missing, for example because they were never captured or
even lost, then this requires special attention. In Table 1.1, we assign missing
values the number “4” and therefore treat them as a separate category. If we
work with statistical software (see below), we may need other coding such as NA
in the statistical software R or “.” in Stata. More detail can be found in Appendix A.

Another consideration when collecting data is that of transformations: we may
have captured the velocity of cars in kilometres per hour, but may need to
present the data in miles per hour; we have captured the temperature in degrees
Celsius, whereas we need to communicate results in degrees Fahrenheit, or we
have created a satisfaction score which we want to range from −5 to 5, while
the score currently runs from 0 to 20. This is not a problem at all. We can simply
create a new variable which reflects the required transformation. However, valid
transformations depend on the scale of a variable. Variables on an interval scale
can use transformations of the following kind:

g(x) = a + b x,  b > 0.   (1.2)

For ratio scales, only the following transformations are valid:

g(x) = b x,  b > 0.   (1.3)

In the above equation, a is set to 0 because ratios only stay the same if we
respect a variable’s natural point of origin.

Example 1.4.3
The temperature in °F relates to the temperature in °C as follows:

Temperature in °F = 32 + 1.8 · (Temperature in °C).

This means that, for example, 20 °C relates to 32 + 1.8 · 20 = 68 °F. If X is a variable
representing the temperature in °C, we can simply create a new variable Y which
is the temperature in °F. Since temperature is measured on an interval scale, this
transformation is valid.
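As a small illustration (not one of the book’s grey boxes), this transformation could be carried out in R as follows; the vector name temperature_C is an assumption:

# hypothetical vector of temperatures in degrees Celsius
temperature_C <- c(20, 25, 30)
# interval-scale transformation g(x) = a + b*x with a = 32, b = 1.8
temperature_F <- 32 + 1.8 * temperature_C
temperature_F   # 68 77 86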
Changing currencies is also possible. If we would like to represent the price
of a product not in South African Rand but in €, we simply apply the
transformation

Price in € = b · (Price in South African Rand),

whereby b is the currency exchange rate.

1.4.1 Statistical Software


There are a number of statistical software packages which allow data collection,
management, and, most importantly, analysis. In this book, we focus on the
statistical software R which is freely available at http://cran.r-project.org/. A
gentle introduction to R is provided in Appendix A. A data matrix can be created
manually using commands such as matrix(), data.frame(), and others.
Any data can be edited using edit(). However, typically analysts have already
typed their data into databases or spreadsheets, for example in Excel, Access, or
MySQL. In most of these applications, it is possible to save the data as an ASCII
file (.dat), as a tab-delimited file (.txt), or as a comma-separated values file
(.csv). All of these formats allow easy switching between different software and
database applications. Such data can easily be read into R by means of the
following commands:
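The original grey box is not reproduced here; the following is only a minimal sketch, assuming the pizza data is stored as a CSV file called pizza_delivery.csv (both the path and the file name are placeholders):

setwd("C:/data")                         # set the working directory (placeholder path)
pizza <- read.csv("pizza_delivery.csv")  # comma-separated values file
# read.table() or read.delim() could be used in the same way
# for ASCII (.dat) or tab-delimited (.txt) files, respectively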

where setwd specifies the working directory. Alternatively, loading the


library foreign allows the import of data from many different statistical
software packages, notably Stata, SAS, Minitab, SPSS, among others. A detailed
description of data import and export can be found in the respective R manual
available at http://cran.r-project.org/doc/manuals/r-release/R-data.pdf. Once the
data is read into R, it can be viewed in a spreadsheet-like data editor, as sketched below.
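A minimal sketch of the corresponding commands, assuming the data object is called pizza:

edit(pizza)   # opens the data in R's built-in data editor
head(pizza)   # prints the first six rows in the console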
We can also get an overview of the data directly in the R console by
displaying only the top lines of the data with head(). Both approaches are
visualized in Fig. 1.5 for the pizza data introduced in Example 1.4.2.

Fig. 1.5 Viewing data in R

1.5 Key Points and Further Issues


1.6 Exercises
Exercise 1.1
Describe both the population and the observations for the following research
questions:

(a) Evaluation of the satisfaction of employees from an airline.

(b) Description of the marks of students from an assignment.

(c) Comparison of two drugs which deal with high blood pressure.

Exercise 1.2
A national park conducts a study on the behaviour of their leopards. A few of the
park’s leopards are registered and receive a GPS device which allows measuring
the position of the leopard. Use this example to describe the following concepts:
population, sample, observation, value, and variable.

Exercise 1.3
Which of the following variables are qualitative, and which are quantitative?
Specify which of the quantitative variables are discrete and which are
continuous:
Time to travel to work, shoe size, preferred political party, price for a
canteen meal, eye colour, gender, wavelength of light, customer satisfaction
on a scale from 1 to 10, delivery time for a parcel, blood type, number of
goals in a hockey match, height of a child, subject line of an email.

Exercise 1.4
Identify the scale of the following variables:

(a) Political party voted for in an election

(b) The difficulty of different levels in a computer game

(c) Production time of a car

(d) Age of turtles

(e) Calendar year

(f) Price of a chocolate bar

(g) Identification number of a student

(h) Final ranking at a beauty contest

(i) Intelligence quotient.

Exercise 1.5
Make yourself familiar with the pizza data set from Appendix A.4.

(a) First, browse through the introduction to R in Appendix A. Then, read in


the data.

(b) View the data both in the R data editor and in the R console.
(c) Create a new data matrix which consists of the first 5 rows and first 5
variables of the data. Print this data set on the R console. Now, save this
data set in your preferred format.

(d) Add a new variable “NewTemperature” to the data set which converts the
temperature from °C to °F.

(e) Attach the data and list the values from the variable “NewTemperature”.

(f) Use “?” to make yourself familiar with the following commands: str,
dim, colnames, names, nrow, ncol, head, and tail.
Apply these commands to the data to get more information about it.

Exercise 1.6
Consider the research questions of describing parents’ attitudes towards
immunization, what proportion of them wants immunization against chicken pox
for their last-born child, and whether this proportion differs by gender and age.

(a) Which data collection method is the most suitable one to answer the above
questions: survey or experiment?

(b) How would you capture the attitudes towards immunization in a single
variable?

(c) Which variables are needed to answer all the above questions? Describe the
scale of each of them.

(d) Reflect on what an appropriate data set would look like. Now, given this
data set, try to write down the above research questions as precisely as
possible.
© Springer International Publishing Switzerland 2016
Christian Heumann, Michael Schomaker and Shalabh, Introduction to Statistics and Data Analysis ,
DOI 10.1007/978-3-319-46162-5_2

2. Frequency Measures and Graphical Representation of Data
Christian Heumann1 , Michael Schomaker2 and Shalabh3
(1) Department of Statistics, Ludwig-Maximilians-Universität München,
München, Germany
(2) Centre for Infectious Disease Epidemiology and Research, University of
Cape Town, Cape Town, South Africa
(3) Department of Mathematics and Statistics, Indian Institute of Technology
Kanpur, Kanpur, India

Christian Heumann
Email: Christian.heumann@stat.uni-muenchen.de

In Chap. 1, we highlighted that different variables contain different levels of
information. When summarizing or visualizing one or more variable(s), it is this
information which determines the appropriate statistical methods to use.
Suppose we are interested in studying the employment opportunities and
starting salaries of university graduates with a master’s degree. Let the variable
X denote the starting salaries measured in €/year. Now suppose 100 graduate
students provide their initial salaries. Let us write down the salary of the first
student as x₁, the salary of the second student as x₂, and so on. We therefore
have 100 observations x₁, x₂, ..., x₁₀₀.
best to extract meaningful information from them? The answer to this question
depends upon several aspects like the nature of the recorded data, e.g. how many
observations have been obtained (either small in number or large in number) or
how the data was recorded (either exact values were obtained or the values were
obtained in intervals). For example, the starting salaries may be obtained as
exact values, say 51,500 €/year, 32,350 €/year, etc. Alternatively, these values
could have been summarized in categories such as low income (<30,000 €/year),
medium income (30,000–50,000 €/year), high income (50,000–70,000 €/year),
and very high income (>70,000 €/year). Another approach is to ask whether the
students were employed or not after graduating and record the data in terms of
“yes” or “no”. It is evident that the latter classification is less detailed than the
grouped income data which is less detailed than the exact data. Depending on
which conceptualization of “starting salary” we use, we need to choose the
approach to summarize the data, that is the 100 values relating to the 100
graduated students.

2.1 Absolute and Relative Frequencies


Discrete Data. Let us first consider a simple example to illustrate our notation.

Example 2.1.1
Suppose there are ten people in a supermarket queue. Each of them is either
coded as “F” (if the person is female) or “M” (if the person is male). The
collected data may look like

M, F, M, F, M, M, M, F, M, M.

There are now two categories in the data: male (M) and female (F). We use a₁ to
refer to the male category and a₂ to refer to the female category. Since there are
seven male and three female customers, we have 7 values in category a₁, denoted
as n₁ = 7, and 3 values in category a₂, denoted as n₂ = 3. The number of
observations in a particular category is called the absolute frequency. It follows
that n₁ = 7 and n₂ = 3 are the absolute frequencies of a₁ and a₂, respectively.
Note that n₁ + n₂ = n = 10, which is the same as the total number of collected
observations. We can also calculate the relative frequencies of a₁ and a₂ as
f₁ = f(a₁) = n₁/n = 7/10 = 0.7 = 70 % and f₂ = f(a₂) = n₂/n = 3/10 = 0.3 = 30 %, respectively.
This gives us information about the proportions of male and female customers in
the queue.
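A small sketch (not a grey box from the book) of how these frequencies could be obtained in R; the vector name gender is an assumption:

gender <- c("M", "F", "M", "F", "M", "M", "M", "F", "M", "M")
table(gender)                   # absolute frequencies: F = 3, M = 7
table(gender) / length(gender)  # relative frequencies: F = 0.3, M = 0.7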

We now extend these concepts to a general framework for the summary of data
on discrete variables. Suppose there are k categories denoted as a₁, a₂, ..., aₖ,
with nⱼ (j = 1, 2, ..., k) observations in category aⱼ. The absolute frequency nⱼ is
defined as the number of units in the jth category aⱼ. The sum of the absolute
frequencies equals the total number of units in the data: n₁ + n₂ + ⋯ + nₖ = n. The relative
frequencies of the jth class are defined as

fⱼ = f(aⱼ) = nⱼ / n,   j = 1, 2, ..., k.   (2.1)

The relative frequencies always lie between 0 and 1 and f₁ + f₂ + ⋯ + fₖ = 1.
Grouped Continuous Data. Data on continuous variables usually has a
large number (k) of different values. Sometimes k may even be the same as n and
in such a case the relative frequencies become fⱼ = 1/n for all j. However, it is
possible to define intervals in which the observed values are contained.

Example 2.1.2
Consider the following results of the written part of a driving licence
examination (a maximum of 100 points could be achieved):

We can summarize the results in class intervals such as 0–20, 21–40, 41–60, 61–
80, and 81–100, and the data can be presented as follows:

Class intervals 0–20 21–40 41–60 61–80 81–100


Absolute frequencies

Relative frequencies

We have n = n₁ + n₂ + ⋯ + n₅ and f₁ + f₂ + ⋯ + f₅ = 1.

Now, suppose the n observations can be classified into k class intervals
a₁, a₂, ..., aₖ, where aⱼ (j = 1, 2, ..., k) contains nⱼ observations with n₁ + n₂ + ⋯ + nₖ = n.
The relative frequency of the jth class is fⱼ = nⱼ/n and f₁ + f₂ + ⋯ + fₖ = 1. Table 2.1
displays the frequency distribution of a discrete variable X.
Table 2.1 Frequency distribution for discrete data

Class intervals        a₁   a₂   ...   aₖ
Absolute frequencies   n₁   n₂   ...   nₖ
Relative frequencies   f₁   f₂   ...   fₖ

Example 2.1.3
Consider the pizza delivery service data (Example 1.4.2, Appendix A.4). We are
interested in the pizza deliveries by branch and generate the respective frequency
table, showing the distribution of the data, using the table command in R
(after reading in and attaching the data) as
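The call itself is only a sketch; it assumes the data frame has been read in and attached so that the variable branch can be accessed directly:

attach(pizza)   # assuming the data frame is called pizza
table(branch)   # absolute frequencies per branch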

Centre East West

421 410 435

We have n = 421 + 410 + 435 = 1266 deliveries. We can see from this table


that each branch has a similar absolute number of pizza deliveries and each
branch contributes to about one-third of the total number of deliveries.

2.2 Empirical Cumulative Distribution Function


Another approach to summarize and visualize the (frequency) distribution of
variables is the empirical cumulative distribution function, often abbreviated
as “ECDF”. As the name itself suggests, it gives us an idea about the cumulative
relative frequencies up to a certain point. For example, say we want to know
how many people scored up to 60 points in Example 2.1.2. Then, this can be
calculated by adding the number of people in the class intervals 0–20, 21–40,
and 41–60, which corresponds to n₁ + n₂ + n₃ and is the cumulative
frequency. If we want to know the relative frequency of people obtaining up to
60 points, we have to add the relative frequencies of the people in the class
intervals 0–20, 21–40, and 41–60 as f₁ + f₂ + f₃.
Before discussing the empirical cumulative distribution function in a more
general framework, let us first understand the concept of ordered values.
Suppose the heights of four people are recorded as the values x₁, x₂, x₃, and
x₄ (in cm). We arrange these values in an order,
say ascending order, i.e. first the smallest value (denoted as x₍₁₎) and lastly the
largest value (denoted as x₍₄₎). We obtain

x₍₁₎ ≤ x₍₂₎ ≤ x₍₃₎ ≤ x₍₄₎.

The values x₍₁₎, x₍₂₎, x₍₃₎, and x₍₄₎ are called ordered values for which
x₍₁₎ ≤ x₍₂₎ ≤ x₍₃₎ ≤ x₍₄₎ holds. Note that x₁ is not necessarily the smallest value but x₍₁₎
is necessarily the smallest value. In general, if we have n observations
x₁, x₂, ..., xₙ, then the ordered data is x₍₁₎ ≤ x₍₂₎ ≤ ⋯ ≤ x₍ₙ₎.
Consider n observations x₁, x₂, ..., xₙ of a variable X, which are arranged in
ascending order as x₍₁₎ ≤ x₍₂₎ ≤ ⋯ ≤ x₍ₙ₎ (and are thus on an at least ordinal scale).
The empirical cumulative distribution function F(x) is defined as the
cumulative relative frequencies of all values aⱼ which are smaller than, or equal
to, x:

F(x) = Σ f(aⱼ), where the sum runs over all categories aⱼ ≤ x.   (2.2)

This definition implies that F(x) is a monotonically non-decreasing function,
lim_{x→−∞} F(x) = 0 (the lower limit of F is 0), lim_{x→+∞} F(x) = 1 (the
upper limit of F is 1), and F(x) is right continuous.

2.2.1 ECDF for Ordinal Variables


The empirical cumulative distribution function of ordinal variables is a step
function.

Example 2.2.1
Consider a customer satisfaction survey from a car service company. The 200
customers who had a car service done within the last 30 days were asked to
respond regarding their overall level of satisfaction with the quality of the car
service on a scale from 1 to 5, where 1 means “not satisfied at all” and 5 refers to
the highest level of satisfaction. Based on the frequency of each option, we can calculate the relative
frequencies and then plot the empirical cumulative distribution function, either
manually (takes longer) or by using R (quick):
Satisfaction level (aⱼ)   1         2         3          4          5
nⱼ                        4         16        90         70         20
fⱼ = f(aⱼ)                4 / 200   16 / 200  90 / 200   70 / 200   20 / 200
Fⱼ = F(aⱼ)                4 / 200   20 / 200  110 / 200  180 / 200  200 / 200

The Fⱼ’s are calculated as follows:
F₁ = f₁ = 4/200,  F₂ = f₁ + f₂ = 20/200,  F₃ = f₁ + f₂ + f₃ = 110/200,
F₄ = f₁ + f₂ + f₃ + f₄ = 180/200,  F₅ = f₁ + f₂ + f₃ + f₄ + f₅ = 200/200.

The ECDF for this data can be obtained by summarizing the data in a vector
and using the plot.ecdf() function in R, see Fig. 2.1:
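A sketch of that R code (the vector name sv is an assumption):

# satisfaction levels of the 200 customers, reconstructed from the frequencies
sv <- c(rep(1, 4), rep(2, 16), rep(3, 90), rep(4, 70), rep(5, 20))
plot.ecdf(sv)   # step-function ECDF as in Fig. 2.1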

Fig. 2.1 ECDF for the satisfaction survey

The ECDF can be used to obtain the relative frequencies for values contained in
certain intervals as

H(a < X ≤ b) = proportion of observed values x with a < x ≤ b.

It further follows that:

H(X ≤ a) = F(a)   (2.3)
H(X < a) = F(a) − f(a)   (2.4)
H(X > a) = 1 − F(a)   (2.5)
H(X ≥ a) = 1 − F(a) + f(a)   (2.6)
H(a < X ≤ b) = F(b) − F(a)   (2.7)
H(a ≤ X ≤ b) = F(b) − F(a) + f(a)   (2.8)
H(a < X < b) = F(b) − F(a) − f(b)   (2.9)
H(a ≤ X < b) = F(b) − F(a) − f(b) + f(a)   (2.10)

Example 2.2.2
Suppose, in Example 2.2.1, we want to know how many customers are not
satisfied with their car service. Then, using the data relating to the responses “1”
and "2", we observe from the ECDF that (4 + 16)/200 = 0.1 = 10 % of the customers
were not satisfied with the car service. This relates to using rule (2.3):
H(X ≤ 2) = F(2) = 0.1 or 10 %. Similarly, the proportion of customers who are
more than satisfied can be obtained using (2.5) as
H(X > 3) = 1 − F(3) = 1 − 0.55 = 0.45 or 45 %.

2.2.2 ECDF for Continuous Variables


In general, we can apply formulae (2.2)–(2.10) to continuous data as well.
However, before demonstrating their use, let us consider a somewhat different
setting. Let us assume that a continuous variable of interest is only available in
the form of grouped data. We may assume that the observations within each
group, i.e. each category or each interval, are distributed uniformly over the
entire interval. The ECDF then consists of straight lines connecting the lower
and upper values of the ECDF in each of the intervals. To understand this
concept in more detail, we introduce the following notation:

k : the number of groups (class intervals),
eⱼ₋₁ : the lower limit of the jth interval,
eⱼ : the upper limit of the jth interval,
dⱼ = eⱼ − eⱼ₋₁ : the width of the jth interval,
nⱼ, fⱼ : the absolute and relative frequencies of the jth interval.

Under the assumption that all values in a particular interval are distributed
uniformly within this interval, the empirical cumulative distribution function
relates to a polygonal chain connecting the points (e₀, 0), (e₁, F(e₁)), ..., (eₖ₋₁, F(eₖ₋₁)), (eₖ, 1). The ECDF can then be defined as

F(x) = F(eⱼ₋₁) + (fⱼ/dⱼ)(x − eⱼ₋₁)   for eⱼ₋₁ ≤ x < eⱼ,   (2.11)

with F(e₀) = 0. The idea behind (2.11) is presented in Fig. 2.2. For any interval
(eⱼ₋₁, eⱼ], the respective lower and upper limits of the ECDF are F(eⱼ₋₁) and F(eⱼ).
If we assume values to be distributed uniformly over this interval, we can
connect F(eⱼ₋₁) and F(eⱼ) with a straight line. To obtain F(x) with eⱼ₋₁ ≤ x < eⱼ,
we simply add the height of the ECDF between eⱼ₋₁ and x to
F(eⱼ₋₁).

Fig. 2.2 Illustration of the ECDF for continuous data available in groups/intervals

Table 2.2 The values needed to calculate the ECDF for the grouped pizza delivery time data in Example
2.2.3

Delivery time   j   eⱼ₋₁   eⱼ   nⱼ   fⱼ   F(eⱼ)

[0; 10] 1 0 10 0 0.0000 0.0000


(10; 15] 2 10 15 3 0.0024 0.0024
(15; 20] 3 15 20 21 0.0166 0.0190
(20; 25] 4 20 25 75 0.0592 0.0782
(25; 30] 5 25 30 215 0.1698 0.2480
(30; 35] 6 30 35 373 0.2946 0.5426
(35; 40] 7 35 40 350 0.2765 0.8191
(40; 45] 8 40 45 171 0.1351 0.9542
(45; 50] 9 45 50 52 0.0411 0.9953
(50; 55] 10 50 55 6 0.0047 1.0000

Example 2.2.3
Consider Example 2.1.3 of the pizza delivery service. Suppose we are interested
in determining the distribution of the pizza delivery times. Using the function
plot.ecdf() in R, we obtain the ECDF of the continuous data, see Fig. 2.3a.
Note that the structure of the curve is a step function but now almost looks like a
continuous curve. The reason for this is that when the number of observations is
large, then the lengths of class intervals become small. When these small lengths
are joined together, they appear like a continuous curve. As the number of
observations increases, the smoothness of the curve increases too. If the number
of observations is not large, e.g. suppose the data is reported as a summary from
the drivers, i.e. whether the delivery took <15 min, between 15 and 20 min,
between 20 and 25 min, and so on, then we can construct the ECDF by creating
a table summarizing the data features as in Table 2.2.
Figure 2.3b shows the ECDF based on the grouped data evaluated in
Table 2.2. It is interesting to see that the graphs emerging from the use of the
grouped data and ungrouped data are similar in this specific example.
Suppose we are interested in calculating how many deliveries were
completed within the desired time limit of 30 min, with a tolerance of maximum
10 % deviation, i.e. a deviation of 3 min. We can evaluate the ECDF at x = 33
min. Based on (2.11), we calculate
F(33) = F(30) + (f₆/d₆)(33 − 30) = 0.2480 + (0.2946/5) · 3 ≈ 0.4248. Thus,
we conclude, based on the grouped data, that only about 42 % of the deliveries
were completed in the desired time frame.
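The same calculation can be sketched in R (the numbers are taken from Table 2.2):

F30 <- 0.2480   # ECDF at the lower interval limit of 30 minutes
f6  <- 0.2946   # relative frequency of the interval (30; 35]
d6  <- 5        # width of that interval
F33 <- F30 + f6 / d6 * (33 - 30)
F33             # approximately 0.4248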

2.3 Graphical Representation of a Variable


Frequency tables and empirical cumulative distribution functions are useful in
providing a numerical summary of a variable. Graphs are an alternative way to
summarize a variable’s information. In many situations, they have the advantage
of conveying the information hidden in the data more compactly. Similarly,
someone’s mood can be more easily understood when looking at a smiley than
by reading an essay about one’s mood in a long paragraph.
Fig. 2.3 Empirical cumulative distribution function for pizza delivery time

2.3.1 Bar Chart


A simple tool to visualize the relative or absolute frequencies of observed values
of a variable is a bar chart . A bar chart can be used for nominal and ordinal
variables, as long as the number of categories is not very large. It consists of one
bar for each category. The height of each bar is determined by either the absolute
frequency or the relative frequency of the respective category and is shown on
the y-axis. If the variable is measured on an ordinal level, then it is
recommended to arrange the bars on the x-axis according to their ranks or
values. If the number of categories is large, then the number of bars will be large
too and the bar chart, in turn, may not remain informative.

Example 2.3.1
Consider Example 2.1.1 in which ten people, queueing in a supermarket, were
classified as being either male (M) or female (F). The absolute frequencies for
males and females are n₁ = 7 and n₂ = 3, respectively. Since there are two
categories, M and F, two bars are needed to construct the chart—one for the
male category and another for the female category. The heights of the bars are
determined as either n₁ = 7 and n₂ = 3 or f₁ = 0.7 and f₂ = 0.3. These graphs are
shown in Fig. 2.4.
Fig. 2.4 Bar charts

Example 2.3.2
Consider the data in Example 2.1.3, where the pizza delivery times for each
branch are recorded over a period of 1 month. The frequency table forms the
basis for the bar chart, either using the absolute or relative frequencies on the y-
axis. Figure 2.5 shows the bar charts for the number and proportion of pizza
deliveries per branch. The graphs can be produced in R by applying the
barplot command to a frequency table:
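A sketch of such calls, assuming the branch variable is accessible as in Example 2.1.3:

barplot(table(branch))                    # absolute frequencies
barplot(table(branch) / length(branch))   # relative frequencies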

Remark 2.3.1
Instead of vertical bars, horizontal bars can be drawn using the optional
argument horiz=TRUE in the barplot command.
Fig. 2.5 Bar charts for the pizza deliveries per branch

2.3.2 Pie Chart


Pie charts are another option to visualize the absolute and relative frequencies of
nominal and ordinal variables. A pie chart is a circle partitioned into segments,
where each of the segments represents a category. The size of each segment
depends upon the relative frequency fⱼ and is determined by the angle fⱼ · 360°.

Example 2.3.3
To illustrate the construction of a pie chart, let us consider again Example 2.1.1
in which ten people in a supermarket queue were classified as being either male
(M) or female (F): M, F, M, F, M, M, M, F, M, M. The pie chart for this data will
have two segments: one for males and another one for females. The relative
frequencies are f₁ = 0.7 and f₂ = 0.3, respectively. The size of the segment for
the first category (M) is f₁ · 360° = 252°, and the size of the segment
for the second category (F) is f₂ · 360° = 108°. The pie chart is
shown in Fig. 2.6a.

Example 2.3.4
Consider again Example 2.2.1, where 200 customers were asked about their
level of satisfaction (5 categories) with their car service. The pie chart for this
example consists of five segments representing the categories 1, 2, 3, 4, and 5.
The size of the jth segment is fⱼ · 360°. For example, for category 1,
there are 4 out of 200 customers who are not satisfied at all. The angle of the
segment “not satisfied at all” therefore is f₁ · 360° = (4/200) · 360° = 7.2°. Similarly,
we can calculate the angles of the other segments and obtain a pie chart as shown
in Fig. 2.6b using the pie command in R:
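A sketch of that call, again assuming a vector sv containing the 200 satisfaction responses (as in the ECDF sketch above):

pie(table(sv))   # one segment per satisfaction level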

Fig. 2.6 Pie charts

Remark 2.3.2
Note that the area of a segment is not proportional to the absolute frequency of
the respective category. Instead, the area of the segment is proportional to the
angle (and depends also on the radius of the whole circle). It has been
argued that this may cause improper interpretations as the human eye may catch
the segment’s area more easily than the angle of a segment. Pie charts should
therefore be used with caution.

2.3.3 Histogram
If a variable consists of a large number of different values, the number of
categories used to construct bar charts will consequently be large too. A bar chart
may thus not give a clear summary when applied to a continuous variable.
Instead, a histogram is the appropriate choice to represent the distribution of
values of continuous variables. It is based on the idea to categorize the data into
different groups and plot the bars for each category with height fⱼ/dⱼ, where
dⱼ denotes the width of the jth class interval or category. An important
consideration for this concept is that the area of the bars (= height × width) is
proportional to the relative frequency. This means that the widths of the bars
need not necessarily to be the same because different widths can be adjusted
with different heights of the bars.

Example 2.3.5
Consider Example 2.1.2, where people were divided into five class
intervals 0–20, 21–40, 41–60, 61–80, and 81–100 based on their performance in
a written driving licence examination. The frequency table is given as

Class intervals 0–20 21–40 41–60 61–80 81–100


Absolute freq

Relative freq

Height

The histogram for this grouped data set has five categories and therefore it
has five bars. Since the widths of class intervals are the same, the heights of the
bars are proportional to the relative frequency of the respective category. The
resulting histogram is displayed in Fig. 2.7.

Fig. 2.7 Histogram for the scores of the people

Example 2.3.6
Recall Example 2.2.3 and the variable “pizza delivery time”. Table 2.3 shows the
summary of the grouped data and the values needed to calculate the histogram.
Figure 2.8a shows the histogram with equal widths of delivery time intervals.
We see a symmetric distribution of the pizza delivery times, but many delivery
times exceeding the target time of 30 min. If the histogram is required to have
different widths for different bars, i.e. different delivery time intervals for
different categories, then it can also be constructed as shown in Fig. 2.8b. This
representation is different from Fig. 2.8a. The following commands in R are used
to construct the histograms for absolute and relative frequencies, respectively:

Table 2.3 Values needed to calculate the histogram for the grouped pizza delivery time data
Delivery time   j   e_{j-1}   e_j   d_j   f_j   h_j = f_j/d_j

[0; 10] 1 0 10 10 0.0000 0.00000


(10; 15] 2 10 15 5 0.0024 0.00047
(15; 20] 3 15 20 5 0.0166 0.00332
(20; 25] 4 20 25 5 0.0592 0.01185
(25; 30] 5 25 30 5 0.1698 0.03397
(30; 35] 6 30 35 5 0.2946 0.05893
(35; 40] 7 35 40 5 0.2765 0.05529
(40; 45] 8 40 45 5 0.1351 0.02701
(45; 50] 9 45 50 5 0.0411 0.00821
(50; 55] 10 50 55 5 0.0047 0.00094
Fig. 2.8 Histogram for pizza delivery time

Remark 2.3.3
The R command truehist() from the library MASS presents an alternative to
the hist() command. The default specifications are somewhat different, and
many users prefer it to the command hist.
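A minimal sketch of this alternative (again using the assumed vector name time):
library(MASS)
truehist(time)   # histogram on the density scale by default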

2.4 Kernel Density Plots


A disadvantage of histograms is that continuous data is categorized artificially.
The choice of the class intervals is crucial for the final look of the graph. A more
elegant way to deal with this problem is to smooth the histogram in the sense
that each observation may contribute to different classes with different weights,
and the distribution is represented by a continuous function rather than a step
function. A kernel density plot can be produced by using the following function:

f_n(x) = (1/(nh)) Σ_{i=1}^{n} K((x − x_i)/h),   (2.12)

where n is the sample size, h is the bandwidth, and K is a kernel function, for example the rectangular kernel K(x) = 1/2 for |x| ≤ 1 (and 0 otherwise) or the Epanechnikov kernel K(x) = (3/4)(1 − x²) for |x| ≤ 1 (and 0 otherwise).
To better understand this concept, consider Fig. 2.9a. The tick marks on the x-
axis represent five observations: 3, 6, 7, 8, and 10. On each observation as
well as its surrounding values, we apply a kernel function, which is the
Epanechnikov kernel in the figure. This means that we have five functions (grey,
dashed lines), which refer to the five observations. These functions are largest at
the observation itself and become gradually smaller as the distance from the
observation increases. Summing up the functions, as described in Eq. (2.12),
yields the solid black line, which is the kernel density plot of the five
observations. It is a smooth curve, which represents the data distribution. The
degree of smoothness can be controlled by the bandwidth h, which is chosen as 2
in Fig. 2.9a.
Fig. 2.9 Construction of kernel density plots

The choice of the kernel may affect the overall look of the plot. Above, we
have given the functions for the rectangular and Epanechnikov kernels.
However, another common function for kernel density plots is the normal
distribution function, which is introduced in Sect. 8.2.2, see Fig. 2.9b for a
comparison of different kernels. The kernel which is based on the normal
distribution is called the “Gaussian kernel” and is the default in R, where a
kernel density plot can be produced combining the plot and density
commands:
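A possible sketch, with x denoting an arbitrary numeric vector:
# kernel density plot with the (default) Gaussian kernel
plot(density(x, kernel = "gaussian"))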

Please note that kernel functions are not defined arbitrarily and need to
satisfy certain conditions, such as those required for probability density
functions as explained in Chap. 7, Theorem 7.2.1.

Example 2.4.1
Let us consider the pizza data which we introduced earlier and in Appendix A.4.
We can summarize the delivery time by using a kernel density plot using the R
command plot(density(time)) and compare it with a histogram, see
Fig. 2.10a. We see that the delivery times are symmetric around 35 min. If we
shorten the bandwidth to a half of the default bandwidth (option adjust=0.5),
the kernel density plot becomes more wiggly, which is illustrated in Fig. 2.10b.
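A sketch of these two plots (assuming the delivery times are stored in the vector time):
plot(density(time))                 # default bandwidth
plot(density(time, adjust = 0.5))   # half the default bandwidth, more wiggly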
Fig. 2.10 Kernel density plot for delivery time

2.5 Key Points and Further Issues


2.6 Exercises
Exercise 2.1
Consider the results of the national elections in South Africa in 2014 and 2009:

Party Results 2014 (%) Results 2009 (%)


ANC (African National Congress) 62.15 65.90
DA (Democratic Alliance) 22.23 16.66
EFF (Economic Freedom Fighters) 6.35 –
IFP (Inkatha Freedom Party) 2.40 4.55
COPE (Congress of the People) 0.67 7.42
Others 6.20 5.47

(a) Summarize the results of the 2014 elections in a bar chart. Do it manually
and by using R.

(b) How would you compare the results of the 2009 and 2014 elections? Offer
a simple solution that can be represented in a single plot. Construct this
plot in R.

Exercise 2.2
Consider a variable X describing the time until the first goal was scored in the
matches of the 2006 football World Cup competition. Only matches with at least
one goal are considered, and goals scored during the nth minute of extra time are denoted as 90 + n:

6 24 90+1 8 4 25 3 83 89 34 25 24 18 6
23 10 28 4 63 6 60 5 40 2 22 26 23 26
44 49 34 2 33 9 16 55 23 13 23 4 8 26
70 4 6 60 23 90+5 28 49 6 57 33 56 7

(a) What is the scale of X?

(b) Write down the frequency table of X based on the following categories:
[0, 15), [15, 30), [30, 45), [45, 60), [60, 75), [75, 90), [90, 96).
(c) Draw the histogram for X with intervals relating to the groups from the
frequency table.

(d) Now use R to reproduce the histogram. Compare the histogram to a kernel
density plot of your choice.

(e) Calculate the empirical cumulative distribution function for the grouped
data.

(f) Use R to plot the ECDF (via a step function) for

(i) the original data and

(ii) the grouped data.

(g) Consider the grouped data. Now assume that the values within each
interval are distributed uniformly. Determine the proportion of first goals
which occurred

(i) in the first half, i.e. during the first 45 min,

(ii) in the last 10 min or during the extra time,

(iii) between the 20th and 65th minute, i.e. what is F(65) − F(20)?

(h) Determine the time point by which the first goal had been scored in 80 % of the matches.

Exercise 2.3
Suppose we have the following information to construct a histogram for a
continuous variable with 2000 observations:
j   e_{j-1}   e_j   d_j   h_j

1 0 1 1 0.125
2 1 4 3 0.125
3 4 7 3 0.125
4 7 8 1 0.125

(a) Determine the relative frequencies for each interval.

(b) Determine the absolute frequencies.

Exercise 2.4
A university survey was conducted on 500 first-year students to obtain
knowledge about the size of their accommodation (in square metres).

j   Size of accommodation (in m²)   F(x)

1 8–14 0.25
2 14–22 0.40
3 22–34 0.75
4 34–50 0.97
5 50–82 1.00

(a) Determine the absolute frequencies for each category.

(b) What proportion of people live in a flat of at least 34 m²?

Exercise 2.5
Consider a survey in which 100 people were asked to rate on a scale from 1 to
10 how much they agree with the statement that “there is too much football on
television”. The results are summarized below:

Score 0 1 2 3 4 5 6 7 8 9 10
Responses 0 1 3 8 8 27 30 11 6 4 2

(a) Calculate and draw the ECDF of the scores.


(b) Determine F(3) and F(9).

(c) Consider the situation, where the data is summarized in the two categories
“disagree” (score ) and “agree” (score ). What would the ECDF look
like under the approach outlined in (2.11)? Determine F(3) and F(9) for the
summarized data.

(d) Explain the differences between (b) and (c).

Exercise 2.6
It is possible to produce professional graphics in R. However, it is advantageous
to go beyond the default options. To demonstrate this, consider Example 2.1.3
about the pizza delivery data, which is described in Appendix A.4.

(a) Set the working directory in R (setwd()), read in the data


(read.csv()), and attach the data. Draw a histogram of the variable
“temperature”. Type ?hist, and view the options. Adjust the histogram so
that you are satisfied with (i) axes labelling, (ii) axes range, and (iii) colour.
Now use the lines() command to add a dashed vertical line at 65 °C
(which is the minimum temperature the pizza should have at the time of
delivery).

(b) Consider a different approach, which constructs plots by means of multiple


layers using ggplot2 . You need an Internet connection to install the
package using the command install.packages(’ggplot2’).
Browse through the help pages on http://docs.ggplot2.org/current/. Look
specifically at the examples for ggplot, qplot, geom_histogram, and scale_y_continuous. Try to
understand the roles of “aesthetics” and “geoms”. Now, after loading the
library via library(ggplot2), create a ggplot object for the pizza
data, which declares “temperature” to be the x-variable. Now add a layer
with geom_histogram to create a histogram with interval width of 2.5
and dark grey bars which are 50 % transparent. Change the y-axis labelling
by adding the relevant layer using scale_y_continuous. Plot the
graph.

(c) Now create a normal bar chart for the variable “driver” in R. Type ?
barplot and ?par to see the options one can pass on to barplot()
to adjust the graph. Make the graph look good.

(d) Now create the same bar chart with ggplot2. Use qplot instead of
ggplot to create the plot. Use an option which makes each bar to consist
of segments relating to the day of delivery, so that one can see the number
of deliveries by driver to highlight during which days the drivers delivered
most often. Browse through “themes” and “scales” on the help page, and
add layers that make the background black and white and the bars on a
grey scale.

Source: Toutenburg, H., Heumann, C., Deskriptive Statistik, 7th edition, 2009, Springer, Heidelberg

3. Measures of Central Tendency and Dispersion

A data set may contain many variables and observations. However, we are not
always interested in each of the measured values but rather in a summary which
interprets the data. Statistical functions fulfil the purpose of summarizing the
data in a meaningful yet concise way.

Example 3.0.1
Suppose someone from Munich (Germany) plans a holiday in Bangkok
(Thailand) during the month of December and would like to get information
about the weather when preparing for the trip. Suppose last year’s maximum
temperatures during the day (in degrees Celsius) for December 1–31 are as
follows:

How do we draw conclusions from this data? Looking at the individual values
gives us a feeling about the temperatures one can experience in Bangkok, but it
does not provide us with a clear summary. It is evident that the average of these 31 values, the “Sum of all values/Total number of observations”, i.e. 821/31 ≈ 26.48 °C, is meaningful in the sense that we know what temperature to expect “on average”. To choose the right clothing for the holidays, we may also be interested in knowing the temperature range to understand the variability in temperature, which is between 21 °C and 31 °C.
Summarizing 31 individual values with only three numbers (26.48, 21, and 31)
will provide sufficient information to plan the holidays.

In this chapter, we focus on the most important statistical concepts to summarize


data: these are measures of central tendency and variability. The applications of
each measure depend on the scale of the variable of interest, see Appendix D.1
for a detailed summary.

3.1 Measures of Central Tendency


A natural human tendency is to make comparisons with the “average”. For
example, a student will be happy with an examination result if it lies above the class average; if the class average is considerably higher than the student's score, the student may not feel happy even with a seemingly good mark.
the use of “average” values in common life are mean body height, mean
temperature in July in some town, the most often selected study subject, the most
popular TV show in 2015, and average income. Various statistical concepts refer
to the “average” of the data, but the right choice depends upon the nature and
scale of the data as well as the objective of the study. We call statistical functions
which describe the average or centre of the data location parameters or
measures of central tendency .

3.1.1 Arithmetic Mean


The arithmetic mean is one of the most intuitive measures of central tendency.
Suppose a variable of size n consists of the values x_1, x_2, …, x_n. The arithmetic mean of this data is defined as

x̄ = (1/n) Σ_{i=1}^{n} x_i .   (3.1)
In informal language, we often speak of “the average” or just “the mean” when
using the formula (3.1).
To calculate the arithmetic mean for grouped data, we need the following
frequency table:

Class intervals   K_1   K_2   ...   K_k
Absolute freq.    n_1   n_2   ...   n_k
Relative freq.    f_1   f_2   ...   f_k

Note that K_1, K_2, …, K_k are the k class intervals and each interval K_j contains n_j observations with Σ_{j=1}^{k} n_j = n. The relative frequency of the jth class is f_j = n_j/n and Σ_{j=1}^{k} f_j = 1. The mid-value of the jth class interval is defined as m_j = (e_{j−1} + e_j)/2, which is the mean of the lower and upper limits of the interval. The weighted arithmetic mean for grouped data is defined as

x̄ = (1/n) Σ_{j=1}^{k} n_j m_j = Σ_{j=1}^{k} f_j m_j .   (3.2)

Example 3.1.1
Consider again Example 3.0.1 where we looked at the temperature in Bangkok
during December. The measurements were

The arithmetic mean is therefore

x̄ = 821/31 ≈ 26.48 °C.
In R, the arithmetic mean can be calculated using the mean command:
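For illustration, if the 31 temperature values were stored in a vector called weather (an assumed name), the call would simply be:
mean(weather)   # about 26.48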

Let us assume the data in Example 3.0.1 is summarized in categories as


follows:

Class intervals        (20, 25]       (25, 30]       (30, 35]
Absolute frequencies   n_1 = 12       n_2 = 18       n_3 = 1
Relative frequencies   f_1 = 12/31    f_2 = 18/31    f_3 = 1/31

We can calculate the (weighted) arithmetic mean as

x̄ = Σ_{j=1}^{k} f_j m_j = (12/31) · 22.5 + (18/31) · 27.5 + (1/31) · 32.5 ≈ 25.73.
In R, we use the weighted.mean function to obtain the result. The function requires us to specify the (hypothesized) means for each group, for example the middle values of the class intervals, as well as the weights.
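A sketch using the class mid-points and the relative frequencies derived above:
# weighted mean of the grouped temperature data
weighted.mean(x = c(22.5, 27.5, 32.5), w = c(12, 18, 1)/31)   # about 25.73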

Interestingly, the results of the mean and the weighted mean differ. This is
because we use the middle of each class as an approximation of the mean within
the class. The implication is that we assume that the values are uniformly
distributed within each interval. This assumption is obviously not met. If we had
knowledge about the mean in each class, like in this example, we would obtain
the correct result as follows:

x̄ = (1/31)(12 · 23.83 + 18 · 28 + 1 · 31) ≈ 26.48,

where 23.83, 28, and 31 are the arithmetic means of the observations within the three classes.
However, the weighted mean is meant to estimate the arithmetic mean in


those situations where only grouped data is available. It is therefore typically
used to obtain an approximation of the true mean.

Properties of the Arithmetic Mean.

(i) The sum of the deviations of the observations around the arithmetic mean is zero:

Σ_{i=1}^{n} (x_i − x̄) = 0.   (3.3)

(ii) If the data is linearly transformed as y_i = a + b x_i, where a and b are known constants, it holds that

ȳ = a + b x̄.   (3.4)

Example 3.1.2
Recall Examples 3.0.1 and 3.1.1 where we considered the temperatures in
December in Bangkok. We measured them in degrees Celsius, but someone from
the USA might prefer to know them in degrees Fahrenheit. With a linear transformation, we can create a new temperature variable as y_i = 32 + 1.8 x_i. Using ȳ = a + b x̄, we get ȳ = 32 + 1.8 · 26.48 ≈ 79.7 °F.

3.1.2 Median and Quantiles


The median is the value which divides the observations into two equal parts such
that at least 50 % of the values are greater than or equal to the median and at
least 50 % of the values are less than or equal to the median. The median is denoted by x̃_{0.5}; then, in terms of the empirical cumulative distribution function, the condition F(x̃_{0.5}) ≥ 0.5 is satisfied. Consider the n observations x_1, x_2, …, x_n which can be ordered as x_(1) ≤ x_(2) ≤ … ≤ x_(n). The calculation of the median depends on whether the number of observations n is odd or even. When n is odd, then x̃_{0.5} is the middle ordered value. When n is even, then x̃_{0.5} is the arithmetic mean of the two middle ordered values:

x̃_{0.5} = x_((n+1)/2)                  if n is odd,
x̃_{0.5} = ½ (x_(n/2) + x_(n/2 + 1))    if n is even.   (3.5)

Example 3.1.3
Consider again Examples 3.0.1–3.1.2 where we evaluated the temperature in
Bangkok in December. The ordered values x_(i), i = 1, 2, …, 31, are as follows:

C 21 22 22 23 24 24 25 25 25 25 25 25 26 26 26 26
(i) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
C 27 27 27 28 28 28 29 29 29 29 29 30 30 30 31
(i) 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
We have n = 31, and therefore x̃_{0.5} = x_((31+1)/2) = x_(16) = 26. Therefore, at least 50 % of the 31 observations are greater than or equal to 26 and at least 50 % are less than or equal to 26. If one value was missing, let us say the last observation, then the median would be calculated as ½ (x_(15) + x_(16)) = ½ (26 + 26) = 26. In R, we would have obtained the results
using the median command:
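Again assuming the vector name weather:
median(weather)   # 26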

If we deal with grouped data, we can calculate the median under the assumption
that the values within each class are equally distributed. Let K_1, K_2, …, K_k be k classes with observations of size n_1, n_2, …, n_k, respectively. First, we need to determine which class is the median class, i.e. the class that includes the median. We define the median class as the class K_m for which

F(e_{m−1}) < 0.5  and  F(e_m) ≥ 0.5   (3.6)

hold, where F denotes the empirical cumulative distribution function and e_{m−1}, e_m are the lower and upper limits of K_m. Then, we can determine the median as

x̃_{0.5} = e_{m−1} + ((0.5 − F(e_{m−1}))/f_m) · d_m ,   (3.7)

where e_{m−1} denotes the lower limit of the interval K_m and d_m is the width of the interval K_m.

Example 3.1.4
Recall Example 3.1.1 where we looked at the grouped temperature data:

Class intervals (20–25] (25, 30] (30, 35]

0 1 1

For the third class ( ), we have


We can therefore calculate the median as

Comparing the Mean with the Median. In the above examples, the mean and
the median turn out to be quite similar to each other. This is because we looked
at data which is symmetrically distributed around its centre, i.e. on average, we
can expect 26 with deviations that are similar above and below the average
temperature. A similar example is given in Fig. 3.1a: we see that the raw data is
summarized by using ticks at the bottom of the graph and by using a kernel
density estimator. The mean and the median are similar here because the
distribution of the observations is symmetric around the centre. If we have
skewed data (Fig. 3.1b), then the mean and the median may differ. If the data has
more than one centre, such as in Fig. 3.1c, neither the median nor the mean has
meaningful interpretations. If we have outliers (Fig. 3.1d), then it is wise to use
the median because the mean is sensitive to outliers. These examples show that
depending on the situation of interest either the mean, the median, both or
neither of them can be useful.
Fig. 3.1 Arithmetic mean and median for different data

Quantiles. Quantiles are a generalization of the idea of the median. The


median is the value which splits the data into two equal parts. Similarly, a
quantile partitions the data into other proportions. For example, a 25 %-quantile
splits the data into two parts such that at least 25 % of the values are less than or
equal to the quantile and at least 75 % of the values are greater than or equal to
the quantile. In general, let α be a number between zero and one. The α-quantile, denoted as x̃_α, is defined as the value which divides the data in proportions of α · 100 % and (1 − α) · 100 % such that at least α · 100 % of the values are less than or equal to the quantile and at least (1 − α) · 100 % of the values are greater than or equal to the quantile. In terms of the empirical cumulative distribution function, we can write F(x̃_α) ≥ α. It follows immediately that for n observations, at least nα values are less than or equal to x̃_α and at least n(1 − α) observations are greater than or equal to x̃_α. The median is the 50 %-quantile x̃_{0.5}. If α takes the values 0.1, 0.2, …, 0.9, the quantiles are called deciles. If α · 100 is an integer number (e.g. α = 0.95), the quantiles are called percentiles, i.e. the data is divided into 100 equal parts. If α takes the values 0.2, 0.4, 0.6, and 0.8, the quantiles are known as quintiles and they divide the data into five equal parts. If α takes the values 0.25, 0.5, and 0.75, the quantiles are called quartiles.
Consider n ordered observations x_(1) ≤ x_(2) ≤ … ≤ x_(n). The α-quantile x̃_α is calculated as

x̃_α = x_(k),  with k the smallest integer greater than nα,   if nα is not an integer,
x̃_α = ½ (x_(nα) + x_(nα + 1))                                 if nα is an integer.   (3.8)

Example 3.1.5
Recall Examples 3.0.1–3.1.4 where we evaluated the temperature in Bangkok in
December. The ordered values x_(i), i = 1, 2, …, 31, are as follows:

C 21 22 22 23 24 24 25 25 25 25 25 25 26 26 26 26
(i) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
C 27 27 27 28 28 28 29 29 29 29 29 30 30 30 31
(i) 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

To determine the quartiles, i.e. the 25, 50, and 75 % quantiles, we calculate nα as 31 · 0.25 = 7.75, 31 · 0.5 = 15.5, and 31 · 0.75 = 23.25. Using (3.8), it follows that

x̃_{0.25} = x_(8) = 25,   x̃_{0.5} = x_(16) = 26,   x̃_{0.75} = x_(24) = 29.

In R, we obtain the same results using the quantile function. The probs
argument is used to specify α. By default, the quartiles are reported.
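A sketch, again assuming the vector name weather; choosing type = 2 reproduces the rule given in (3.8):
quantile(weather)                                           # default settings
quantile(weather, probs = c(0.25, 0.5, 0.75), type = 2)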

However, please note that R offers nine different ways to obtain quantiles,
each of which can be chosen by the type argument. See Hyndman and Fan
(1996) for more details.

3.1.3 Quantile–Quantile Plots (QQ-Plots)


If we plot the quantiles of two variables against each other, we obtain a
Quantile–Quantile plot (QQ-plot). This provides a simple summary of whether
the distributions of the two variables are similar with respect to their location or
not.

Example 3.1.6
Consider again the pizza data which is described in Appendix A.4. We may be
interested in the delivery time for different drivers to see if their performance is
the same. Figure 3.2a shows a QQ-plot for the delivery time of driver Luigi and
the delivery time of driver Domenico. Each point refers to the % quantile of
both drivers. If the point lies on the bisection line, then they are identical and we
conclude that the quantiles of the both drivers are the same. If the point is below
the line, then the quantile is higher for Luigi, and if the point is above the line,
then the quantile is lower for Luigi. So if all the points lie exactly on the line, we
can conclude that the distributions of both the drivers are the same. We see that
all the reported quantiles lie below the line, which implies that all the quantiles
of Luigi have higher values than those of Domenico. This means that not only on
an average, but also in general, the delivery times are higher for Luigi. If we
look at two other drivers, as displayed in Fig. 3.2b, the points lie very much on
the bisection line. We can therefore conclude that the delivery times of these two
drivers do not differ much.

Fig. 3.2 QQ-plots for the pizza delivery time for different drivers
In R, we can generate QQ-plots by using the qqplot command:
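A sketch, assuming the pizza data provides the vectors time and driver:
# QQ-plot of delivery times for two drivers
qqplot(time[driver == "Luigi"], time[driver == "Domenico"])
abline(a = 0, b = 1)   # add the bisection line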

As a summary, let us consider four important patterns:

(a) If all the pairs of quantiles lie (nearly) on a straight line at an angle of 45° to the x-axis, then the two samples have similar distributions (Fig. 3.3a).

(b) If the y-quantiles are lower than the x-quantiles, then the y-values have a
tendency to be lower than the x-values (Fig. 3.3b).

(c) If the x-quantiles are lower than the y-quantiles, then the x-values have a
tendency to be lower than the y-values (Fig. 3.3c).

(d) If the QQ-plot is like Fig. 3.3d, it indicates that there is a break point up to
which the y-quantiles are lower than the x-quantiles and after that point, the
y-quantiles are higher than the x-quantiles.

Fig. 3.3 Different patterns for a QQ-plot

3.1.4 Mode
Consider a situation in which an ice cream shop owner wants to know which
flavour of ice cream is the most popular among his customers. Similarly, a
footwear shop owner may like to find out what design and size of shoes are in
highest demand. To answer this type of questions, one can use the mode which is
another measure of central tendency.
The mode x̄_M of n observations x_1, x_2, …, x_n is the value which occurs the most compared with all other values, i.e. the value which has maximum absolute frequency. It may happen that two or more values occur with the same frequency in which case the mode is not uniquely defined. A formal definition of the mode is

x̄_M = a_j  ⇔  n_j = max{n_1, n_2, …, n_k},   (3.9)

where a_1, a_2, …, a_k denote the different values (or categories) of the variable and n_1, n_2, …, n_k their absolute frequencies.
The mode is typically applied to any type of variable for which the number
of different values is not too large. If continuous data is summarized in groups,
then the mode can be used as well.

Example 3.1.7
Recall the pizza data set described in Appendix A.4. The pizza delivery service
has three branches, in the East, West, and Centre, respectively. Suppose we want
to know which branch delivers the most pizzas. We find that most of the
deliveries have been made in the West, see Fig. 3.4a; therefore the mode is x̄_M = West. Similarly, suppose we also want to find the mode for the categorized pizza delivery time: if we group the delivery time in intervals of 5 min, then we see that the most frequent delivery time is the interval “(30, 35]” min, see Fig. 3.4b. The mode is therefore the class (30, 35].

Fig. 3.4 Results from the pizza data set

3.1.5 Geometric Mean


Consider n observations x_1, x_2, …, x_n which are all positive and collected on a quantitative variable. The geometric mean x̄_G of this data is defined as

x̄_G = (x_1 · x_2 · … · x_n)^(1/n) = (∏_{i=1}^{n} x_i)^(1/n).   (3.10)

The geometric mean plays an important role in fields where we are interested in
products of observations, such as when we look at percentage changes in
quantities. We illustrate its interpretation and use by looking at the average
growth of a quantity in the sense that we allow a starting value, such as a certain
amount of money or a particular population, to change over time. Suppose we
have a starting value at some baseline time point 0 (zero), which may be denoted as B_0. At time t, this value may have changed and we therefore denote it as B_t. The ratio of B_t and B_{t−1},

x_t = B_t / B_{t−1},

is called the growth factor. The growth rate r_t is defined as

r_t = (x_t − 1) · 100 %

and gives us an idea about the growth or decline of our value at time t. We can
summarize these concepts in the following table:

Time Inventory Growth factor Growth rate


t

0 – –

We can calculate B_t (t = 1, 2, …, T) by using the growth factors:

B_t = B_0 · x_1 · x_2 · … · x_t.

The average growth factor from B_0 to B_T is the geometric mean or geometric average of the growth factors:

x̄_G = (x_1 · x_2 · … · x_T)^(1/T).   (3.11)

Therefore, B_t at time t can be calculated as B_t = B_0 · x̄_G^t.

Example 3.1.8
Suppose someone wants to deposit money, say €1000, in a bank. The bank
advisor proposes a 5-year savings plan with the following plan for interest rates:
1 % in the first year, 1.5 % in the second year, 2.5 % in the third year, and 3 % in
the last 2 years. Now he would like to calculate the average growth factor and
average growth rate for the invested money. The concept of the geometric mean
can be used as follows:

Year Euro Growth factor Growth rate (%)


0 1000 – –
1 1010 1.01 1.0
2 1025.15 1.015 1.5
3 1050.78 1.025 2.5
4 1082.30 1.03 3.0
5 1114.77 1.03 3.0

The geometric mean is calculated as

x̄_G = (1.01 · 1.015 · 1.025 · 1.03 · 1.03)^(1/5) ≈ 1.022,

which means that he will have on average about 2.2 % growth per year. The savings after 5 years can be calculated as

€1000 · x̄_G^5 ≈ €1114.77.
It is easy to compare two different saving plans with different growth


strategies using the geometric mean.
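A small sketch of this calculation:
growth <- c(1.01, 1.015, 1.025, 1.03, 1.03)   # growth factors of the 5 years
prod(growth)^(1/length(growth))               # geometric mean, about 1.022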

3.1.6 Harmonic Mean


The harmonic mean is typically used whenever different x_i contribute to the mean with a different weight w_i, i.e. when we implicitly assume that the weight of each x_i is not one. It can be calculated as

x̄_H = (Σ_{i=1}^{n} w_i) / (Σ_{i=1}^{n} w_i / x_i).   (3.12)

For example, when calculating the average speed, each weight relates to the relative distance travelled with speed x_i. Using relative weights with Σ_{i=1}^{n} w_i = 1, the harmonic mean can be written as

x̄_H = 1 / (Σ_{i=1}^{n} w_i / x_i).   (3.13)

Example 3.1.9
Suppose an investor bought shares worth €1000 for two consecutive months.
The price for a share was €50 in the first month and €200 in the second month.
What is the average purchase price? The number of shares purchased in the first month is 1000/50 = 20. The number of shares purchased in the second month is 1000/200 = 5. The total number of shares purchased is thus 20 + 5 = 25, and the total investment is €2000. It is evident that the average purchase price is 2000/25 = €80. This is in fact the harmonic mean calculated as

x̄_H = 1 / (0.5/50 + 0.5/200) = 80,

because the weight of each purchase is w_i = 1000/2000 = 0.5. If the investment was €1200 in the first month and €800 in the second month, then we could use the harmonic mean with weights w_1 = 1200/2000 = 0.6 and w_2 = 800/2000 = 0.4, respectively, to obtain the results.
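A small sketch of this calculation:
price  <- c(50, 200)    # price per share in the two months
weight <- c(0.5, 0.5)   # equal investments of 1000 each
1/sum(weight/price)     # harmonic mean, 80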

3.2 Measures of Dispersion


Measures of central tendency, as introduced earlier, give us an idea about the
location where most of the data is concentrated. However, two different data sets
may have the same value for the measure of central tendency, say the same
arithmetic means, but they may have different concentrations around the mean.
In this case, the location measures may not be adequate enough to describe the
distribution of the data. The concentration or dispersion of observations around
any particular value is another property which characterizes the data and its
distribution. We now introduce statistical methods which describe the
variability or dispersion of data.
Example 3.2.1
Suppose three students Christine, Andreas, and Sandro arrive at different times
in the class to attend their lectures. Let us look at their arrival time in the class
after or before the starting time of lecture, i.e. let us look how early or late they
were (in minutes).

Week 1 2 3 4 5 6 7 8 9 10
Christine 0 0 0 0 0 0 0 0 0 0
Andreas

Sandro 3 5 6 2 4 6 8 4 5 7

We see that Christine always arrives on time (time difference of zero).


Andreas arrives sometimes 10 min early and sometimes 10 min late. However,
the arithmetic mean of both students is the same—on average, they both arrive
on time! This interpretation is obviously not meaningful. The difference between
both students is the variability in arrival times that cannot be measured with the
mean or median. For this reason, we need to introduce measures of dispersion
(variability). With the knowledge of both location and dispersion, we can give a
much more nuanced comparison between the different arrival times. For
example, consider the third student Sandro. He is always late; sometimes more,
sometimes less. However, while on average he comes late, his behaviour is more
predictable than that of Andreas. Both location and dispersion are needed to give
a fair comparison.

Example 3.2.2
Consider another example in which a supplier for the car industry needs to
deliver 10 car doors with an exact width of 1.00 m. He supplies 5 doors with a
width of 1.05 m and the remaining 5 doors with a width of 0.95 m. The
arithmetic mean of all the 10 doors is 1.00 m. Based on the arithmetic mean, one
may conclude that all the doors are good but the fact is that none of the doors are
usable as they will not fit into the car. This knowledge can be summarized by a
measure of dispersion.

The above examples highlight that the distribution of a variable needs to be


characterized by a measure of dispersion in addition to a measure of location
(central tendency). Now we introduce various measures of dispersion.

3.2.1 Range and Interquartile Range


Consider a variable X with n observations x_1, x_2, …, x_n. Order these n observations as x_(1) ≤ x_(2) ≤ … ≤ x_(n). The range is a measure of dispersion defined as the difference between the maximum and minimum value of the data as

R = x_(n) − x_(1).   (3.14)

The interquartile range is defined as the difference between the 75 % and 25 % quartiles as

d_Q = x̃_{0.75} − x̃_{0.25}.   (3.15)
It covers the centre of the distribution and contains 50 % of the observations.

Remark 3.2.1
Note that the interquartile range is defined as the interval [x̃_{0.25}; x̃_{0.75}] in some literature. However, in line with most of the statistical literature, we define the interquartile range to be a measure of dispersion, i.e. the difference between x̃_{0.75} and x̃_{0.25}.

Example 3.2.3
Recall Examples 3.0.1–3.1.5 where we looked at the temperature in Bangkok
during December. The ordered values x_(i), i = 1, 2, …, 31, are as follows:

C 21 22 22 23 24 24 25 25 25 25 25 25 26 26 26 26
(i) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
C 27 27 27 28 28 28 29 29 29 29 29 30 30 30 31
(i) 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

We obtained the quantiles in Example 3.1.5 as x̃_{0.25} = 25 and x̃_{0.75} = 29. The interquartile range is therefore d_Q = 29 − 25 = 4, which means that 50 % of the data is centred between 25 °C and 29 °C. The range is R = 31 − 21 = 10 °C, meaning that the temperature is varying at most by 10 °C. In R, there are several ways to obtain quartiles, minimum and maximum values, e.g. by using min, max, quantile, and range, among others. All numbers can be easily obtained by the summary command which we recommend using.
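Again assuming the vector name weather:
summary(weather)   # minimum, quartiles, median, mean, maximum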
3.2.2 Absolute Deviation, Variance, and Standard Deviation
Another measure of dispersion is the variance. The variance is one of the most
important measures in statistics and is needed throughout this book. We use the
idea of “absolute deviation” to give some more background and motivation for
understanding the variance as a measure of dispersion, followed by some
examples.
Consider the deviations of n observations around a certain value “A” and combine them together, for instance, via the arithmetic mean of all the deviations:

D(A) = (1/n) Σ_{i=1}^{n} (x_i − A).   (3.16)

This measure has the drawback that the deviations (x_i − A), i = 1, 2, …, n, can be either positive or negative and, consequently, their sum can potentially be very small or even zero. Using D as a measure of variability is therefore not a good idea since D may be small even for a large variability in the data.
Using absolute values of the deviations solves this problem, and we introduce the following measure of dispersion:

D(A) = (1/n) Σ_{i=1}^{n} |x_i − A|.   (3.17)

It can be shown that the absolute deviation attains its minimum when A corresponds to the median x̃_{0.5} of the data:

D(x̃_{0.5}) = (1/n) Σ_{i=1}^{n} |x_i − x̃_{0.5}|.   (3.18)

We call D(x̃_{0.5}) the absolute median deviation. When A = x̄, we speak of the absolute mean deviation given by

D(x̄) = (1/n) Σ_{i=1}^{n} |x_i − x̄|.   (3.19)
Another solution to avoid the positive and negative signs of deviation in (3.16) is to consider the squares of deviations (x_i − A)², rather than using the absolute value. This provides another measure of dispersion as

s²(A) = (1/n) Σ_{i=1}^{n} (x_i − A)²,   (3.20)

which is known as the mean squared error (MSE) with respect to A. The MSE is another important measure in statistics, see Chap. 9, Eq. (9.4), for details. It can be shown that s²(A) attains its minimum value when A = x̄. This is the (sample) variance

s̃² = (1/n) Σ_{i=1}^{n} (x_i − x̄)².   (3.21)

After expanding s̃², we can write (3.21) as

s̃² = (1/n) Σ_{i=1}^{n} x_i² − x̄².   (3.22)

The positive square root of the variance is called the (sample) standard deviation, defined as

s̃ = ( (1/n) Σ_{i=1}^{n} (x_i − x̄)² )^(1/2).   (3.23)
The standard deviation has the same unit of measurement as the data whereas
the unit of the variance is the square of the units of the observations. For
example, if X is weight, measured in kg, then x̄ and s̃ are also measured in kg, while s̃² is measured in kg² (which may be more difficult to interpret). The
variance is a measure which we use in other chapters to obtain measures of
association between variables and to draw conclusions from a sample about a
population of interest; however, the standard deviation is typically preferred for a
descriptive summary of the dispersion of data.
The standard deviation measures how much the observations vary or how
they are dispersed around the arithmetic mean. A low value of the standard
deviation indicates that the values are highly concentrated around the mean. A
high value of the standard deviation indicates lower concentration of the
observations around the mean, and some of the observed values may even be far
away from the mean. If there are extreme values or outliers in the data, then the
arithmetic mean is more sensitive to outliers than the median. In such a case, the
absolute median deviation (3.18) may be preferred over the standard deviation.

Example 3.2.4
Consider again Example 3.2.1 where we evaluated the arrival times of Christine,
Andreas, and Sandro in their lecture. Using the arithmetic mean, we concluded
that both Andreas and Christine arrive on time, whereas Sandro is always late;
however, we saw that the variation of arrival times differs substantially among
the three students. To describe and quantify this variability formally, we
calculate the variance and absolute median deviation:

We observe that the variation/dispersion/variability is the lowest for Christine


and highest for Andreas. Both median absolute deviation and variance allow a
comparison between the two students. If we take the square root of the variance,
we obtain the standard deviation. For example, , which means that
the average difference of the observations from the arithmetic mean is 1.8.
In R, we can use the var command to calculate the variance. However, note
that R uses 1/(n − 1) instead of 1/n in calculating the variance. The idea behind the multiplication by 1/(n − 1) in place of 1/n is discussed in Chap. 9, see also Theorem 9.2.1.
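A sketch (x denotes an arbitrary numeric vector):
var(x)                               # uses 1/(n-1)
var(x) * (length(x) - 1)/length(x)   # variance as in (3.21), using 1/n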

Variance for Grouped Data. The variance for grouped data can be calculated using

s̃²_b = (1/n) Σ_{j=1}^{k} n_j (m_j − x̄)²,   (3.24)

where m_j is the middle value of the jth interval. However, when the data is artificially grouped and the knowledge about the original ungrouped data is available, we can also use the arithmetic mean x̄_j of the jth class:

s̃²_b = (1/n) Σ_{j=1}^{k} n_j (x̄_j − x̄)².   (3.25)
The two expressions (3.24) and (3.25) represent the variance between the different classes, i.e. they describe the variability of the class-specific means x̄_j, weighted by the size of each class n_j, around the overall mean x̄. It is evident that the variance within each class is not taken into account in these formulae. The variability of measurements in each class, i.e. the variability of the observations within class K_j, is another important component to determine the overall variance in the data. It is therefore not surprising that using only the between variance s̃²_b will underestimate the total variance and therefore

s̃²_b ≤ s̃².   (3.26)

If the data within each class is known, we can use the Theorem of Variance Decomposition (see p. 136 for the theoretical background) to determine the variance. This allows us to represent the total variance as the sum of the variance between the different classes and the variance within the different classes as

s̃² = s̃²_b + s̃²_w = (1/n) Σ_{j=1}^{k} n_j (x̄_j − x̄)² + (1/n) Σ_{j=1}^{k} n_j s̃²_j.   (3.27)

In (3.27), s̃²_j is the variance of the jth class:

s̃²_j = (1/n_j) Σ_{x_i ∈ K_j} (x_i − x̄_j)².   (3.28)
The proof of (3.27) is given in Appendix C.1, p. 423.

Example 3.2.5
Recall the weather data used in Examples 3.0.1–3.2.3 and the grouped data
specified as follows:

Class intervals (20–25] (25, 30] (30, 35]

– 23.83 28 31 –
– 1.972 2 0 –

We know that x̄ = 26.48 and n = 31. The first step is to calculate the mean and variances in each class using (3.28). We then obtain x̄_j and s̃²_j as listed above.
The within and between variances are as follows:

The total variance is therefore . Estimating the variance using all 31


observations would yield the same results. However, it becomes clear that
without knowledge about the variance within each class, we cannot reliably
estimate . In the above example, the variance between the classes is 3 times
lower than the total variance which is a serious underestimation.

Linear Transformations. Let us consider a linear transformation y_i = a + b x_i of the original data x_i (i = 1, 2, …, n). We get the arithmetic mean of the transformed data as ȳ = a + b x̄ and for the variance:

s̃²_y = b² s̃²_x.   (3.29)

Example 3.2.6
Let x_1, x_2, …, x_n denote measurements on time. These data could have been recorded and analysed in hours, but we may be interested in a summary in minutes. We can make a linear transformation y_i = 60 x_i. Then, ȳ = 60 x̄ and s̃²_y = 60² s̃²_x. If the mean and variance of the x_i's have already been obtained, then the mean and variance of the y_i's can be obtained directly using these transformations.

Standardization. A variable is called standardized if its mean is zero and its variance is 1. Standardization can be achieved by using the following transformation:

y_i = (x_i − x̄) / s̃.   (3.30)

It follows that ȳ = 0 and s̃²_y = 1. There are many
statistical methods which require standardization, see, for example, Sect. 10.3.1
for details in the context of statistical tests.

Example 3.2.7
Let X be a variable which measures air pollution by using the concentration of
atmospheric particulate matter (in ). Suppose we have the following 10
measurements:

We calculate , , and . To get a standardized


variable Y, we transform all the observations ’s as

Now , , , are the


standardized observations. The scale command in R allows standardization,
and we can obtain the standardized observations corresponding to the 10
measurements as

Please note that the scale command uses 1/(n − 1) for calculating the variance (and hence the standard deviation), as already outlined above. Thus, the results provided by scale are not identical to those using (3.30).
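A sketch contrasting (3.30) with scale (x denotes an arbitrary numeric vector):
n <- length(x)
(x - mean(x))/sqrt(var(x) * (n - 1)/n)   # standardization as in (3.30)
scale(x)                                 # uses the standard deviation based on 1/(n-1)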

3.2.3 Coefficient of Variation


Consider a situation where two different variables have arithmetic means and
with standard deviations and , respectively. Suppose we want to compare
the variability of hotel prices in Munich (measured in euros) and London
(measured in British pounds). How can we provide a fair comparison? Since the
prices are measured in different units, and therefore likely have arithmetic means
which differ substantially, it does not make much sense to compare the standard
deviations directly. The coefficient of variation v is a measure of dispersion
which uses both the standard deviation and mean and thus allows a fair
comparison. It is properly defined only when all the values of a variable are
measured on a ratio scale and are positive such that x̄ > 0 holds. It is defined as

v = s̃ / x̄.   (3.31)
The coefficient of variation is a unit-free measure of dispersion. It is often used
when the measurements of two variables are different but can be put into relation
by using a linear transformation y_i = b x_i. It is possible to show that if all values of a variable X are transformed into a variable Y with values y_i = b x_i, b > 0, then v does not change.

Example 3.2.8
If we want to compare the variability of hotel prices in two selected cities in
Germany and England, we could calculate the mean prices, together with their
standard deviation. Suppose a sample of prices of say 100 hotels in two selected
cities in Germany and England is available and suppose we obtain the mean and
standard deviations of the two cities as €130, , €99, and
. Then, and . This indicates higher
variability in hotel prices in England. However, if the data distribution is skewed
or bimodal, then it may be wise not to choose the arithmetic mean as a measure
of central tendency and likewise the coefficient of variation.

3.3 Box Plots


So far we have described various measures of central tendency and dispersion. It
can be tedious to list those measures in summary tables. A simple and powerful
graph is the box plot which summarizes the distribution of a continuous (or
sometimes an ordinal) variable by using its median, quartiles, minimum,
maximum, and extreme values.
Figure 3.5a shows a typical box plot. The vertical length of the box is the interquartile range d_Q = x̃_{0.75} − x̃_{0.25}, which shows the region that contains 50 % of
the data. The bottom end of the box refers to the first quartile, and the top end of
the box refers to the third quartile. The thick line in the box is the median. It
becomes immediately clear that the box indicates the symmetry of the data: if
the median is in the middle of the box, the data should be symmetric, otherwise
it is skewed. The whiskers at the end of the plot mark the minimum and
maximum values of the data. Looking at the box plot as a whole tells us about
the data distribution and the range and variability of observations. Sometimes, it
may be advisable to understand which values are extreme in the sense that they
are “far away” from the centre of the distribution. In many software packages,
including R, values are defined to be extreme if they are greater than 1.5 box
lengths away from the first or third quartile. Sometimes, they are called outliers.
Outliers and extreme values are defined differently in some software packages
and books.

The boxplot command in R draws a box plot. The range option controls
whether extreme values should be plotted, and if yes, how one wants to define
such values.
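A sketch (assuming the vector time from the pizza data):
boxplot(time)              # extreme values plotted beyond 1.5 box lengths
boxplot(time, range = 0)   # whiskers extend to minimum and maximum, no extreme values marked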

Example 3.3.1
Recall Examples 3.0.1–3.2.5 where we looked at the temperature in Bangkok
during December. We have already calculated the median (26 °C) and the quartiles (25 °C, 29 °C). The minimum and maximum values are 21 °C and 31 °C. The box plot for this data is shown in Fig. 3.5a. One can see that the temperature distribution is slightly skewed with more variability for lower temperatures. The interquartile range is 4, and therefore, any value greater than 29 + 1.5 · 4 = 35 or smaller than 25 − 1.5 · 4 = 19 would be an extreme value. However, there are no extreme values in the data.

Example 3.3.2
Consider again the pizza data described in Appendix A.4. We use R to plot the
box plot for the delivery time via boxplot(time) (Fig. 3.5b). We see a
symmetric distribution with a median delivery time of about 35 min. Most of the
deliveries took between 30 and 40 min. The extreme values indicate that there
were some exceptionally short and long delivery times.

3.4 Measures of Concentration


A completely different concept used to describe a quantitative variable is the
idea of concentration. For a variable X, it summarizes the proportion of each
observation with respect to the sum of all observations . Let us look at a
simple example to demonstrate its usefulness.

Example 3.4.1
Consider a village with 5 farms. Each farmer has a farm of a certain size. How
can we evaluate the land distribution? Do all farmers have a similar amount of
land or do one or two farmers have a big advantage because they have
considerably more space?
Table 3.1 Concentration of farmland: two different situations

Farmer (i)
(Area, in hectare)

1 20
2 20
3 20
4 20
5 20

Farmer (i) (Area, in hectare)

1 0
2 0
3 0
4 0
5 100

Table 3.1 shows two different situations: in the table on the left, we see an
equal distribution of land, i.e. each farmer owns 20 hectares of farmland. This
means X is not concentrated, rather it is equally distributed. A statistical function
describing the concentration could return a value of zero in such a case. Consider
another extreme where one farmer owns all the farmland and the others do not
own anything, as shown on the right side of Table 3.1. This is an extreme
concentration of land: one person owns everything and thus, we say the
concentration is high. A statistical function describing the concentration could
return a value of one in such a case.

3.4.1 Lorenz Curve


The Lorenz curve is a popular method to display concentrations graphically.
Consider n observations x_1, x_2, …, x_n of a variable X. Assume that all the observations are positive. The sum of all the observations is Σ_{i=1}^{n} x_i if the data is ungrouped. First, we need to order the data: x_(1) ≤ x_(2) ≤ … ≤ x_(n). To plot the Lorenz curve, we need

u_i = i/n,  i = 0, 1, …, n,   (3.32)

and

v_i = (Σ_{j=1}^{i} x_(j)) / (Σ_{j=1}^{n} x_(j)),  i = 1, 2, …, n;  v_0 = 0,   (3.33)

where Σ_{j=1}^{i} x_(j) is the cumulative total of observations up to the ith observation. The idea is that the v_i describe the contribution of all values x_(j), j ≤ i, in comparison with the sum of all values. Plotting v_i against u_i for all i shows how much the sum of all x_(j), for all observations j ≤ i, contributes to the total sum. In other words, the point (u_i, v_i) says that u_i · 100 % of observations contain v_i · 100 % of the sum of all values less than or equal to x_(i). Obviously, if all x_i are identical, the Lorenz curve will be a straight diagonal line, also known as the identity line or line of equality. If the x_i are of different sizes, then the Lorenz curve falls below the line of equality. This is illustrated in the following example.

Example 3.4.2
Recall Example 3.4.1 where we looked at the distribution of farmland among 5
farmers. On the upper panel of Table 3.1, we observed an equal distribution of
land among the farmers: x_1 = 20, x_2 = 20, x_3 = 20, x_4 = 20, and x_5 = 20. We obtain (u_1, …, u_5) = (0.2, 0.4, 0.6, 0.8, 1) and (v_1, …, v_5) = (0.2, 0.4, 0.6, 0.8, 1). This yields a Lorenz curve as displayed on the left side of Fig. 3.5: there is no concentration. We can interpret each point. For example, (u_2, v_2) = (0.4, 0.4) means that 40 % of farmers own 40 % of the land.
The lower panel of Table 3.1 describes the situation with strong concentration. For this table, we obtain (u_1, …, u_5) = (0.2, 0.4, 0.6, 0.8, 1) and (v_1, …, v_5) = (0, 0, 0, 0, 1). Therefore, for example, 80 % of farmers own 0 % of the
land which shows strong inequality. Most often we do not have such extreme
situations. In this case, the Lorenz curve is bent towards the lower right corner of
the plot, see the right side of Fig. 3.5.
We can plot the Lorenz curve in R using the Lc command in the library
ineq . The Lorenz curve for the left table of Example 3.4.1 is plotted in R as
follows:
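A sketch of this plot:
library(ineq)
x <- c(20, 20, 20, 20, 20)   # equal farmland distribution
plot(Lc(x))                  # Lorenz curve: the line of equality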
Fig. 3.5 Lorenz curves for no concentration (left) and some concentration (right)

We can use the same approach as above to obtain the Lorenz curve when we
have grouped data. We simply describe the contributions for each class rather
than for each observation and approximate the values in each class by using its
mid-point. More formally we can write:

(3.34)
and

(3.35)

3.4.2 Gini Coefficient


We have seen in Sect. 3.4.1 that the Lorenz curve corresponds to the identity
line, that is the diagonal line of equality, for no concentration. When there is
some concentration, then the curve deviates from this line. The amount of
deviation depends on the strength of concentration. Suppose we want to design a
measure of concentration which is 0 for no concentration and 1 for perfect (i.e.
extreme) concentration. We can simply measure the area between the Lorenz
curve and the identity line and multiply it by 2. For no concentration, the area
will be zero and hence the measure will be zero. If there is perfect concentration,
then the curve will coincide with the axes, the area will be close to 0.5, and twice
the area will be close to one. The measure based on such an approach is called
the Gini coefficient:

G = 2 · F.   (3.36)

Note that F is the area between the Lorenz curve and the bisection or diagonal line.

Fig. 3.6 Lorenz curve and the Gini coefficient

The Gini coefficient can be estimated by adding up the areas F_i of the trapeziums as displayed in Fig. 3.6:

G = 2 (0.5 − Σ_{i=1}^{n} F_i),

where

F_i = ½ (u_i − u_{i−1}) (v_{i−1} + v_i)

is the area of the ith trapezium below the Lorenz curve. It can be shown that this corresponds to

G = 1 − (1/n) Σ_{i=1}^{n} (v_{i−1} + v_i),   (3.37)

but the proof is omitted. The same formula can be used for grouped data except that ṽ is used instead of v. Since

0 ≤ G ≤ (n − 1)/n,   (3.38)

one may prefer to use the standardized Gini coefficient

G⁺ = (n / (n − 1)) G,   (3.39)

which takes a maximum value of 1.
Example 3.4.3
We return to our farmland example. Suppose we have 7 farmers with farms of
different sizes:

Farmer 1 2 3 4 5 6 7
Farmland size 20 14 59 9 36 23 3

Using the ordered values, we can calculate u_i and v_i using (3.32) and (3.33):

i   x_(i)   u_i = i/7   v_i
1   3       0.1429      3/164 ≈ 0.0183
2   9       0.2857      12/164 ≈ 0.0732
3   14      0.4286      26/164 ≈ 0.1585
4   20      0.5714      46/164 ≈ 0.2805
5   23      0.7143      69/164 ≈ 0.4207
6   36      0.8571      105/164 ≈ 0.6402
7   59      1.0000      164/164 = 1
Fig. 3.7 Lorenz curve for Example 3.4.3

The Lorenz curve is displayed in Fig. 3.7. Using this information, it is easy to calculate the Gini coefficient:

G = 1 − (1/7)(0.0183 + 0.0915 + 0.2317 + 0.4390 + 0.7012 + 1.0609 + 1.6402) ≈ 0.402.

We know that G ≤ (n − 1)/n = 6/7. To standardize the coefficient, we therefore have to use (3.39):

G⁺ = (7/6) · 0.402 ≈ 0.47.
In R, we can obtain the non-standardized Gini Coefficient using the ineq


function in the library ineq .
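A sketch for the farmland data of Example 3.4.3:
library(ineq)
farm <- c(20, 14, 59, 9, 36, 23, 3)
ineq(farm, type = "Gini")   # non-standardized Gini coefficient, about 0.40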

3.5 Key Points and Further Issues


3.6 Exercises
Exercise 3.1
A hiking enthusiast has a new app for his smartphone which summarizes his
hikes by using a GPS device. Let us look at the distance hiked (in km) and
maximum altitude (in m) for the last 10 hikes:

Distance 12.5 29.9 14.8 18.7 7.6 16.2 16.5 27.4 12.1 17.5
Altitude 342 1245 502 555 398 670 796 912 238 466

(a) Calculate the arithmetic mean and median for both distance and altitude.

(b) Determine the first and third quartiles for both the distance and the altitude
variables. Discuss the shape of the distribution given the results of (a) and
(b).

(c) Calculate the interquartile range, absolute median deviation, and standard
deviation for both variables. What is your conclusion about the variability
of the data?
(d) One metre corresponds to approximately 3.28 ft. What is the average
altitude when measured in feet rather than in metres?

(e) Draw and interpret the box plot for both distance and altitude.

(f) Assume distance is measured as only short (5–15 km), moderate (15–
20 km), and long (20–30 km). Summarize the grouped data in a frequency
table. Calculate the weighted arithmetic mean under the assumption that
the raw data is not known. Determine the weighted median under the
assumption that the values within each class are equally distributed.

(g) What is the variance for the grouped data when the raw data is known, i.e.
when one has knowledge about the variance in each class? How does it
compare with the variance one obtains when the raw data is unknown?

(h) Use R to reproduce the results of (a), (b), (c), (e), and (f).

Exercise 3.2
A gambler notes down his wins and losses (in €) from playing 10 games of
roulette in a casino.

Round 1 2 3 4 5 6 7 8 9 10
Won/Lost 200 600 0

(a) Assume €90 and €294.7881. What is the result of round 10?

(b) Determine the mode and the interquartile range.

(c) A different gambler plays 33 rounds of roulette. His results are €12 and
€1000. Is it meaningful to compare the variability of results of the two
players by using the coefficient of variation? If yes, determine the
coefficients of variation; if no, why is a comparison not possible?

Exercise 3.3
A fashion boutique has summarized its daily sales of designer socks in different
groups: men’s socks, women’s socks, and children’s socks. Unfortunately, the
data for men’s socks was lost. Determine the missing values.

n Arithmetic mean Standard deviation


in € in €
Women’s wear 45 16

Men’s wear ? ? ?
Children’s wear 20 7.5

Total 100 15

Exercise 3.4
The number of members of a millionaires’ club were as follows:

Year 2011 2012 2013 2014 2015 2016


Members 23 24 27 25 30 28

(a) What is the average growth rate of the membership?

(b) Based on the results of (a), how many members would one expect in 2018?

(c) The president of the club is interested in the number of members in 2025,
the year when his presidency ends. Would it make sense to predict the
number of members for 2025?
In 2015, the members invested €250 million on the stock market. 10
members contributed 16 of the investment sum, 8 members contributed €60
million, 8 members contributed €70 million, and another 4 members contributed
the remaining amount.

(d) Draw the Lorenz curve for this data.

(e) Calculate and interpret the standardized Gini coefficient.

Exercise 3.5
Consider the monthly salaries Y (in Swiss francs) of a well-reputed software
company, as well as the length of service (in months, X), and gender (Z).
Figure 3.8 shows the QQ-plots for both Y and X given Z. Interpret both graphs.

Fig. 3.8 QQ-plots

Exercise 3.6
There is no built-in function in R to calculate the mode of a variable. Program
such a function yourself. Hint: type ?table and ?names to recall the
functionality of these functions. Combine them in an intelligent way.

Exercise 3.7
Consider a country in which 90 % of the wealth is owned by 20 % of the
population, the so-called upper class. For simplicity, let us assume that the
wealth is distributed equally within this class.

(a) Draw the Lorenz curve for this country.

(b) Now assume a revolution takes place in the country and all members of the
upper class have to give away their wealth which is then distributed equally
across the remaining population. Draw the Lorenz curve for this scenario.

(c) What would the curve from (b) look like if the entire upper class left the
country?
Exercise 3.8
A bus route in the mountainous regions of Romania has a length of 418 km. The
manager of the bus company serving the route wants his buses to finish a trip
within 8 h. The bus travels the first 180 km with an average speed of 48 km/h,
the next 117 km with an average speed of 37 km/h, and the last section with an
average speed of 52 km/h.

(a) What is the average speed with which the bus travels?

(b) Will the bus finish the trip in time?

Exercise 3.9
Four friends have a start-up company which sells vegan ice cream. Their initial
financial contributions are as follows:

Person 1 2 3 4
Contribution (in €) 800 10300 4700 2220

(a) Calculate and draw the Lorenz curve.

(b) Determine and interpret the standardized Gini coefficient.

(c) Does change if each of the friends contributes only half the amount of
money? If yes, how much? If no, why not?

(d) Use R to draw the above Lorenz curve and to calculate the Gini coefficient.

Exercise 3.10
Recall the pizza delivery data which is described in Appendix A.4. Use R to read
in and analyse the data.

(a) Calculate the mean, median, minimum, maximum, first quartile, and third
quartile for all quantitative variables.

(b) Determine and interpret the 99 % quantile for delivery time and
temperature.

(c) Write a function which calculates the absolute mean deviation. Use the
function to calculate the absolute mean deviation of temperature.

(d) Scale the delivery time and calculate the mean and variance for this
variable.

(e) Draw a box plot for delivery time and temperature. The box plots should
not highlight extreme values.

(f) Use the cut command to create a new variable which summarizes delivery
time in steps of 10 min. Calculate the arithmetic mean of this variable.

(g) Reproduce the QQ-plots shown in Example 3.1.6.

Source: Toutenburg, H., Heumann, C., Deskriptive Statistik, 7th edition, 2009, Springer, Heidelberg

4. Association of Two Variables



In Chaps. 2 and 3 we discussed how to analyse a single variable using graphs


and summary statistics. However, in many situations we may be interested in the
interdependence of two or more variables. For example, suppose we want to
know whether male and female students in a college have any preference
between the subjects mathematics and biology, i.e. if there is any evidence that
male students prefer mathematics over biology and female students prefer
biology over mathematics or vice versa. Suppose we choose an equal number of
male and female students and ask them about their preferred subject. We expect
that if there is no association between the two variables “gender of student”
(male or female) and “subject” (mathematics or biology), then an equal
proportion of male and female students should choose the subjects biology and
mathematics, respectively. Any difference in the proportions may indicate a
preference of males or females for a particular subject. Similarly, in another
example, we may want to find out whether female employees of an organization
are paid less than male employees or vice versa. Let us assume again that we
choose an equal number of male and female employees and assume further that
the salary is measured as a binary variable (low- versus high-salary group). We
then expect that if there is no gender discrimination, the number of male and
female employees in the lower- and higher-salary groups in the organization
should be approximately equal. In both examples, the variables considered are
binary and nominal (although the salary can also be seen as ordinal) and the data
is summarized in terms of frequency measures. There may, however, be
situations in which we are interested in associations between ordinal or
continuous variables. Consider a data set in which height, weight, and age of
infants are given. Usually, the height and weight of infants increase with age.
Also, the height of infants increases with their weight and vice versa. Clearly,
there is an interrelation or association among the three variables. In another
example, two persons have to judge participants of a dance competition and rank
them according to their performance. Now if we want to learn about the fairness
in the judgment, we expect that both the judges give similar ranks to each
candidate, i.e. both judges give high ranks to good candidates and low ranks to
not so good candidates. We are therefore interested in studying the association
between the ranks given by the two judges. In all these examples, the intention
lies in measuring the degree of association between two (or more) variables. We
therefore need to study different ways of measuring the association level for
different types of variables. In this chapter, we present measures and graphical
summaries for the association of two variables—dependent on their scale.

4.1 Summarizing the Distribution of Two Discrete Variables
When both variables are discrete, then it is possible to list all combinations of
values of the two variables and to count how often these combinations occur in
the data. Consider the salary example in the introduction to this chapter in which
both the variables were binary. There are four possible combinations of variable
categories (female and low-salary group, female and high-salary group, male and
low-salary group, and male and high-salary group). A complete description of
the joint occurrence of these two variables can be given by counting, for each
combination, the number of units for which this combination is measured. In the
following, we generalize this concept to two variables where each can have an
arbitrary (but fixed) number of values or categories.

4.1.1 Contingency Tables for Discrete Data


Suppose we have data on two discrete variables. This data can be described in a
two-dimensional contingency table.

Example 4.1.1
An airline conducts a customer satisfaction survey. The survey includes
questions about travel class and satisfaction levels with respect to different
categories such as seat comfort, in-flight service, meals, safety, and other
indicators. Consider the information on X, denoting the travel class (Economy =
“E”, Business = “B”, First = “F”), and “Y”, denoting the overall satisfaction with
the flight on a scale from 1 to 4 as 1 (poor), 2 (fair), 3 (good), and 4 (very good).
A possible response from 12 customers may look as follows:

Passenger number
i 1 2 3 4 5 6 7 8 9 10 11 12
Travel class E E E B E B F E E B E B
Satisfaction 2 4 1 3 1 2 4 3 2 4 3 3

We can calculate the absolute frequencies for each combination of observed values. For example, there are 2 passengers (passenger numbers 3 and
5) who were flying in economy class and rated the flight quality as poor, there
were no passengers from both business class and first class who rated the flight
quality as poor; there were 2 passengers who were flying in economy class and
rated the quality as fair (2), and so on. Table 4.1 is a two-dimensional table
summarizing this information.
Note that we not only summarize the joint frequency distribution of the two
variables but also the distributions of the individual variables. Summing up the
rows and columns of the table gives the respective frequency distributions. For
example, the last column of the table demonstrates that 7 passengers were flying
in economy class, 4 passengers were flying in business class and 1 passenger in
first class.
Table 4.1 Contingency table for travel class and satisfaction

                               Overall rating of flight quality
                               Poor   Fair   Good   Very good   Total (row)
Travel class   Economy         2      2      2      1           7
               Business        0      1      2      1           4
               First           0      0      0      1           1
Total (column)                 2      3      4      3           12

Table 4.2 k × l contingency table

                       Y
                       y_1     y_2     ...    y_l     Total (rows)
      x_1              n_11    n_12    ...    n_1l    n_1+
X     x_2              n_21    n_22    ...    n_2l    n_2+
      ...              ...     ...     ...    ...     ...
      x_k              n_k1    n_k2    ...    n_kl    n_k+
Total (columns)        n_+1    n_+2    ...    n_+l    n

Now we extend this example and discuss a general framework to summarize the absolute frequencies of two discrete variables in contingency tables. We use the following notations: Let $x_1, x_2, \ldots, x_k$ be the k classes of a variable X and let $y_1, y_2, \ldots, y_l$ be the l classes of another variable Y. We assume that both X and Y are discrete variables. It is possible to summarize the absolute frequencies $n_{ij}$ related to the combinations $(x_i, y_j)$, $i = 1, 2, \ldots, k$, $j = 1, 2, \ldots, l$, in a $k \times l$ contingency table as shown in Table 4.2.
We denote the sum of the $i$th row as $n_{i+} = \sum_{j=1}^{l} n_{ij}$ and the sum over the $j$th column as $n_{+j} = \sum_{i=1}^{k} n_{ij}$. The total number of observations is therefore
$n = \sum_{i=1}^{k} n_{i+} = \sum_{j=1}^{l} n_{+j} = \sum_{i=1}^{k} \sum_{j=1}^{l} n_{ij}.$   (4.1)

Remark 4.1.1
Note that it is also possible to use the relative frequencies instead of
the absolute frequencies in Table 4.2, see Example 4.1.2.

4.1.2 Joint, Marginal, and Conditional Frequency Distributions
When the data on two variables are summarized in a contingency table, there are
several concepts which can help us in studying the characteristics of the data.
For example, how the values of both the variables behave jointly, how the values
of one variable behave when another variable is kept fixed etc. These features
can be studied using the concepts of joint frequency distribution, marginal
frequency distribution, and conditional frequency distribution. If relative
frequency is used instead of absolute frequency, then we speak of the joint
relative frequency distribution, marginal relative frequency distribution, and
conditional relative frequency distribution.

Definition 4.1.1
Using the notations of Table 4.2, we define the following:
– The frequencies $n_{ij}$ form the joint frequency distribution of X and Y; the joint relative frequency distribution is given by $f_{ij} = n_{ij}/n$.
– The row and column totals $n_{i+} = \sum_{j=1}^{l} n_{ij}$ and $n_{+j} = \sum_{i=1}^{k} n_{ij}$ form the marginal frequency distributions of X and of Y, respectively; the marginal relative frequencies are $f_{i+} = n_{i+}/n$ and $f_{+j} = n_{+j}/n$.
– The conditional frequency distribution of X given $Y = y_j$ is given by $f^{X|Y}(x_i|y_j) = n_{ij}/n_{+j}$, and the conditional frequency distribution of Y given $X = x_i$ by $f^{Y|X}(y_j|x_i) = n_{ij}/n_{i+}$.
Note that for a bivariate joint frequency distribution, there will only be two
marginal (or relative) frequency distributions but possibly more than two
conditional (or relative) frequency distributions.
Table 4.3 Contingency table for travel class and satisfaction

Overall rating of flight quality


Poor Fair Good Very good Total (rows)
Travel class Economy 10 33 15 4 62
Business 0 3 20 2 25
First 0 0 5 8 13
Total (columns) 10 36 40 14 100

Example 4.1.2
Recall the setup of Example 4.1.1. We now collect and evaluate the responses of
100 customers (instead of 12 passengers as in Example 4.1.1) regarding their
choice of the travel class and their overall satisfaction with the flight quality.
The data is provided in Table 4.3 where each of the cell entries $n_{ij}$ illustrates how many out of 100 passengers answered $x_i$ and $y_j$: for example, the first entry
“10” indicates that 10 passengers were flying in economy class and described
the overall service quality as poor.
The marginal frequency distributions are displayed in the last column and
last row, respectively. For example, the marginal distribution of X refers to
the frequency table of “travel class” (X) and tells us that 62 passengers were
flying in economy class, 25 in business class, and 13 in first class.
Similarly, the marginal distribution of “overall rating of flight quality” (Y)
tells us that 10 passengers rated the quality as poor, 36 as fair, 40 as good,
and 14 as very good.
The conditional frequency distributions give us an idea about the behaviour of one variable when the other one is kept fixed. For example, the conditional distribution of the “overall rating of flight quality” (Y) among passengers who were flying in economy class gives $f^{Y|X}(\cdot|\text{economy}) = (10/62, 33/62, 15/62, 4/62)$, which means that approximately 16 % of the customers in economy class are rating the quality as poor, 53 % of the customers in economy class are rating the quality as fair, 24 % of the customers in economy class are rating the quality as good and 6 % of the customers in economy class are rating the quality as very good. Similarly, $f^{Y|X}(\text{good}|\text{business}) = 20/25$ which means that 80 % of the customers in business class are rating the quality as good and so on.
The conditional frequency distribution of the “travel class” (X) of passengers given the “overall rating of flight quality” (Y) is obtained by $f^{X|Y}(x_i|y_j) = n_{ij}/n_{+j}$. For example, $f^{X|Y}(\cdot|\text{good})$ gives $(15/40, 20/40, 5/40)$, which means that 37.5 % of the passengers who rated the flight to be good travelled in economy class, 50 % of the passengers who rated the flight to be good travelled in business class and 12.5 % of the passengers who rated the flight to be good travelled in first class.
In total, we have 100 customers and hence
$n = \sum_{i=1}^{k} n_{i+} = 62 + 25 + 13 = \sum_{j=1}^{l} n_{+j} = 10 + 36 + 40 + 14 = 100.$

Alternatively, we can summarize X and Y using the relative frequencies as follows:

                               Overall rating of flight quality
                               Poor      Fair      Good      Very good   Total (rows)
Travel class   Economy         10/100    33/100    15/100    4/100       62/100
               Business        0         3/100     20/100    2/100       25/100
               First           0         0         5/100     8/100       13/100
Total (columns)                10/100    36/100    40/100    14/100      1

To produce the frequency table without the marginal distributions, we can use the R command table(X,Y). To obtain the full contingency table including the marginal distributions in R, one can use the function addmargins(). For the relative frequencies, the function prop.table() can be used. In summary, a full contingency table is obtained by using
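a combination of these commands. A minimal sketch, using the 12 responses from Example 4.1.1 for illustration (the vector names Class and Satisfaction are chosen here and are not taken from the original text):

  Class <- c("E","E","E","B","E","B","F","E","E","B","E","B")
  Satisfaction <- c(2, 4, 1, 3, 1, 2, 4, 3, 2, 4, 3, 3)
  table(Class, Satisfaction)                          # joint absolute frequencies
  addmargins(table(Class, Satisfaction))              # adds the marginal totals
  addmargins(prop.table(table(Class, Satisfaction)))  # relative frequencies with margins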

4.1.3 Graphical Representation of Two Nominal or Ordinal Variables
Bar charts (see Sect. 2.3.1) can be used to graphically summarize the association
between two nominal or two ordinal variables. The bar chart is drawn for X and
the categories of Y are represented by separated bars or stacked bars for each
category of X. In this way, we summarize the joint distribution of the
contingency table.

Example 4.1.3
Consider Example 4.1.2. There are 62 passengers flying in the economy class.
From these 62 passengers, 10 rated the quality of the flight as poor, 33 as fair, 15
as good, and 4 as very good. This means that for the economy class, we can either
place 4 bars next to each other, as in Fig. 4.1a, or we can stack them on top of
each other, as in Fig. 4.1b. The same can be done for the other categories of X,
see Fig. 4.1. Stacked and stratified bar charts are prepared in R by calling the library lattice and using the function barchart. In detail, one needs to specify:
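A possible call could look as follows; this is a sketch only, and the object name tc as well as the exact argument settings are assumptions rather than the original code:

  library(lattice)
  tc <- table(Class, Satisfaction)                   # contingency table, e.g. as in Table 4.3
  barchart(tc, horizontal = FALSE, stack = FALSE)    # bars placed next to each other (Fig. 4.1a)
  barchart(tc, horizontal = FALSE, stack = TRUE)     # stacked bars (Fig. 4.1b)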
Fig. 4.1 Bar charts for travel class and rating of satisfaction

Remark 4.1.2
There are several other options in R to specify stratified bar charts. We refer the
interested reader to Exercise 2.6 to explore how the R package ggplot2 can be
used to make such graphics. Sometimes it can also be useful to visualize the
difference of two variables and not stack or stratify the bars, see Exercise 2.1.

Independence and Expected Frequencies An important statistical concept is independence. In this section, we touch upon its descriptive aspects, see
Chaps. 6 (Sect. 6.5) and 7 (Sect. 7.5) for more theoretical details. Two variables
are considered to be independent if the observations on one variable do not
influence the observations on the other variable. For example, suppose two
different persons roll a die separately; then, the outcomes of their rolls do not
depend on each other. So we can say that the two observations are independent.
In the context of contingency tables, two variables are independent of each other
when the joint relative frequency equals the product of the marginal relative
frequencies of the two variables, i.e. the following equation holds:
$f_{ij} = f_{i+} f_{+j}, \quad \text{i.e.} \quad \frac{n_{ij}}{n} = \frac{n_{i+}}{n} \cdot \frac{n_{+j}}{n}.$   (4.2)
The expected absolute frequencies under independence are obtained by
$\tilde{n}_{ij} = \frac{n_{i+} n_{+j}}{n}.$   (4.3)
Note that the absolute frequencies are always integers but the expected absolute
frequencies may not always be integers.

Example 4.1.4
Recall Example 4.1.2. The expected absolute frequencies for the contingency
table can be calculated using (4.3). For example,
$\tilde{n}_{11} = \frac{n_{1+} n_{+1}}{n} = \frac{62 \cdot 10}{100} = 6.2, \quad \tilde{n}_{12} = \frac{62 \cdot 36}{100} = 22.32, \quad \text{and so on.}$
Table 4.4 lists both the observed absolute frequency and expected absolute
frequency (in brackets).
To calculate the expected absolute frequencies in R, we can access the
“expected” object returned from a χ²-test applied to the respective contingency
table as follows:
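A sketch, assuming the observed counts of Table 4.3 are stored in a matrix named obs (a name chosen here for illustration):

  obs <- matrix(c(10, 33, 15, 4,
                   0,  3, 20, 2,
                   0,  0,  5, 8), nrow = 3, byrow = TRUE)
  chisq.test(obs)$expected   # expected absolute frequencies under independence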
A detailed motivation and explanation of this command is given in Sect. 10.8.
Table 4.4 Observed and expected absolute frequencies for the airline survey
Overall rating of flight quality
Poor Fair Good Very good Total
Travel Economy 10 (6.2) 33 (22.32) 15 (24.8) 4 (8.68) 62
class Business 0 (2.5) 3 (9.0) 20 (10.0) 2 (3.5) 25
First 0 (1.3) 0 (4.68) 5 (5.2) 8 (1.82) 13
Total 10 36 40 14 100

4.2 Measures of Association for Two Discrete Variables


When two variables are not independent, then they are associated. Their
association can be weak or strong. Now we describe some popular measures of
association. Measures of association describe the degree of association between
two variables and can have a direction as well. Note that if variables are defined
on a nominal scale, then nothing can be said about the direction of association,
only about the strength.
Let us first consider a 2 × 2 contingency table which is a special case of a k × l contingency table, see Table 4.5.

Table 4.5 2 × 2 contingency table

                       Y
                       y_1       y_2       Total (row)
X     x_1              a         b         a + b
      x_2              c         d         c + d
Total (column)         a + c     b + d     n

The variables X and Y are independent if

$a = \frac{(a+b)(a+c)}{n}$   (4.4)
or equivalently if
$\frac{a}{a+b} = \frac{c}{c+d}.$   (4.5)
Note that some other forms of the conditions (4.4)–(4.5) can also be derived
in terms of a, b, c, and d.

Example 4.2.1
Suppose a vaccination against flu (influenza) is given to 200 persons. Some of
the persons may get affected by flu despite the vaccination. The data is
summarized in Table 4.6. Using the notations of Table 4.5, we have $a = 90$, $b = 10$, $c = 40$, $d = 60$, and thus, $(a+b)(a+c)/n = 100 \cdot 130 / 200 = 65$ which is less than $a = 90$. Hence, being affected by flu is not independent of the
vaccination, i.e. whether one is vaccinated or not has an influence on getting
affected by flu. In the vaccinated group, only 10 of 100 persons are affected by
flu while in the group not vaccinated 60 of 100 persons are affected. Another
interpretation is that if independence holds, then we would expect 65 persons to
be not affected by flu in the vaccinated group but we observe 90 persons. This
shows that vaccination has a protective effect.

Table 4.6 2 × 2 contingency table

Persons
Not affected Affected Total (row)
Vaccinated 90 10 100
Vaccination Not vaccinated 40 60 100
Total (column) 130 70 200

To gain a better understanding about the strength of association between two


variables, we need to develop the concept of dependence and independence
further. The following three subsections illustrate this in more detail.

4.2.1 Pearson’s χ² Statistic


We now introduce Pearson’s χ² statistic which is used for measuring the association between variables in a contingency table and plays an important role in the construction of statistical tests, see Sect. 10.8. The χ² statistic or χ² coefficient for a k × l contingency table is given as
$\chi^2 = \sum_{i=1}^{k} \sum_{j=1}^{l} \frac{(n_{ij} - \tilde{n}_{ij})^2}{\tilde{n}_{ij}}, \qquad \tilde{n}_{ij} = \frac{n_{i+} n_{+j}}{n}.$   (4.6)

A simpler formula for 2 × 2 contingency tables is
$\chi^2 = \frac{n(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}.$   (4.7)
The idea behind the χ² coefficient is that when the relationship between two variables is stronger, then the deviations between observed and expected frequencies are expected to be higher (because the expected frequencies are calculated assuming independence) and this indicates a stronger relationship between the two variables. If observed and expected frequencies are identical or similar, then this is an indication that the association between the two variables is weak and the variables may even be independent. The χ² statistic for a k × l contingency table sums up all the differences between the observed and expected frequencies, squares them, and scales them with respect to the expected frequencies. The squaring of the difference makes the statistic independent of the positive and negative signs of the difference between observed and expected frequencies. The range of values for χ² is
$0 \le \chi^2 \le n(\min(k,l) - 1).$   (4.8)
Note that $\min(k,l)$ is the minimum function and simply returns the smaller of the two numbers k and l. For example, $\min(3,4)$ returns the value 3. Consequently the values of χ² obtained from (4.6) can be compared with the range from (4.8). A value of χ² close to zero indicates a weak association and a value of χ² close to $n(\min(k,l)-1)$ indicates a strong association between the two variables. Note that the range of χ² depends on n, k and l, i.e. the sample size and the dimension of the contingency table.
The χ² statistic is a symmetric measure in the sense that its value does not depend on which variable is defined as X and which as Y.

Example 4.2.2
Consider Examples 4.1.2 and 4.1.4. Using the values from Table 4.4, we can
calculate the statistic as
The maximum possible value for the statistic is . Thus,
indicates a moderate association between “travel class” and “overall
rating of flight quality” of the passengers. In R, we obtain this result as follows:
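A sketch, reusing the matrix obs of observed counts introduced above (the object name is an assumption); the statistic can be computed either directly via (4.6) or with chisq.test:

  expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)   # expected frequencies, formula (4.3)
  sum((obs - expected)^2 / expected)                         # approx. 57.95
  chisq.test(obs)$statistic                                  # same value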

4.2.2 Cramer’s V Statistic


A problem with Pearson’s χ² coefficient is that its maximum value depends on the sample size and the size of the contingency table. These values may vary in different situations. To overcome this problem, the coefficient can be standardized to lie between 0 and 1 so that it is independent of the sample size as well as the dimension of the contingency table. Since $n(\min(k,l)-1)$ was the maximal value of the χ² statistic, dividing χ² by this maximal value automatically leads to a scaled version with maximal value 1. This idea is used by Cramer’s V statistic which for a k × l contingency table is given by
$V = \sqrt{\frac{\chi^2}{n(\min(k,l) - 1)}}.$   (4.9)
The closer the value of V gets to 1, the stronger the association between the two
variables.

Example 4.2.3
Consider Example 4.2.2. The obtained χ² statistic is 57.95064. To obtain Cramer’s V, we just need to calculate
$V = \sqrt{\frac{\chi^2}{n(\min(k,l) - 1)}} = \sqrt{\frac{57.95064}{100 \cdot 2}} \approx 0.54.$   (4.10)
This indicates a moderate association between “travel class” and “overall rating of flight quality” because 0.54 lies in the middle of 0 and 1. In R, there are two options to calculate V: (i) to calculate the χ² statistic and then adjust it as in (4.9), (ii) to use the functions assocstats and xtabs contained in the package vcd as follows:
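A sketch, assuming the 100 raw responses are stored in a data frame survey with columns Class and Rating (these object names are assumptions):

  library(vcd)
  tab <- xtabs(~ Class + Rating, data = survey)   # contingency table built from the raw data
  assocstats(tab)                                 # reports, among others, Cramer's V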
4.2.3 Contingency Coefficient C
Another option to standardize χ² is given by a corrected version of Pearson’s contingency coefficient:
$C_{corr} = \frac{C}{C_{max}} = \sqrt{\frac{\min(k,l)}{\min(k,l) - 1} \cdot \frac{\chi^2}{\chi^2 + n}},$   (4.11)
with
$C = \sqrt{\frac{\chi^2}{\chi^2 + n}} \quad \text{and} \quad C_{max} = \sqrt{\frac{\min(k,l) - 1}{\min(k,l)}}.$   (4.12)
It always lies between 0 and 1. The closer the value of C is to 1, the stronger the
association.

Example 4.2.4
We know from Example 4.2.2 that the χ² statistic for travel class and satisfaction level is 57.95064. To calculate $C_{corr}$, we need the following calculations:
$C = \sqrt{\frac{57.95064}{57.95064 + 100}} \approx 0.61, \quad C_{max} = \sqrt{\frac{2}{3}} \approx 0.82, \quad C_{corr} = \frac{C}{C_{max}} \approx 0.74.$
There is a moderate to strong association between “travel class” and “overall rating of flight quality” of the passengers. We can compute C in R using the vcd package as follows:
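Continuing the sketch from above (the object names tab and cs are assumptions), assocstats also returns Pearson's contingency coefficient, which can then be divided by Cmax from (4.12):

  cs <- assocstats(tab)
  cs$contingency                                     # Pearson's contingency coefficient C
  Cmax <- sqrt((min(dim(tab)) - 1) / min(dim(tab)))  # maximal value of C
  cs$contingency / Cmax                              # corrected coefficient C_corr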

4.2.4 Relative Risks and Odds Ratios


We now introduce the concepts of odds ratios and relative risks. Consider a 2 × 2 contingency table as introduced in Table 4.5. Now suppose we have two variables X and Y with their conditional distributions $f^{X|Y}$ and $f^{Y|X}$. In the context of a 2 × 2 contingency table, $f^{X|Y}(x_1|y_1) = a/(a+c)$, $f^{X|Y}(x_2|y_1) = c/(a+c)$, $f^{X|Y}(x_1|y_2) = b/(b+d)$, and $f^{X|Y}(x_2|y_2) = d/(b+d)$. The relative risks are defined as the ratio of two conditional distributions, for example
$\frac{f^{X|Y}(x_1|y_1)}{f^{X|Y}(x_1|y_2)} = \frac{a/(a+c)}{b/(b+d)} \quad \text{and} \quad \frac{f^{X|Y}(x_2|y_1)}{f^{X|Y}(x_2|y_2)} = \frac{c/(a+c)}{d/(b+d)}.$   (4.13)
The odds ratio is defined as the ratio of these relative risks from (4.13) as
$OR = \frac{a/(a+c)}{b/(b+d)} \Big/ \frac{c/(a+c)}{d/(b+d)} = \frac{ad}{bc}.$   (4.14)
Alternatively, the odds ratio can be defined as the ratio of the chances for “disease”, a / b (number of smokers with the disease divided by the number of non-smokers with the disease), and no disease, c / d (number of smokers with no disease divided by the number of non-smokers with no disease).
The relative risks compare proportions, while the odds ratio compares odds.

Example 4.2.5
A classical example refers to the possible association of smoking with a
particular disease. Consider the following data on 240 individuals:

Smoking Total (row)


Yes No
Disease Yes 34 66 100
No 22 118 140
Total (column) 56 184 240

We calculate the following relative risks:
$\frac{34/56}{66/184} \approx 1.69 \quad \text{and} \quad \frac{22/56}{118/184} \approx 0.61.$   (4.15)
Thus, the proportion of individuals with the disease is 1.69 times higher among smokers when compared with non-smokers. Similarly, the proportion of healthy individuals is 0.61 times smaller among smokers when compared with non-smokers.
The relative risks are calculated to compare the proportion of sick or healthy patients between smokers and non-smokers. Using these two relative risks, the odds ratio is obtained as
$OR = \frac{34 \cdot 118}{66 \cdot 22} = \frac{1.69}{0.61} \approx 2.76.$
We can interpret this outcome as follows: (i) the chances of smoking are 2.76
times higher for individuals with the disease compared with healthy individuals
(follows from definition (4.14)). We can also say that (ii) the chances of having
the particular disease is 2.76 times higher for smokers compared with non-
smokers. If we interchange either one of the “Yes” and “No” columns or the
“Yes” and “No” rows, we obtain $1/2.76 \approx 0.36$, giving us further
interpretations: (iii) the chances of smoking are 0.36 times lower for individuals
without disease compared with individuals with the disease, and (iv) the chance
of having the particular disease is 0.36 times lower for non-smokers compared
with smokers. Note that all four interpretations are correct and one needs to
choose the right interpretation in the light of the experimental situation and the
question of interest.

4.3 Association Between Ordinal and Continuous Variables
4.3.1 Graphical Representation of Two Continuous Variables
A simple way to graphically summarize the association between two continuous
variables is to plot the paired observations of the two variables in a two-
dimensional coordinate system. If n paired observations for two continuous
variables X and Y are available as $(x_i, y_i)$, $i = 1, 2, \ldots, n$, then all such observations
can be plotted in a single graph. This graph is called a scatter plot . Such a plot
reveals possible relationships and trends between the two variables. For
example, Figs. 4.2 and 4.3 show scatter plots with six different types of
association.
Fig. 4.2 Scatter plots
Fig. 4.3 Continues Fig. 4.2—more scatter plots

Figure 4.2a shows increasing values of Y for increasing values of X. We call


this relationship positive association. The relationship between X and Y is
nearly linear because all the points lie around a straight line.
Figure 4.2b shows decreasing values of Y for increasing values of X. We
call this relationship negative association.
Figure 4.2c tells us the same as Fig. 4.2a, except that the positive
association is weaker.
Figure 4.2d tells us the same as Fig. 4.2b, except that the negative
association is weaker.
Figure 4.3a shows that as the X-values increase, the values of Y neither
increase nor decrease. This indicates that there is no clear relationship
between X and Y and highlights the lack of association between X and Y.
Figure 4.3b illustrates a nonlinear relationship between X- and Y-values.

Example 4.3.1
To explore the possible relationship between the overall number of tweets with
the number of followers on Twitter, we take a sample of 10 prime ministers and
heads of state in different countries as of June 2014 and obtain the following
data:

Name Tweets Followers


Angela Merkel 25 7194
Barack Obama 11,800 43,400,000
Jacob Zuma 99 324,000
Dilma Rousseff 1934 2,330,000
Sauli Niinistö 199 39,000
Vladimir Putin 2539 189,000
Francois Hollande 4334 639,000
David Cameron 952 688,000
Enrique P. Nieto 3245 2,690,000
John Key 2468 110,000

The tweets are denoted by $x_i$ and the followers are denoted by $y_i$ ($i = 1, 2, \ldots, 10$). We plot the paired observations $(x_i, y_i)$ into a Cartesian coordinate system. For example, we plot $(x_1, y_1) = (25, 7194)$ for Angela Merkel, $(x_2, y_2) = (11{,}800, 43{,}400{,}000)$ for Barack Obama, and so on. Figure 4.4 shows the scatter plot for the number of tweets and the number of followers (on a log-scale).
One can see that there is a positive association between the number of tweets
and the number of followers. This does, however, not imply a causal
relationship: it is not necessarily because someone tweets more he/she has more
followers or because someone has more followers he/she tweets more; the
scatter plot just describes that those with more tweets have more followers. In R,
we produce this scatter plot by the plot command:
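A sketch, entering the data from the table above by hand; the vector names are chosen for illustration and plotting both axes on a log-scale is an assumption:

  tweets    <- c(25, 11800, 99, 1934, 199, 2539, 4334, 952, 3245, 2468)
  followers <- c(7194, 43400000, 324000, 2330000, 39000, 189000,
                 639000, 688000, 2690000, 110000)
  plot(tweets, followers, log = "xy")   # scatter plot on logarithmic axes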
Fig. 4.4 Scatter plot between tweets and followers

4.3.2 Correlation Coefficient


Suppose two variables X and Y are measured on a continuous scale and are linearly related like $Y = a + bX$, where a and b are constant values. The correlation coefficient $r = r(X, Y)$ measures the degree of linear relationship between X and Y using
$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}},$   (4.16)
with
$S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$   (4.17)
and
$S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2, \quad S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2.$   (4.18)
Karl Pearson (1857–1936) presented the first rigorous treatment of correlation
and acknowledged Auguste Bravais (1811–1863) for ascertaining the initial
mathematical formulae for correlation. This is why the correlation coefficient is
also known as the Bravais–Pearson correlation coefficient.
The correlation coefficient is independent of the units of measurement of X
and Y. For example, if someone measures the height and weight in metres and
kilograms respectively and another person measures them in centimetres and
grams, respectively, then the correlation coefficient between the two sets of data
will be the same. The correlation coefficient is symmetric, i.e. $r(X,Y) = r(Y,X)$.
The limits of r are $-1 \le r \le 1$. If all the points in a scatter plot lie exactly on a straight line, then the linear relationship between X and Y is perfect and $|r| = 1$, see also Exercise 4.7. If the relationship between X and Y is (i) perfectly linear and increasing, then $r = +1$ and (ii) perfectly linear and decreasing, then $r = -1$.
The signs of r thus determine the direction of the association. If r is close to
zero, then it indicates that the variables are independent or the relationship is not
linear. Note that if the relationship between X and Y is nonlinear, then the degree
of linear relationship may be low and r is then close to zero even if the variables
are clearly not independent. Note that and .

Example 4.3.2
Look again at the scatter plots in Figs. 4.2 and 4.3. We observe strong positive
linear correlation in Fig. 4.2a ( ), strong negative linear correlation in
Fig. 4.2b ( ), moderate positive linear correlation in Fig. 4.2c ( ),
moderate negative linear association in Fig. 4.2d ( ), no visible
correlation in Fig. 4.3a ( ), and strong nonlinear (but not so strong linear)
correlation in Fig. 4.3b ( ).

Table 4.7 Results of 100-m race and long jump of 5 athletes

                  100-m race (in s)   Long jump (in m)
Roman Sebrle      10.85               7.84
Bryan Clay        10.44               7.96
Dmitriy Karpov    10.50               7.81
Dean Macey        10.89               7.47
Chiel Warners     10.62               7.74

Example 4.3.3
In a decathlon competition, a group of athletes are competing with each other in
10 different track and field events. Suppose we are interested in how the results
of the 100-m race relate to the results of the long jump competition. The
results of the 100-m race (X, in seconds) and the long jump event (Y, in metres) for 5 athletes participating in the 2004 Olympic Games (see also Appendix A.4) are listed in Table 4.7.
To calculate the correlation coefficient, we need the following summary statistics: $\bar{x} = 10.66$, $\bar{y} = 7.764$, $S_{xx} = 0.1646$, $S_{yy} = 0.1333$, and $S_{xy} = -0.1027$. The correlation coefficient therefore is
$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{-0.1027}{\sqrt{0.1646 \cdot 0.1333}} \approx -0.69.$
Since r is negative, we can say that (i) there is a negative correlation between the 100-m race and the long jump event, i.e., shorter running times result in longer long jump results, and (ii) this association is moderate to strong. In R, we can obtain the results (after attaching the data) as follows:
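A sketch with the five observations entered by hand; the vector names time.100m and long.jump are chosen for illustration and are not necessarily the names used in the decathlon data set:

  time.100m <- c(10.85, 10.44, 10.50, 10.89, 10.62)
  long.jump <- c(7.84, 7.96, 7.81, 7.47, 7.74)
  cor(time.100m, long.jump, method = "pearson")   # approx. -0.69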

4.3.3 Spearman’s Rank Correlation Coefficient


Consider a situation where n objects are ranked with respect to two variables X
and Y. For instance, the variables could represent the opinion of two different
judges in a talent competition who rank the participants with respect to their
performance. This means that for each judge, the worst participant (with the
lowest score ) is assigned rank 1, the second worst participant (with the second
lowest score ) will receive rank 2, and so on. Thus, every participant has been
given two ranks by two different judges. Suppose we want to measure the degree
of association between the two different judgments; that is, the two different sets
of ranks. We expect that under perfect agreement, both the judges give the same
judgment in the sense that they give the same ranks to each candidate. However,
if they are not in perfect agreement, then there may be some variation in the
ranks assigned by them. To measure the degree of agreement, or, in general, the
degree of association, one can use Spearman’s rank correlation coefficient. As
the name says, this correlation coefficient uses only the ranks of the values and
not the values themselves. Thus, this measure is suitable for both ordinal and
continuous variables. We introduce the following notations: let $R(x_i)$ denote the rank of the $i$th observation on X, i.e. the rank of $x_i$ among the ordered values of X. Similarly, $R(y_i)$ denotes the rank of the $i$th observation of Y. The difference between the two rank values is $d_i = R(x_i) - R(y_i)$. Spearman’s rank correlation coefficient is defined as
$R = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}.$   (4.19)
The values of R lie between $-1$ and $+1$ and measure the degree of correlation between the ranks of X and Y. Note that it does not matter whether we choose an ascending or descending order of the ranks, the value of R remains the same. When all the observations are assigned exactly the same ranks, then $R = 1$ and when all the observations are assigned exactly the opposite ranks, then $R = -1$.

Example 4.3.4
Look again at the scatter plots in Figs. 4.2 and 4.3. We observe strong positive
correlation in Fig. 4.2a ( ), strong negative correlation in Fig. 4.2b (
), moderate positive correlation in Fig. 4.2c ( ), moderate
negative association in Fig. 4.2d ( ), no visible correlation in Fig. 4.3a (
), and strong nonlinear correlation in Fig. 4.3b ( ).

Example 4.3.5
Let us follow Example 4.3.3 a bit further and calculate Spearman’s rank
correlation coefficient for the first five observations of the decathlon data. Again
we list the results of the 100-m race (X) and the results of the long jump
competition (Y). In addition, we assign ranks to both X and Y. For example, the
shortest time receives rank 1, whereas the longest time receives rank 5.
Similarly, the shortest long jump result receives rank 1, the longest long jump
result receives rank 5.

                  x_i     R(x_i)   y_i    R(y_i)   d_i    d_i^2
Roman Sebrle      10.85   4        7.84   4        0      0
Bryan Clay        10.44   1        7.96   5        -4     16
Dmitriy Karpov    10.50   2        7.81   3        -1     1
Dean Macey        10.89   5        7.47   1        4      16
Chiel Warners     10.62   3        7.74   2        1      1
Total                                                     34

Using (4.19), Spearman’s rank correlation coefficient can be calculated as
$R = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \cdot 34}{5 \cdot 24} = -0.7.$
We therefore have a moderate to strong negative association between the 100-m race and the long jump event. We now know that for the 5 athletes above longer running times relate to shorter jumping distances which in turn means that a good performance in one discipline implies a good performance in the other discipline. In R, we can obtain the same results by using the cor command:
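Using the (hypothetical) vectors from the sketch in Example 4.3.3, Spearman's R is obtained by changing the method argument:

  cor(time.100m, long.jump, method = "spearman")   # -0.7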

If two or more observations take the same values for $x_i$ (or $y_i$), then there is a tie. In such situations, the respective ranks can simply be averaged, though more complicated solutions also exist (one of which is implemented in the R function cor). For example, if in Example 4.3.5 Bryan Clay’s 100-m time was 10.50 s instead of 10.44 s, then both Bryan Clay and Dmitriy Karpov had the same time. Instead of assigning the ranks 1 and 2 to them, we assign the rank 1.5 to each of them.
Table 4.8 Payment options and timeliness survey with 100 participating customers

Timeliness
Unsatisfied Satisfied Very satisfied Total
(1) (2) (3)
Payment options Not enough (1) 7 11 26 44
Enough (2) 10 15 31 56
Total 17 26 57 100

The differences between the correlation coefficient and the rank correlation
coefficient are manifold: firstly, Pearson’s correlation coefficient can be used for
continuous variables only, but not for nominal or ordinal variables. The rank
correlation coefficient can be used for either two continuous or two ordinal
variables or a combination of an ordinal and a continuous variable, but not for
two nominal variables. Moreover, the rank correlation coefficient responds to any monotonic relationship whereas Pearson’s correlation measures the degree of a linear relationship only—see also Fig. 4.3b. Another difference between the two
correlation coefficients is that Pearson uses the entire information contained in
the continuous data in contrast to the rank correlation coefficient which uses
only ordinal information contained in the ordered data.
4.3.4 Measures Using Discordant and Concordant Pairs
Another concept which uses ranks to measure the association between ordinal
variables is based on concordant and discordant observation pairs. It is best
illustrated by means of an example.

Example 4.3.6
Suppose an online book store conducts a survey on their customer’s satisfaction
with respect to both the timeliness of deliveries (X) and payment options (Y). Let
us consider the following contingency table with a summary of the
responses of 100 customers. We assume that the categories for both variables can
be ordered and ranks can be assigned to different categories, see the numbers in
brackets in Table 4.8. There are 100 observation pairs $(x_i, y_i)$ which summarize the response of the customers with respect to both X and Y. For example, there are 7 customers who were unsatisfied with the timeliness of the deliveries and complained that there are not enough payment options. If we compare two responses $(x_{i_1}, y_{i_1})$ and $(x_{i_2}, y_{i_2})$, it might be possible that one customer is more
happy (or more unhappy) than the other customer with respect to both X and Y or
that one customer is more happy with respect to X but more unhappy with
respect to Y (or vice versa). If the former is the case, then this is a concordant
observation pair; if the latter is true, then it is a discordant pair. For instance, a
customer who replied “enough” and “satisfied” is more happy than a customer
who replied “not enough” and “unsatisfied” because he is more happy with
respect to both X and Y.

In general, a pair is
concordant if $x_{i_1} > x_{i_2}$ and $y_{i_1} > y_{i_2}$ (or $x_{i_1} < x_{i_2}$ and $y_{i_1} < y_{i_2}$),
discordant if $x_{i_1} > x_{i_2}$ and $y_{i_1} < y_{i_2}$ (or $x_{i_1} < x_{i_2}$ and $y_{i_1} > y_{i_2}$),
tied if $x_{i_1} = x_{i_2}$ (or $y_{i_1} = y_{i_2}$).
Obviously, if we have only concordant observations, then there is a strong
positive association because a higher value of X (in terms of the ranking) implies
a higher value of Y. However, if we have only discordant observations, then
there is a clear negative association. The measures which are introduced below
simply put the number of concordant and discordant pairs into relation. This idea
is reflected in Goodman and Kruskal’s $\gamma$ which is defined as
$\gamma = \frac{K - D}{K + D},$   (4.20)
where K and D describe the number of concordant and discordant observation pairs, respectively. An alternative measure is Stuart’s $\tau_c$ given as
$\tau_c = \frac{2 \min(k,l)(K - D)}{n^2(\min(k,l) - 1)}.$   (4.21)
Both measures are standardized to lie between $-1$ and 1, where larger values
indicate a stronger association and the sign indicates the direction of the
association.

Example 4.3.7
Consider Example 4.3.6. A customer who replied “enough” and “satisfied” is more happy than a customer who replied “not enough” and “unsatisfied” because the observation pairs, using ranks, are (2, 2) and (1, 1) and therefore $x_{i_1} > x_{i_2}$ and $y_{i_1} > y_{i_2}$. There are $15 \cdot 7 = 105$ such pairs. Similarly those who said “not enough” and “unsatisfied” are less happy than those who said “enough” and “very satisfied” ($7 \cdot 31 = 217$ pairs). Figure 4.5 summarizes the comparisons in detail.
Figure 4.5a shows that the combination (not enough, unsatisfied) is concordant to (enough, satisfied) and (enough, very satisfied) and tied to (not enough, satisfied), (not enough, very satisfied), and (enough, unsatisfied). Thus for these comparisons, we have 0 discordant pairs, $7 \cdot (15 + 31) = 322$ concordant pairs and $7 \cdot (11 + 26 + 10) = 329$ tied pairs. Figure 4.5b–f show how the task can be completed. While tiresome, systematically working through the table (and making sure to not count pairs more than once) yields
$K = 7 \cdot (15 + 31) + 11 \cdot 31 = 663 \quad \text{and} \quad D = 26 \cdot (10 + 15) + 11 \cdot 10 = 760.$
As a visual rule of thumb, working from the top left to the bottom right yields the concordant pairs; and working from the bottom left to the top right yields the discordant pairs. It follows that
$\gamma = \frac{663 - 760}{663 + 760} \approx -0.07,$
which indicates no clear relationship between the two variables. A similar result is obtained using $\tau_c$ which is $\frac{2 \cdot 2 \cdot (663 - 760)}{100^2 \cdot 1} \approx -0.04$. This rather lengthy task can be made much quicker by using the ord.gamma and ord.tau commands from the R library ryouready:
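A sketch, assuming the counts of Table 4.8 are entered as a matrix (the object name and the exact way ord.gamma and ord.tau expect their input are assumptions):

  library(ryouready)
  tab48 <- matrix(c( 7, 11, 26,
                    10, 15, 31), nrow = 2, byrow = TRUE)
  ord.gamma(tab48)   # Goodman and Kruskal's gamma, approx. -0.07
  ord.tau(tab48)     # Stuart's tau_c, approx. -0.04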

Fig. 4.5 Scheme to visualize concordant (c), discordant (d), and tied (t) pairs in a contingency table

4.4 Visualization of Variables from Different Scales


If we want to jointly visualize the association between a variable X, which is
either nominal or ordinal and another variable Y, which is continuous, then we
can use any graph which is suitable for the continuous variable (see Chaps. 2 and
3) and produce it for each category of the nominal/ordinal variable. We
recommend using stratified box plots or stratified ECDF’s, as they are easy to
read when summarized in a single figure; however, it is also possible to place
histograms next to each other or on top of each other, or overlay kernel density
plots, but we do not illustrate this here in more detail.

Example 4.4.1
Consider again our pizza delivery example (Appendix A.4). If we are interested
in the pizza delivery times by branch, we may simply plot the box plots and
ECDF’s of delivery time by branch. Figure 4.6 shows that the shortest delivery
times can be observed in the branch in the East. Producing these graphs in R is
straightforward: The boxplot command can be used for two variables by separating them with the ~ sign. For the ECDF, we have to produce a plot for each branch and overlay them with the “add=TRUE” option.
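A sketch, assuming the pizza data are available as a data frame pizza with variables time and branch, and that the branch labels are "East", "West", and "Centre" (only the East branch is mentioned above; the other labels and object names are assumptions):

  boxplot(time ~ branch, data = pizza)                         # stratified box plots
  plot(ecdf(pizza$time[pizza$branch == "East"]))               # ECDF for the East branch
  plot(ecdf(pizza$time[pizza$branch == "West"]), add = TRUE)   # overlay the other branches
  plot(ecdf(pizza$time[pizza$branch == "Centre"]), add = TRUE)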
Fig. 4.6 Distribution of pizza delivery time stratified by branch

4.5 Key Points and Further Issues


4.6 Exercises
Exercise 4.1
A newspaper asks two of its staff to review the coffee quality at different trendy
cafés. The coffee can be rated on a scale from 1 (miserable) to 10 (excellent).
The results of the two coffee enthusiasts X and Y are as follows:

Café i

1 3 6
2 8 7
3 7 10
4 9 8
5 5 4

(a) Calculate and interpret Spearman’s rank correlation coefficient.

(b) Does Spearman’s R differ depending on whether ranks are assigned in a


decreasing or increasing order?

(c) Suppose the coffee can only be rated as either good (>5) or bad (≤5). Do
the chances of a good rating differ between the two journalists?

Exercise 4.2
A total of 150 customers of a petrol station are asked about their satisfaction with
their car and motorbike insurance. The results are summarized below:

Satisfied Unsatisfied Total


Car 33 25 58
Car (diesel engine) 29 31 60
Motorbike 12 20 32
Total 74 76 150

(a) Determine and interpret Pearson’s χ² statistic, Cramer’s V, and $C_{corr}$.

(b) Combine the categories “car” and “car (diesel engine)” and produce the
corresponding table. Calculate as efficiently as possible and give a
meaningful interpretation of the odds ratio.

(c) Compare the results from (a) and (b).

Exercise 4.3
There has been a big debate about the usefulness of speed limits on public roads.
Consider the following table which lists the speed limits for country roads (in
miles/h) and traffic deaths (per 100 million km) for different countries in 1986
when the debate was particularly serious:

Country Speed limit Traffic deaths


Denmark 55 4.1
Japan 55 4.7
Canada 60 4.3
Netherlands 60 5.1
Italy 75 6.1

(a) Draw the scatter plot for the two variables.

(b) Calculate the Bravais–Pearson and Spearman correlation coefficients.

(c) What are the effects on the correlation coefficients if the speed limit is
given in km/h rather than miles/h (1 mile/h ≈ 1.61 km/h)?

(d) Consider one more observation: the speed limit for England was 70 miles/h
and the death rate was 3.1.

(i) Add this observation to the scatter plot.

(ii) Calculate the Bravais–Pearson correlation coefficient given this


additional observation.

Exercise 4.4
The famous passenger liner Titanic hit an iceberg in 1912 and sank. A total of
337 passengers travelled in first class, 285 in second class, and 721 in third class.
In addition, there were 885 staff members on board. Not all passengers could be
rescued. Only the following were rescued: 135 from the first class, 160 from the
second class, 541 from the third class and 674 staff.

(a) Determine and interpret the contingency table for the variables “travel
class” and “rescue status”.

(b) Use a contingency table to summarize the conditional relative frequency


distributions of rescue status given travel class. Could there be an
association of the two variables?

(c) What would the contingency table from (a) look like under the
independence assumption? Calculate Cramer’s V statistic. Is there any
association between travel class and rescue status?

(d) Combine the categories “first class” and “second class” as well as “third
class” and “staff”. Create a contingency table based on these new
categories. Determine and interpret Cramer’s V, the odds ratio, and relative
risks of your choice.

(e) Given the results from (a) to (d), what are your conclusions?

Fig. 4.7 Temperature and hotel occupancy for the different cities
Exercise 4.5
To study the association of the monthly average temperature (in °C, X) and hotel occupation (in %, Y), we consider data from three cities: Polenca (Mallorca,
Spain) as a summer holiday destination, Davos (Switzerland) as a winter skiing
destination, and Basel (Switzerland) as a business destination.

Months Davos Polenca Basel


X Y X Y X Y
Jan 6 91 10 13 1 23
Feb 5 89 10 21 0 82
Mar 2 76 14 42 5 40
Apr 4 52 17 64 9 45
May 7 42 22 79 14 39
Jun 15 36 24 81 20 43
Jul 17 37 26 86 23 50
Aug 19 39 27 92 24 95
Sep 13 26 22 36 21 64
Oct 9 27 19 23 14 78
Nov 4 68 14 13 9 9
Dec 0 92 12 41 4 12

(a) Calculate the Bravais–Pearson correlation coefficient. The following


summary statistics are available: , , ,
, and .

(b) Interpret the scatter plot in Fig. 4.7 which visualizes temperature and hotel
occupancy for Davos (D), Polenca (P), and Basel (B).

(c) Use R to calculate the correlation coefficient separately for each city.
Interpret the results and discuss the use of the correlation coefficient if
more than two variables are available.

Exercise 4.6
Consider a neighbourhood survey on the use of a local park. Respondents were
asked whether the park may be used for summer music concerts and whether
dog owners should put their dogs on a lead. The results are summarized in the
following contingency table:

Put dogs on a lead


Agree No opinion Disagree Total
Use for concerts Agree 82 4 0 86
No opinion 8 43 9 60
Disagree 0 2 10 12
Total 90 49 19 158

(a) Calculate and interpret Goodman and Kruskal’s $\gamma$.

(b) Now ignore the ordinal structure of the data and calculate Cramer’s V.

(c) Create the contingency table which is obtained when the categories “no
opinion” and “agree” are combined.

(d) What is the relative risk of disagreement with summer concerts depending
on the opinion about using leads?

(e) Calculate the odds ratio and offer two interpretations of it.

(f) Determine $\gamma$ for the table calculated in (c).

(g) What is your final interpretation and what may be the best measure to use
in this example?

Exercise 4.7
Consider n observations for which $y_i = a + b x_i$, $b > 0$, holds. Show that $r = 1$.

Exercise 4.8
Make yourself familiar with the Olympic decathlon data described in
Appendix A.4. Read in and attach the data in R.

(a) Use R to calculate and interpret the Bravais–Pearson correlation coefficient


between the results of the discus and the shot-put events.

(b) There are 10 continuous variables. How many different correlation


coefficients can you calculate? How would you summarize them?

(c) Apply the cor command to the whole data and interpret the output.

(d) Omit the two rows which contain missing data and interpret the output
again.

Exercise 4.9
We are interested in the pizza delivery data which is described in Appendix A.4.

(a) Read in the data and create two new binary variables which describe
whether a pizza was hot ( C) and the delivery time was short (
min). Create a contingency table for the two new variables.

(b) Calculate and interpret the odds ratio for the contingency table from (a).

(c) Use Cramer’s V, Stuart’s $\tau_c$, Goodman and Kruskal’s $\gamma$, and a stacked bar
chart to explore the association between the categorical time and
temperature variables.

(d) Draw a scatter plot for the continuous time and temperature variables.
Determine both the Bravais–Pearson and Spearman correlation
coefficients.

(e) Use methods of your choice to explore the relationship between


temperature and driver, operator, number of ordered pizzas and bill. Is it
clear which of the variables influence the pizza temperature?
Part II
Probability Calculus

5. Combinatorics

5.1 Introduction
Combinatorics is a special branch of mathematics. It has many applications not
only in several interesting fields such as enumerative combinatorics (the
classical application), but also in other fields, for example in graph theory and
optimization.
First, we try to motivate and understand the role of combinatorics in
statistics. Consider a simple example in which someone goes to a cafe. The
person would like a hot beverage and a cake. Assume that one can choose
among three different beverages, for example cappuccino, hot chocolate, and
green tea, and three different cakes, let us say carrot cake, chocolate cake, and
lemon tart. The person may consider different beverage and cake combinations
when placing the order, for example carrot cake and cappuccino, carrot cake and
tea, and hot chocolate and lemon tart. From a statistical perspective, the
customer is evaluating the possible combinations before making a decision.
Depending on their preferences, the order will be placed by choosing one of the
combinations.
In this example, it is easy to calculate the number of possible combinations.
There are three different beverages and three different cakes to choose from,
leading to nine different (3 · 3 = 9) beverage and cake combinations. However,
suppose there is a choice of 15 hot beverages and 8 different cakes. How many
orders can be made? (Answer: 15 · 8 = 120) What if the person decides to order two
cakes, how will it affect the number of possible combinations of choices? It will
be a tedious task to count all the possibilities. So we need a systematic approach
to count such possible combinations. Combinatorics deals with the counting of
different possibilities in a systematic approach.
People often use the urn model to understand the system in the counting
process. The urn model deals with the drawing of balls from an urn. The balls in
the urn represent the units of a population, or the features of a population. The
balls may vary in colour or size to represent specific properties of a unit or
feature. We illustrate this concept in more detail in Fig. 5.1.

Fig. 5.1 a Representation of the urn model. Drawing from the urn model b with replacement and c without
replacement. Compositions of three drawn balls: d all balls are distinguishable and e some balls are not
distinguishable

Suppose there are 5 balls of three different colours—two black, one grey, and
two white (see Fig. 5.1a). This can be generalized to a situation in which there
are n balls in the urn and we want to draw m balls. Suppose we want to know
how many different possibilities exist to draw m out of n balls (thus
determining the number of distinguishable combinations).
To deal with such a question, we first need to decide whether a ball will be
put back into the urn after it is drawn or not. Figure 5.1b illustrates that a grey
ball is drawn from the urn and then placed back (illustrated by the two-headed
arrow). We say the ball is drawn with replacement. Figure 5.1c illustrates a
different situation in which the grey ball is drawn from the urn and is not placed
back into the urn (illustrated by the one-headed arrow). We say the ball is drawn
without replacement.
Further, we may be interested in knowing the
total number of ways in which the chosen set of balls can be arranged in a
distinguishable order (which we will define as permutations later in this
chapter).
To answer the question how many permutations exist, we first need to decide
whether all the chosen balls are distinguishable from each other or not. For
example, in Fig. 5.1d, the three chosen balls have different colours; therefore,
they are distinguishable. There are many options on how they can be arranged.
In contrast, some of the chosen balls in Fig. 5.1e are the same colour, they are
therefore not distinguishable. Consequently, the number of combinations is
much more limited. The concept of balls and urns just represents the features of
observations from a sample. We illustrate this in more detail in the following
example.

Example 5.1.1
Say a father promises his daughter three scoops of ice cream if she cleans up her
room. For simplicity, let us assume the daughter has a choice of four flavours:
chocolate, banana, cherry, and lemon. How many different choices does the
daughter have? If each scoop has to be a different flavour she obviously has
much less choice than if the scoops can have the same flavour. In the urn model,
this is represented by the concept of “with/without replacement”. The urn
contains 4 balls of 4 different colours which represent the ice cream flavours.
For each of the three scoops, a ball is drawn to determine the flavour. If we draw
with replacement, each flavour can be potentially chosen multiple times;
however, if we draw without replacement each flavour can be chosen only once.
Then, the number of possible combinations is easy to calculate: it is 4, i.e.
(chocolate, banana, and cherry); (chocolate, banana, and lemon); (chocolate,
cherry, and lemon); and (banana, cherry, and lemon). But what if we have more
choices? Or if we can draw flavours multiple times? We then need calculation
rules which help us counting the number of options.
Now, let us assume that the daughter picked the flavours (chocolate [C],
banana [B], and lemon [L]). Like many other children, she prefers to eat her
most favourite flavour (chocolate) last, and her least favourite flavour (cherry)
first. Therefore, the order in which the scoops are placed on top of the cone are
important! In how many different ways can the scoops be placed on top of the
cone? This relates to the question of the number of distinguishable permutations.
The answer is 6: (C,B,L)–(C,L,B)–(B,L,C)–(B,C,L)–(L,B,C)–(L,C,B). But what
if the daughter did pick a flavour multiple times, e.g. (chocolate, chocolate,
lemon)? Since the two chocolate scoops are non-distinguishable, there are fewer
permutations: (chocolate, chocolate, and lemon)–(chocolate, lemon, and
chocolate)–(lemon, chocolate, and chocolate).
The bottom line of this example is that the number of combinations/options
is determined by (i) whether we draw with or without replacement (i.e. allow
flavours to be chosen more than once) and (ii) whether the arrangement in a
particular order (=permutation) is of any specific interest.

Consider the urn example again. Suppose three balls of different colours, black,
grey, and white, are drawn. Now there are two options: The first option is to take
into account the order in which the balls are drawn. In such a situation, two
possible sets of balls such as (black, grey, and white) and (white, black, and
grey) constitute two different sets. Such a set is called an ordered set. In the
second option, we do not take into account the order in which the balls are
drawn. In such a situation, the two possible sets of balls such as (black, grey, and
white) and (white, black, and grey) are the same sets and constitute an unordered
set of balls.

Definition 5.1.1
A group of elements is said to be ordered if the order in which these elements
are drawn is of relevance. Otherwise, it is called unordered .

Examples.
Ordered samples:
– The first three places in an Olympic 100 m race are determined by
the order in which the athletes arrive at the finishing line. If 8
athletes are competing with each other, the number of possible
results for the first three places is of interest. In the urn language, we
are taking draws without replacement (since every athlete can only
have one distinct place).
– In a raffle with two prizes, the first drawn raffle ticket gets the first
prize and the second raffle ticket gets the second prize.
– There exist various esoteric tarot card games which claim to foretell
someone’s fortune with respect to several aspects of life. The order
in which the cards are shown on the table is important for the
interpretation.
Unordered samples:
– The selected members for a national football team. The order in
which the selected names are announced is irrelevant.
– Out of 10 economists, 10 medical doctors, and 10 statisticians, an
advisory committee consisting of 4 economists, 3 medical doctors,
and 2 statisticians is elected.
– Fishing 20 fish from a lake.
– A bunch of 10 flowers made from 21 flowers of 4 different colours.

Definition 5.1.2
The factorial function n! is defined as
$n! = n \cdot (n-1) \cdot (n-2) \cdots 2 \cdot 1 \quad \text{for } n \ge 1, \quad \text{and} \quad 0! = 1.$   (5.1)

Example 5.1.2
It follows from the definition of the factorial function that, for example,
$3! = 1 \cdot 2 \cdot 3 = 6 \quad \text{and} \quad 5! = 1 \cdot 2 \cdot 3 \cdot 4 \cdot 5 = 120.$
This can be calculated in R as follows:
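For example (the particular values are chosen here for illustration):

  factorial(3)   # 6
  factorial(5)   # 120
  factorial(0)   # 1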

5.2 Permutations
Definition 5.2.1
Consider a set of n elements. Each ordered composition of these n elements is
called a permutation.

We distinguish between two cases: If all the elements are distinguishable, then
we speak of permutation without replacement . However, if some or all of the
elements are not distinguishable, then we speak of permutation with replacement
. Please note that the meaning of “replacement” here is just a convention and
does not directly refer to the drawings, e.g. from the urn model considered in
Example 5.1.1.

5.2.1 Permutations without Replacement


If all the n elements are distinguishable, then there are
$n!$   (5.2)
different compositions of these elements.

Example 5.2.1
There were three candidate cities for hosting the 2020 Olympic Games: Tokyo
(T), Istanbul (I), and Madrid (M). Before the election, there were $3! = 6$ possible outcomes, regarding the final rankings of the cities:
(T, I, M), (T, M, I), (I, T, M), (I, M, T), (M, T, I), (M, I, T).
5.2.2 Permutations with Replacement


Assume that not all n elements are distinguishable. The elements are divided into
groups, and these groups are distinguishable. Suppose, there are s groups of sizes
$n_1, n_2, \ldots, n_s$ with $n_1 + n_2 + \cdots + n_s = n$. The total number of different ways to arrange the n elements in s groups is:
$\frac{n!}{n_1! \, n_2! \cdots n_s!}.$   (5.3)

Example 5.2.2
Consider the data in Fig. 5.1e. There are two groups consisting of two black
balls ($n_1 = 2$) and one white ball ($n_2 = 1$). So there are the following three possible combinations to arrange the balls: (black, black, and white), (black, white, and black), and (white, black, and black). This can be determined by calculating
$\frac{3!}{2! \, 1!} = \frac{6}{2} = 3.$
5.3 Combinations
Definition 5.3.1
The binomial coefficient for any integers m and n with $n \ge m \ge 0$ is denoted and defined as
$\binom{n}{m} = \frac{n!}{m!(n-m)!}.$   (5.4)
It is read as “n choose m” and can be calculated in R using the following
command:
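For example (illustrative values):

  choose(4, 2)   # 6
  choose(5, 3)   # 10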

There are several calculation rules for the binomial coefficient:
$\binom{n}{0} = 1, \quad \binom{n}{n} = 1, \quad \binom{n}{1} = n, \quad \binom{n}{m} = \binom{n}{n-m}.$   (5.5)
We now answer the question of how many different possibilities exist to draw m
out of n elements, i.e. m out of n balls from an urn. It is necessary to distinguish
between the following four cases:

(1) Combinations without replacement and without consideration of the order


of the elements.

(2) Combinations without replacement and with consideration of the order of


the elements.

(3) Combinations with replacement and without consideration of the order of


the elements.

(4) Combinations with replacement and with consideration of the order of the
elements.

5.3.1 Combinations without Replacement and without


Consideration of the Order
When there is no replacement and the order of the elements is also not relevant,
then the total number of distinguishable combinations in drawing m out of n
elements is
$\binom{n}{m} = \frac{n!}{m!(n-m)!}.$   (5.6)

Example 5.3.1
Suppose a company elects a new board of directors. The board consists of 5
members and 15 people are eligible to be elected. How many combinations for
the board of directors exist? Since a person cannot be elected twice, we have a
situation where there is no replacement. The order is also of no importance:
either one is elected or not. We can thus apply (5.6) which yields
$\binom{15}{5} = \frac{15!}{5! \cdot 10!} = 3003$
possible combinations. This result can be obtained in R by using the command


choose(15,5).

5.3.2 Combinations without Replacement and with


Consideration of the Order
The total number of different combinations for the setting without replacement
and with consideration of the order is
n! / (n − m)!    (5.7)

Example 5.3.2
Consider a horse race with 12 horses. A possible bet is to forecast the winner of
the race, the second horse of the race, and the third horse of the race. The total
number of different combinations for the horses in the first three places is
12! / (12 − 3)! = 12 · 11 · 10 = 1320.
This result can be explained intuitively: for the first place, there is a choice of 12
different horses. For the second place, there is a choice of 11 different horses (12
horses minus the winner). For the third place, there is a choice of 10 different
horses (12 horses minus the first and second horses). The total number of
combinations is the product 12 · 11 · 10 = 1320. This can be calculated in R as follows:
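(A possible way to do this; the command printed in the book is not shown above.)

12 * 11 * 10                        # 1320
factorial(12) / factorial(12 - 3)   # equivalently, 12!/(12-3)! = 1320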
5.3.3 Combinations with Replacement and without
Consideration of the Order
The total number of different combinations with replacement and without
consideration of the order is
“n + m − 1 choose m” = “n + m − 1 choose n − 1”.    (5.8)
Note that these are the two representations which follow from the definition of
the binomial coefficient but typically only the first representation is used in
textbooks. We will motivate the second representation after Example 5.3.3.

Example 5.3.3
A farmer has 2 fields and aspires to cultivate one out of 4 different organic
products per field. Then, the total number of choices he has is
“4 + 2 − 1 choose 2” = “5 choose 2” = 5! / (2! · 3!) = 10.    (5.9)
If the 4 different organic products are denoted as a, b, c, and d, then the following
combinations are possible:
(a, a), (a, b), (a, c), (a, d), (b, b), (b, c), (b, d), (c, c), (c, d), (d, d).

Please note that, for example, (a,b) is identical to (b,a) because the order in
which the products a and b are cultivated on the first or second field is not
important in this example.

We now try to give an intuitive explanation of formula (5.9) using Example


5.3.3. We have n = 4 products and m = 2 fields and apply the following technical
“trick”: we sort the combinations by the product symbols (a, b, c, or d). When
we switch from one product to the next (e.g. from b to c), we make a note by
adding a vertical line |. Whenever a product is skipped, we add a line too. For
example, the combination (a, c) is denoted by a||c|, the combination (d, d) by
|||dd, (c, c) by ||cc|, and (a, a) by aa|||. Therefore, the number of characters
equates to the 2 chosen symbols of the set (a, b, c, d) plus the 3 vertical lines, in
summary n + m − 1 = 5 places where n − 1 = 3 places are selected for the vertical
line |. How many different line/letter combinations exist? There are 3 out of 5
possible positions for |, i.e. “5 choose 3” = 10 possible combinations, and this is nothing

but the right-hand side of (5.9).
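A quick check of the two representations in R:

choose(5, 3)    # 10 positions for the 3 vertical lines
choose(5, 2)    # identical to "4 + 2 - 1 choose 2" = 10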

5.3.4 Combinations with Replacement and with


Consideration of the Order
The total number of different combinations for the integers m and n with
replacement and when the order is of relevance is
n^m.    (5.10)

Example 5.3.4
Consider a credit card with a four-digit personal identification number (PIN)
code. The total number of possible combinations for the PIN is
n^m = 10^4 = 10,000.
Note that every digit in the first, second, third, and fourth places (m = 4) can be
chosen out of ten digits from 0 to 9 (n = 10).
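In R, this is simply:

10^4    # 10000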

5.4 Key Points and Further Issues


5.5 Exercises
Exercise 5.1
At a party with 10 guests, every guest shakes hands with each other guest. How
many handshakes can be counted in total?

Exercise 5.2
A language teacher is concerned about the vocabularies of his students. He thus
tests 5 students in each lecture. What is the total number of possible
combinations

(a) if a student is tested only once per lecture and

(b) if a student is tested more than once per lecture?


Use R to quantify numbers which you cannot calculate manually.

Exercise 5.3
“Gobang” is a popular game in which two players set counters on a board with
381 knots. One needs to place 5 consecutive counters in a row to win the game.
There are also rules on how to remove counters from the other player. Consider a
match where 64 counters have already been placed on the board. How many
possible combinations exist to place 64 counters on the board?

Exercise 5.4
A shop offers a special tray of beer: “Munich’s favourites”. Customers are
allowed to fill the tray, which holds 20 bottles, with any combination of
Munich’s 6 most popular beers (from 6 different breweries).

(a) What are the number of possible combinations to fill the tray?

(b) A customer insists on having at least one beer from each brewery in his
tray. How many options does he have to fill the tray?

Exercise 5.5
The FIFA World Cup 2018 in Russia consists of 32 teams. How many
combinations for the top 3 teams exist when

(a) taking into account the order of these top 3 teams and

(b) without taking into account the order of these top 3 teams?

Exercise 5.6
An online book store assigns membership codes to each member. For
administrative reasons, these codes consist of four letters between “A” and “L”.
A special discount period increased the total number of members from 18,200 to
20,500. Are there enough combinations of codes left to be assigned for the new
membership codes?

Exercise 5.7
In the old scoring system of ice skating (valid until 2004), each member of a jury
of 9 people judged the performance of the skaters on a scale between 0 and 6. It
was a decimal scale and thus scores such as 5.1 and 5.2 were possible. Calculate
the number of possible score combinations from the jury.

Exercise 5.8
It is possible in Pascal’s triangle (Fig. 5.2, left) to view each entry as the sum of
the two entries directly above it. For example, the 3 on the fourth line from the
top is the sum of the 1 and 2 above the 3. Another interpretation refers to a
geometric representation of the binomial coefficient, (Fig. 5.2, right) with

being the column index and being the row index.

Fig. 5.2 Excerpt from Pascal’s triangle (left) and its representation by means of binomial coefficients
(right)

(a) Show that each entry in the bold third diagonal line can be represented via
.

(b) Now show that the sum of two consecutive entries in the bold third
diagonal line always corresponds to quadratic numbers.

6. Elements of Probability Theory



Let us first consider some simple examples to understand the need for
probability theory. Often one needs to make a decision whether to carry an
umbrella or not when leaving the house; a company might wonder whether to
introduce a new advertisement to possibly increase sales or to continue with their
current advertisement; or someone may want to choose a restaurant based on
where he can get his favourite dish. In all these situations, randomness is
involved. For example, the decision of whether to carry an umbrella or not is
based on the possibility or chance of rain. The sales of the company may
increase, decrease, or remain unchanged with a new advertisement. The
investment in a new advertising campaign may therefore only be useful if the
probability of its success is higher than that of the current advertisement.
Similarly, one may choose the restaurant where one is most confident of getting
the food of one’s choice. In all such cases, an event may be happening or not and
depending on its likelihood, actions are taken. The purpose of this chapter is to
learn how to calculate such likelihoods of events happening and not happening.
6.1 Basic Concepts and Set Theory
A simple (not rigorous) definition of a random experiment requires that the
experiment can be repeated any number of times under the same set of
conditions, and its outcome is known only after the completion of the
experiment. A simple and classical example of a random experiment is the
tossing of a coin or the rolling of a die. When tossing a coin, it is unknown what
the outcome will be, head or tail, until the coin is tossed. The experiment can be
repeated and different outcomes may be observed in each repetition. Similarly,
when rolling a die, it is unknown how many dots will appear on the upper
surface until the die is rolled. Again, the die can be rolled repeatedly and
different numbers of dots are obtained in each trial. A possible outcome of a
random experiment is called a simple event (or elementary event ) and denoted
by ω. The set of all possible outcomes, {ω1, ω2, …, ωk}, is called the sample
space and is denoted as Ω, i.e. Ω = {ω1, ω2, …, ωk}. Subsets of Ω are called
events and are denoted by capital letters such as A, B, C. The set of all simple
events that are contained in the event A is denoted by ΩA. The event Ā refers to
the non-occurring of A and is called a composite or complementary event .
Also Ω is an event. Since it contains all possible outcomes, we say that Ω will
always occur and we call it a sure event or certain event . On the other hand, if
we consider the null set ∅ as an event, then this event can never occur and
we call it an impossible event . The sure event therefore is the set of all
elementary events, and the impossible event is the set with no elementary events.
The above concepts of “events” form the basis of a definition of
“probability”. Once we understand the concept of probability, we can develop a
framework to make conclusions about the population of interest, using a sample
of data.

Example 6.1.1
(Rolling a die) If a die is rolled once, then the possible outcomes are the number
of dots on the upper surface: 1, 2, …, 6. Therefore, the sample space is the set of
simple events ω1 = “1”, ω2 = “2”, …, ω6 = “6” and Ω = {ω1, ω2, …, ω6}. Any
subset of Ω can be used to define an event. For example, an event A may be “an
even number of dots on the upper surface of the die”. There are three
possibilities that this event occurs: ω2, ω4, or ω6. If an odd number shows up,
then the composite event Ā occurs instead of A. If an event is defined to observe
only one particular number, say “1”, then it is an elementary event. An
example of a sure event is “a number which is greater than or equal to 1”
because any number between 1 and 6 is greater than or equal to 1. An impossible
event is “the number is 7”.

Example 6.1.2
(Rolling two dice) Suppose we throw two dice simultaneously and an event is
defined as the “number of dots observed on the upper surface of both the dice”;
then, there are 36 simple events defined as (number of dots on first die, number
of dots on second die), i.e. ω = (m, n) with m, n ∈ {1, 2, …, 6}. Therefore Ω is
Ω = {(1, 1), (1, 2), …, (1, 6), (2, 1), …, (6, 6)}.
One can define different events and their corresponding sample spaces. For
example, if an event A is defined as “upper faces of both the dice contain the
same number of dots”, then the sample space is
ΩA = {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)}. If another event B is defined as “the sum of numbers on the upper
faces is 6”, then the sample space is ΩB = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)}. A sure
event is “get either an even number or an odd number”; an impossible event
would be “the sum of the two dice is greater than 13”.

It is possible to view events as sets of simple events. This helps to determine


how different events relate to each other. A popular technique to visualize this
approach is to use Venn diagrams. In Venn diagrams, two or more sets are
visualized by circles. Overlapping circles imply that both events have one or
more identical simple events. Separated circles mean that none of the simple
events of event A are contained in the sample space of B. We use the following
notations:
Fig. 6.1 A ∪ B (left) and A ∩ B (right)

Fig. 6.2 A\B (left) and Ā (right)

A ∪ B: The union of events A ∪ B is the set of all simple events of A and B
which occurs if at least one of the simple events of A or B occurs
(Fig. 6.1, left side, grey shaded area). Please note that we use the word
“or” from a statistical perspective: “A or B” means that either a simple
event from A occurs, or a simple event from B occurs, or a simple event
which is part of both A and B occurs.
A ∩ B: The intersection of events A ∩ B is the set of all simple events of A and B
which occur when a simple event occurs that belongs to A and B
(Fig. 6.1, right side, grey shaded area).
A\B: The event A\B contains all simple events of A, which are not contained
in B. The event “A but not B” or “A minus B” occurs, if A occurs but B
does not occur. Also A\B = A ∩ B̄ (Fig. 6.2, left side, grey shaded area).
Ā: The event Ā contains all simple events of Ω, which are not contained in
A. The complementary event of A (which is “Not-A” or “Ā”) occurs
whenever A does not occur (Fig. 6.2, right side, grey shaded area).
A ⊆ B: A is a subset of B. This means that all simple events of A are also part
of the sample space of B.

Example 6.1.3
Consider Example 6.1.1 where the sample space of rolling a die was determined
as Ω = {ω1, ω2, …, ω6} with ω1 = “1”, ω2 = “2”, …, ω6 = “6”.
If and B is the set of all odd numbers, then
and thus .
If A = {“2”, “4”, “6”} is the set of even numbers and B = {“3”, “6”} is the set of all
numbers which are divisible by 3, then A ∪ B = {“2”, “3”, “4”, “6”} is the
collection of simple events for which the number is either even or divisible
by 3 or both.
If A = {“1”, “3”, “5”} is the set of odd numbers and B = {“3”, “6”} is the set of the
numbers which are divisible by 3, then A ∩ B = {“3”} is the set of simple
events in which the numbers are odd and divisible by 3.
If A = {“1”, “3”, “5”} is the set of odd numbers and B = {“3”, “6”} is the set of the
numbers which are divisible by 3, then A\B = {“1”, “5”} is the set of simple
events in which the numbers are odd but not divisible by 3.
If A = {“2”, “4”, “6”} is the set of even numbers, then Ā = {“1”, “3”, “5”} is the set of
odd numbers.

Remark 6.1.1
Some textbooks also use the following notations:

We can use these definitions and notations to derive the following properties of a
particular event A:

Definition 6.1.1
Two events A and B are disjoint if A ∩ B = ∅ holds, i.e. if both events cannot
occur simultaneously.

Example 6.1.4
The events A and Ā are disjoint events.

Definition 6.1.2
The events A1, A2, …, Am are said to be mutually or pairwise disjoint, if
Ai ∩ Aj = ∅ whenever i ≠ j.

Example 6.1.5
Recall Example 6.1.1. If A = {“1”, “3”, “5”} and B = {“2”, “4”, “6”} are the sets of odd
and even numbers, respectively, then the events A and B are disjoint.

Definition 6.1.3
The events A1, A2, …, Am form a complete decomposition of Ω if and only if
Ai ∩ Aj = ∅ for all i ≠ j
and A1 ∪ A2 ∪ ⋯ ∪ Am = Ω.

Example 6.1.6
Consider Example 6.1.1. The elementary events A1 = {“1”}, A2 = {“2”}, …, A6 = {“6”}
form a complete decomposition. Other complete decompositions are, e.g.
A1 = {“1”, “3”, “5”}, A2 = {“2”, “4”, “6”} or A1 = {“1”, “2”, “3”}, A2 = {“4”, “5”, “6”}.
6.2 Relative Frequency and Laplace Probability


There is a close connection between the relative frequency and the probability of
an event. A random experiment is described by its possible outcomes, for
example getting a number between 1 and 6 when rolling a die. Suppose an
experiment has m possible outcomes (events) A1, A2, …, Am and the experiment is
repeated n times. Now we can count how many times each of the possible
outcomes has occurred. In other words, we can calculate the absolute frequency
ni = n(Ai) which is equal to the number of times an event Ai, i = 1, 2, …, m,
occurs. The relative frequency fi = f(Ai) of a random event Ai, with n repetitions
of the experiment, is calculated as
f(Ai) = ni / n = n(Ai) / n.    (6.1)

Example 6.2.1
Consider roulette, a game frequently played in casinos. The roulette table
consists of 37 numbers from 0 to 36. Out of these 37 numbers, 18 numbers are
red, 18 are black and one (zero) is green. Players can place their bets on either a
single number or a range of numbers, the colours red or black, whether the
number is odd or even, among many other choices. A casino employee spins a
wheel (containing pockets representing the 37 numbers) in one direction and
then spins a ball over the wheel in the opposite direction. The wheel and ball
gradually slow down and the ball finally settles in a pocket. The pocket number
in which the ball sits down when the wheel stops is the winning number.
Consider three possible outcomes A1: “red”, A2: “black”, and A3: “green (zero)”.
Suppose the roulette ball is spun n = 500 times. All the outcomes are counted and
recorded as follows: A1 occurs 240 times, A2 occurs 250 times and A3 occurs 10
times. Then, the absolute frequencies are given by n1 = n(A1) = 240,
n2 = n(A2) = 250, and n3 = n(A3) = 10. We therefore get the relative frequencies as
f(A1) = 240/500 = 0.48, f(A2) = 250/500 = 0.50, and f(A3) = 10/500 = 0.02.
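These relative frequencies can be obtained in R, for example as follows:

n  <- 500
ni <- c(red = 240, black = 250, green = 10)   # absolute frequencies
ni / n                                        # relative frequencies: 0.48, 0.50, 0.02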

If we assume that the experiment is repeated a large number of times


(mathematically, this would mean that n tends to infinity) and the experimental
conditions remain the same (at least approximately) over all the repetitions, then
the relative frequency f(A) converges to a limiting value for A. This limiting
value is interpreted as the probability of A and denoted by P(A), i.e.
P(A) = lim_{n→∞} n(A)/n,
where n(A) denotes the number of times an event A occurs out of n times.

Example 6.2.2
Suppose a fair coin is tossed n times and we observe the number of heads n1
times and the number of tails n2 times. The meaning of a fair coin
in this case is that the probabilities of head and tail are equal (i.e. 0.5). Then, the
relative frequencies in the experiment are f(head) = n1/n and
f(tail) = n2/n. When the coin is tossed a large number of times and n tends
to infinity, then both f(head) and f(tail) will have a limiting value 0.5 which is
simply the probability of getting a head or tail in tossing a fair coin.

Example 6.2.3
In Example 6.2.1, the relative frequency f(A1) of the event A1: “red” tends to 18/37 as n
tends to infinity because 18 out of 37 numbers are red.

The reader will gain a more theoretical understanding of how repeated


experiments relate to expected quantities in the following chapters after learning
the Theorem of Large Numbers described in Appendix A.3.
A different definition of probability was given by Pierre-Simon Laplace
(1749–1827). We call an experiment a Laplace experiment if the number of
possible simple events is finite and all the outcomes are equally probable. The
probability of an arbitrary event A is then defined as follows:

Definition 6.2.1
The proportion
P(A) = |A| / |Ω| = (number of favourable simple events) / (total number of possible simple events)    (6.2)
is called the Laplace probability , where |A| is the cardinal number of A, i.e. the
number of simple events contained in the set A, and |Ω| is the cardinal number of
Ω, i.e. the number of simple events contained in the set Ω.

The cardinal numbers |A| and |Ω| are often calculated using the combinatoric
rules introduced in Chap. 5.

Example 6.2.4
(Example 6.1.2 continued) The sample space contains 36 simple events. All of
these simple events have equal probability 1 / 36. To calculate the probability of
the event A that the sum of the dots on the two dice is at least 4 and at most 6, we
count the favourable simple events which fulfil this condition. The simple events
are (1, 3), (2, 2), (3, 1) (sum is 4), (1, 4), (2, 3), (4, 1), (3, 2) (sum is 5) and (1, 5),
(2, 4), (3, 3), (4, 2), (5, 1) (sum is 6). In total, there are 12 favourable
simple events, i.e. |A| = 12.
The probability of the event A is therefore P(A) = |A|/|Ω| = 12/36 = 1/3.
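One way to obtain the same count in R is to enumerate all 36 simple events:

omega <- expand.grid(die1 = 1:6, die2 = 1:6)   # all 36 simple events
s <- omega$die1 + omega$die2
sum(s >= 4 & s <= 6) / nrow(omega)             # 12/36 = 1/3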

6.3 The Axiomatic Definition of Probability


An important foundation for modern probability theory was established by
A.N. Kolmogorov in 1933 when he proposed the following axioms of
probability .

Axiom 1
Every random event A has a probability in the (closed) interval [0, 1], i.e.
0 ≤ P(A) ≤ 1.

Axiom 2
The sure event Ω has probability 1, i.e.
P(Ω) = 1.

Axiom 3
If A1 and A2 are disjoint events, then
P(A1 ∪ A2) = P(A1) + P(A2)
holds.

Remark
Axiom 3 also holds for three or more disjoint events and is called the theorem of
additivity of disjoint events . For example, if A1, A2, and A3 are disjoint events,
then
P(A1 ∪ A2 ∪ A3) = P(A1) + P(A2) + P(A3).
Example 6.3.1
Suppose the two events in tossing a coin are A1: “appearance of head” and A2:
“appearance of tail” which are disjoint. The event A1 ∪ A2: “appearance of head
or tail” has the probability
P(A1 ∪ A2) = P(A1) + P(A2) = 0.5 + 0.5 = 1.
Example 6.3.2
Suppose an event is defined as the number of points observed on the upper
surface of a die when rolling it. There are six events, i.e. the natural numbers 1,
2, 3, 4, 5, 6. These events are disjoint and they have equal probability of
occurring: P(A1) = P(A2) = ⋯ = P(A6) = 1/6. The probability of getting an even
number is then
P(A2 ∪ A4 ∪ A6) = P(A2) + P(A4) + P(A6) = 1/6 + 1/6 + 1/6 = 1/2.
6.3.1 Corollaries Following from Kolomogorov’s


Axioms
We already know that P(Ω) = 1 (sure event). Since A and Ā are disjoint, using
Axiom 3 we have
P(Ω) = P(A ∪ Ā) = P(A) + P(Ā) = 1.
Based on this, we have the following corollaries.

Corollary 1
The probability of the complementary event of A (i.e. Ā) is
P(Ā) = 1 − P(A).    (6.3)

Example 6.3.3
Suppose a box of 30 chocolates contains chocolates of 6 different flavours with 5
chocolates of each flavour. Suppose an event A is defined as A = {“marzipan
chocolate”}. The probability of finding a marzipan chocolate (without
looking into the box) is P(A) = 5/30 = 1/6. Then, the probability of the
complementary event Ā, i.e. the probability of not finding a marzipan chocolate,
is therefore
P(Ā) = 1 − P(A) = 1 − 1/6 = 5/6.

Corollary 2
The probability of occurrence of an impossible event ∅ is zero:
P(∅) = 0.
Corollary 3
Let A1 and A2 be not necessarily disjoint events. The probability of occurrence
of A1 or A2 is
P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2).    (6.4)

The rule in (6.4) is known as the additive theorem of probability . Again we


use the word “or” in the statistical sense: either A1 is occurring, A2 is occurring,
or both of them. This means we have to add the probabilities P(A1) and P(A2) but
need to make sure that the simple events which are contained in both sets are not
counted twice, thus we subtract P(A1 ∩ A2).

Example 6.3.4
There are 10 actors acting in a play. Two actors, one of whom is male, are
portraying evil characters. In total, there are 6 female actors. Let an event A
describe whether the actor is male and another event B describe whether the
character is evil. Suppose we want to know the probability of a randomly chosen
actor being male or evil. We can then calculate
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 4/10 + 2/10 − 1/10 = 5/10 = 0.5.

Corollary 4
If A ⊆ B, then P(A) ≤ P(B).

Proof
We use the representation B = A ∪ (Ā ∩ B) where A and Ā ∩ B are the disjoint
events. Then using Axiom 3 and Axiom 1, we get
P(B) = P(A) + P(Ā ∩ B) ≥ P(A).
6.3.2 Calculation Rules for Probabilities


The introduced axioms and corollaries can be summarized as follows:

(1) 0 ≤ P(A) ≤ 1

(2) P(Ω) = 1

(3) P(A1 ∪ A2) = P(A1) + P(A2), if A1 and A2 are disjoint


(4) P(Ā) = 1 − P(A)

(5) P(∅) = 0

(6) P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2)

(7) P(A) ≤ P(B), if A ⊆ B

6.4 Conditional Probability


Consider the following example to understand the concept of conditional
probability: Suppose a new medical test is developed to diagnose a particular
infection of the blood. The test is conducted on blood samples from 100
randomly selected patients and the outcomes of the tests are presented in
Table 6.1.
There are the following four possible outcomes:
The blood sample has an infection and the test diagnoses it, i.e. the test is
correctly diagnosing the infection.
The blood sample does not have an infection and the test does not diagnose
it, i.e. the test is correctly diagnosing that there is no infection.
The blood sample has an infection and the test does not diagnose it, i.e. the
test is incorrect in stating that there is no infection.
The blood sample does not have an infection but the test diagnoses it, i.e.
the test is incorrect in stating that there is an infection.
Table 6.2 contains the relative frequencies of Table 6.1. In the following, we
interpret the relative frequencies as probabilities, i.e. we assume that the values
in Table 6.2 would be observed if the number n of patients was much larger than
100.
Table 6.1 Absolute frequencies of test results and infection status

Infection Total (row)


Present Absent
Test Positive (+) 30 10 40
Negative (−) 15 45 60
Total (column) 45 55 Total = 100

Table 6.2 Relative frequencies of patients and test


Infection Total (row)
Present (IP) Absent (IA)
Test Positive (+) 0.30 0.10 0.40
Negative (−) 0.15 0.45 0.60
Total (column) 0.45 0.55 Total = 1

It can be seen that the probability that a test is positive is
P(+) = 0.30 + 0.10 = 0.40,
and the probability that an infection is present is
P(IP) = 0.30 + 0.15 = 0.45. If one already knows that the test is positive and wants
to determine the probability that the infection is indeed present, then this can be
achieved by the respective conditional probability which is
P(IP | +) = P(IP ∩ +) / P(+) = 0.30 / 0.40 = 0.75.
Note that P(IP ∩ +) denotes the “relative frequency of blood samples in which
the disease is present and the test is positive” which is 0.3.
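These quantities can be reproduced in R, e.g. by storing the relative frequencies of Table 6.2 as a matrix:

tab <- matrix(c(0.30, 0.10,
                0.15, 0.45), nrow = 2, byrow = TRUE,
              dimnames = list(test = c("+", "-"), infection = c("IP", "IA")))
sum(tab["+", ])                    # P(+)  = 0.40
sum(tab[, "IP"])                   # P(IP) = 0.45
tab["+", "IP"] / sum(tab["+", ])   # P(IP | +) = 0.75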
More generally, recall Definition 4.1.1 from Chap. 4 where we defined
conditional, joint, and marginal frequency distributions in contingency tables.
The present example simply applies these rules to the contingency tables of
relative frequencies and interprets the relative frequencies as an approximation
to the probabilities of interest, as already explained.
We use the intersection operator ∩ to describe events which occur for both A
and B. This relates to the joint relative frequencies. The marginal relative
frequencies (i.e. probabilities P(A) and P(B)) can be observed from the column and
row sums, respectively; and the conditional probabilities can be observed as the
joint frequencies in relation to the marginal frequencies.
For simplicity, assume that all simple events ω1, ω2, …, ωk in Ω are equally
probable, i.e. P(ωj) = 1/k, j = 1, 2, …, k. Let A and B be two events containing
nA and nB numbers of simple events. Let further A ∩ B contain nAB numbers of
simple events. The Laplace probability using (6.2) is
P(A) = nA/k,  P(B) = nB/k,  P(A ∩ B) = nAB/k.
Assume that we have prior information that A has already occurred. Now we
want to find out how the probability of B is to be calculated. Since A has already
occurred, we know that the sample space is reduced by the number of simple
events which are contained in A. There are nA such simple events. Thus, the total
sample space Ω is reduced by the sample space of A. Therefore, the simple
events in A ∩ B are those simple events which are realized when B is realized.
The Laplace probability for B under the prior information on A, or under the
condition that A is known, is therefore
P(B|A) = nAB/nA = (nAB/k)/(nA/k) = P(A ∩ B)/P(A).    (6.5)
This can be generalized to the case when the probabilities for simple events
are unequal.

Definition 6.4.1
Let P(A) > 0. Then the conditional probability of event B occurring, given that
event A has already occurred, is
P(B|A) = P(A ∩ B) / P(A).    (6.6)
The roles of A and B can be interchanged to define P(A|B) as follows. Let
P(B) > 0. The conditional probability of A given B is
P(A|B) = P(A ∩ B) / P(B).    (6.7)

We now introduce a few important theorems which are relevant to calculating


conditional and other probabilities.

Theorem 6.4.1
(Multiplication Theorem of Probability) For two arbitrary events A and B, the
following holds:
P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A).    (6.8)

This theorem follows directly from the two definitions (6.6) and (6.7) (but does
not require that P(A) > 0 and P(B) > 0).

Theorem 6.4.2
(Law of Total Probability) Assume that A1, A2, …, Am are events such that
A1 ∪ A2 ∪ ⋯ ∪ Am = Ω and Ai ∩ Aj = ∅ for i ≠ j, P(Ai) > 0 for all i, i.e. A1, A2, …, Am form a
complete decomposition of Ω in pairwise disjoint events, then the
probability of an event B can be calculated as
P(B) = Σ_{i=1}^{m} P(B|Ai) P(Ai).    (6.9)

6.4.1 Bayes’ Theorem


Bayes’ Theorem gives a connection between P(A|B) and P(B|A). For events A
and B with P(A) > 0 and P(B) > 0, using (6.6) and (6.7) or (6.8), we get
P(A|B) = P(A ∩ B)/P(B) = P(B|A) P(A) / P(B).    (6.10)

Let A1, A2, …, Am be events such that A1 ∪ A2 ∪ ⋯ ∪ Am = Ω and
Ai ∩ Aj = ∅ for i ≠ j, P(Ai) > 0 for all i, and B is another event than A, then using
(6.9) and (6.10), we get
P(Aj|B) = P(B|Aj) P(Aj) / Σ_{i=1}^{m} P(B|Ai) P(Ai).    (6.11)
The probabilities P(Ai) are called prior probabilities, P(B|Ai) are sometimes
called model probabilities and P(Aj|B) are called posterior probabilities .

Example 6.4.1
Suppose someone rents movies from two different DVD stores. Sometimes it
happens that the DVD does not work because of scratches. We consider the
following events: : “the DVD is rented from store i”. Further let B
denote the event that the DVD is working without any problems. Assume we
know that and (note that ) and ,
and we are interested in the probability that a rented DVD works
fine. We can then apply the Law of Total Probability and get

We may also be interested in the probability that the movie was rented from
store 1 and is working which is
Now suppose we have a properly working DVD. What is the probability that it is
rented from store 1? This is obtained as follows:

Now assume we have a DVD which does not work, i.e. occurs. The
probability that a DVD is not working given that it is from store 1 is
. Similarly, for store 2. We can now calculate the
conditional probability that a DVD is from store 1 given that it is not working:

The result about P(B̄) used in the denominator can also be directly obtained by
using P(B̄) = 1 − P(B).
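The structure of this calculation is easy to mirror in R. The probabilities stated in the example are not reproduced above, so the numbers below are only assumed placeholders; the point is the use of the Law of Total Probability and Bayes’ Theorem:

p_A  <- c(store1 = 0.6, store2 = 0.4)    # prior probabilities P(A_i), assumed values
p_BA <- c(store1 = 0.95, store2 = 0.90)  # P(B | A_i), "DVD works", assumed values
p_B  <- sum(p_BA * p_A)                  # P(B) via the Law of Total Probability, (6.9)
p_BA * p_A / p_B                         # posterior probabilities P(A_i | B), Bayes' Theorem (6.11)
(1 - p_BA["store1"]) * p_A["store1"] / (1 - p_B)   # P(A_1 | DVD not working)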

6.5 Independence
Intuitively, two events are independent if the occurrence or non-occurrence of
one event does not affect the occurrence or non-occurrence of the other event. In
other words, two events A and B are independent if the probability of occurrence
of B has no effect on the probability of occurrence of A. In such a situation, one
expects that
P(A|B) = P(A)  and  P(B|A) = P(B).
Using this and (6.7), we can write
P(A|B) = P(A ∩ B)/P(B) = P(A).    (6.12)

This yields:
P(A ∩ B) = P(A) P(B).    (6.13)

This leads to the following definition of stochastic independence.


Definition 6.5.1
Two random events A and B are called (stochastically) independent if
P(A ∩ B) = P(A) P(B),    (6.14)
i.e. if the probability of simultaneous occurrence of both events A and B is the
product of the individual probabilities of occurrence of A and B.

This definition of independence can be extended to the case of more than two
events as follows:

Definition 6.5.2
The n events A1, A2, …, An are stochastically mutually independent, if for any
subset of m events Ai1, Ai2, …, Aim (m ≤ n)
P(Ai1 ∩ Ai2 ∩ ⋯ ∩ Aim) = P(Ai1) P(Ai2) ⋯ P(Aim)    (6.15)
holds.

A weaker form of independence is pairwise independence. If condition (6.15) is


fulfilled only for two arbitrary events, i.e. m = 2, then the events are called
pairwise independent . The difference between pairwise independence and
general stochastic independence is explained in the following example.

Example 6.5.1
Consider an urn with four balls. The following combinations of zeroes and ones
are printed on the balls: 110, 101, 011, 000. One ball is drawn from the urn.
Define the following events:
Ai: “the ith digit printed on the ball is 1”, i = 1, 2, 3.
Since there are two favourable simple events for each of the events A1, A2 and
A3, we get
P(A1) = P(A2) = P(A3) = 2/4 = 1/2.
The probability that all the three events simultaneously occur is zero because
there is no ball with 111 printed on it. Therefore, A1, A2, and A3 are not
stochastically independent because
P(A1) P(A2) P(A3) = 1/8 ≠ 0 = P(A1 ∩ A2 ∩ A3).
However,
P(A1 ∩ A2) = 1/4 = P(A1) P(A2),  P(A1 ∩ A3) = 1/4 = P(A1) P(A3),  P(A2 ∩ A3) = 1/4 = P(A2) P(A3),
which means that the three events are pairwise independent.

6.6 Key Points and Further Issues

6.7 Exercises
Exercise 6.1
(a) Suppose , , , .
Determine , , , , .

(b) Now consider the three pairwise disjoint events E, F, G with


and and . Calculate , P(G), , , and
.

Exercise 6.2
A driving licence examination consists of two parts which are based on a
theoretical and a practical examination. Suppose 25 % of people fail the
practical examination, 15 % of people fail the theoretical examination, and 10 %
of people fail both the examinations. If a person is randomly chosen, then what
is the probability that this person

(a) fails at least one of the examinations?

(b) only fails the practical examination, but not the theoretical examination?

(c) successfully passes both the tests?

(d) fails any of the two examinations?

Exercise 6.3
A new board game uses a twelve-sided die. Suppose the die is rolled once, what
is the probability of getting

(a) an even number?

(b) a number greater than 9?

(c) an even number greater than 9?


(d) an even number or a number greater than 9?

Exercise 6.4
The Smiths are a family of six. They are celebrating Christmas and there are 12
gifts, two for each family member. The name tags for each family member have
been attached to the gifts. Unfortunately the name tags on the gifts are damaged
by water. Suppose each family member draws two gifts at random. What is the
probability that someone

(a) gets his/her two gifts, rather than getting the gifts for another family
member?

(b) gets none of his/her gifts, but rather gets the gifts for other family
members?

Exercise 6.5
A chef from a popular TV cookery show sometimes puts too much salt in his
pumpkin soup and the probability of this happening is 0.2. If he is in love (which
he is with probability 0.3), then the probability of using too much salt is 0.6.

(a) Create a contingency table for the probabilities of the two variables “in
love” and “too much salt”.

(b) Determine whether the two variables are stochastically independent or not.

Exercise 6.6
Dr. Obermeier asks his neighbour to take care of his basil plant while he is away
on leave. He assumes that his neighbour does not take care of the basil with a
probability of . The basil dies with probability when someone takes care of it
and with probability if no one takes care of it.

(a) Calculate the probability of the basil plant surviving after its owner’s leave.
(b) It turns out that the basil eventually dies. What is the probability that Dr.
Obermeier’s neighbour did not take care of the plant?

Exercise 6.7
A bank considers changing its credit card policy. Currently 5 % of credit card
owners are not able to pay their bills in any month, i.e. they never pay their bills.
Among those who are generally able to pay their bills, there is still a 20 %
probability that the bill is paid too late in a particular month.

(a) What is the probability that someone is not paying his bill in a particular
month?

(b) A credit card owner did not pay his bill in a particular month. What is the
probability that he never pays back the money?

(c) Should the bank consider blocking the credit card if a customer does not
pay his bill on time?

Exercise 6.8
There are epidemics which affect animals such as cows, pigs, and others.
Suppose 200 cows are tested to see whether they are infected with a virus or not.
Let event A describe whether a cow has been transported by a truck recently or
not and let B denote the event that a cow has been tested positive with a virus.
The data are summarized in the following table:

          B     B̄     Total
A         40    60     100
Ā         20    80     100
Total     60   140     200

(a) What is the probability that a cow is infected and has been transported by a
truck recently?

(b) What is the probability of having an infected cow given that it has been
transported by the truck?

(c) Determine and interpret P(B).

Exercise 6.9
A football practice target is a portable wall with two holes (which are the target)
in it for training shots. Suppose there are two players A and B. The probabilities
of hitting the target by A and B are 0.4 and 0.5, respectively.

(a) What is the probability that at least one of the players succeeds with his
shot?

(b) What is the probability that exactly one of the players hits the target?

(c) What is the probability that only B scores?

Source Toutenburg, H., Heumann, C., Induktive Statistik, 4th edition, 2007,
Springer, Heidelberg

7. Random Variables

In the first part of the book we highlighted how to describe data. Now, we
discuss the concepts required to draw statistical conclusions from a sample of
data about a population of interest. For example, suppose we know the starting
salary of a sample of 100 students graduating in law. We can use this knowledge
to draw conclusions about the expected salary for the population of all students
graduating in law. Similarly, if a newly developed drug is given to a sample of
selected tuberculosis patients, then some patients may show improvement and
some patients may not, but we are interested in the consequences for the entire
population of patients. In the remainder of this chapter, we describe the
theoretical concepts required for making such conclusions. They form the basis
for statistical tests and inference which are introduced in Chaps. 9–11.

7.1 Random Variables


Random variables help us to view the collected data as an outcome of a random
experiment. Consider the simple experiment of tossing a coin. If a coin is tossed,
then one can observe either “head” (H) or “tail” (T). The occurrence of “head” or
“tail” is random, and the exact outcome will only be known after the coin is
tossed. We can toss the coin many times and obtain a sequence of outputs. For
example, if a coin is tossed seven times, then one of the outcomes may be
H, H, T, H, T, T, T. This outcome is the consequence of a random experiment,
and it may be helpful if we can distill the sequence of outcomes in meaningful
numbers. One option is to summarize them by a variable X, which takes the
values 1 (denoting head) and 0 (denoting tail). We have learnt from
Chap. 6 that this can be described in the framework of a random experiment
where Ω = {head, tail} with the events A1 = {head} and A2 = {tail}.
The random variable X is now mapped to real numbers, and therefore, it
describes the possible outcome of any coin toss experiment. The observed
outcomes H, H, T, H, T, T, T relate to a specific sample, a unique realization of
this experiment. We can write X(ω) = 1 for ω = head and X(ω) = 0 for ω = tail, with
0, 1 ∈ ℝ, where ℝ is the set of real numbers. We know that in any coin tossing
experiment, the probability of head being observed is P(X = 1) = 0.5 and of
tail being observed is P(X = 0) = 0.5. We may therefore view X as a random
variable which collects the possible outcomes of a random experiment and
captures the uncertainty associated with them.

Definition 7.1.1
Let Ω represent the sample space of a random experiment, and let ℝ be the set
of real numbers. A random variable is a function X which assigns to each
element ω ∈ Ω one and only one number X(ω) = x, x ∈ ℝ, i.e.
X : Ω → ℝ.    (7.1)

Example 7.1.1
The features of a die roll experiment, a roulette game, or the lifetime of a TV can
all be described by a random variable, see Table 7.1. The events involve
randomness, and if we have knowledge about the random process, we can assign
probabilities P(X = xi) to each event, e.g. when rolling a die, the probability of
getting a “1” is P(X = 1) = 1/6 and the probability of getting a “2” is
P(X = 2) = 1/6.

Table 7.1 Examples of random variables


X Event Realizations of X
Roll of a die : number i

Lifetime of TV : survival time is i months

Roulette : red

: black

: green (zero)

Note that it is a convention to denote random variables by capital letters (e.g.


X) and their values by small letters (e.g. x). It is evident from the coin tossing
experiment that we need to know to describe the respective random
variable. We assume in this chapter that we have this knowledge. However,
Chaps. 9–11 show how a sample of data can be used to estimate unknown
probabilities and other quantities given a prespecified uncertainty level. More
generally, we can say that it is mandatory to know P(X ∈ A) for all possible A
which are subsets of ℝ. If we choose A = (−∞, x], we have
P(X ∈ A) = P(X ≤ x) = P(−∞ < X ≤ x).
This consideration gives rise to the definition of the cumulative distribution


function. Recall that we developed the concept of the empirical cumulative
distribution function (ECDF) in Chap. 2, Sect. 2.2, but the definition there was
empirical. Now, we develop it theoretically.

7.2 Cumulative Distribution Function (CDF)


Definition 7.2.1
The cumulative distribution function (CDF) of a random variable X is defined
as
F(x) = P(X ≤ x).    (7.2)

As in Chap. 2, we can see that the CDF is useful in obtaining the probabilities
related to the occurrence of random events. Note that the empirical cumulative
distribution function (ECDF, Sect. 2.2) and the cumulative distribution function
are closely related and therefore have a similar definition and similar calculation
rules. However, in Chap. 2, we work with the cumulative distribution of
observed values in a particular sample whereas in this chapter, we deal with
random variables modelling the distribution of a general population.
The Definition 7.2.1 implies the following properties of the cumulative
distribution function:
F(x) is a monotonically non-decreasing function
(if x1 ≤ x2, it follows that F(x1) ≤ F(x2)),
lim_{x→−∞} F(x) = 0 (the lower limit of F is 0),
lim_{x→+∞} F(x) = 1 (the upper limit of F is 1),
F(x) is continuous from the right, and
0 ≤ F(x) ≤ 1 for all x ∈ ℝ.
Another notation for F(x) is F_X(x), but we use F(x).

7.2.1 CDF of Continuous Random Variables


Before giving some examples about the meaning and interpretation of the CDF,
we first need to consider some definitions and theorems.

Definition 7.2.2
A random variable X is said to be continuous if there is a function f(x) such that
for all x ∈ ℝ
F(x) = ∫_{−∞}^{x} f(t) dt    (7.3)
holds. F(x) is the cumulative distribution function (CDF) of X, and f(x) is the
probability density function (PDF) of x and (d/dx) F(x) = f(x) for all x that are
continuity points of f.

Theorem 7.2.1
For a function f(x) to be a probability density function (PDF) of X, it needs to
satisfy the following conditions:

(1) f(x) ≥ 0 for all x ∈ ℝ,

(2) ∫_{−∞}^{+∞} f(x) dx = 1.

Theorem 7.2.2
Let X be a random variable with CDF F(x). If x1 < x2, where x1 and x2 are
known constants, P(x1 ≤ X ≤ x2) = F(x2) − F(x1).

Theorem 7.2.3
The probability of a continuous random variable taking a particular value is
zero:
P(X = x0) = 0.    (7.4)
The proof is provided in Appendix C.2.

Example 7.2.1
Consider the continuous random variable “waiting time for the train”. Suppose
that a train arrives every 20 min. Therefore, the waiting time of a particular
person is random and can be any time contained in the interval [0, 20]. We can
start describing the required probability density function as
f(x) = k for 0 ≤ x ≤ 20, and f(x) = 0 otherwise,
where k is an unknown constant. Now, using condition (2) of Theorem 7.2.1, we
have
∫_{0}^{20} k dx = 20k = 1,
which needs to be fulfilled. This yields k = 1/20 which is always greater than 0,
and therefore, condition (1) of Theorem 7.2.1 is also fulfilled. It follows that
f(x) = 1/20 for 0 ≤ x ≤ 20, and f(x) = 0 otherwise,
is the probability density function describing the waiting time for the train. We
can now use Definition 7.2.2 to determine the cumulative distribution function:
F(x) = 0 for x < 0,  F(x) = x/20 for 0 ≤ x ≤ 20,  F(x) = 1 for x > 20.
Suppose we are interested in calculating the probability of a waiting time
between 15 and 20 min. This can be calculated using Theorem 7.2.2:
P(15 ≤ X ≤ 20) = F(20) − F(15) = 20/20 − 15/20 = 0.25.
We can obtain this probability from the graph of the CDF as well, see Fig. 7.1
where both the PDF and CDF of this example are illustrated.
Fig. 7.1 Probability density function (PDF) and cumulative distribution function (CDF) for waiting time in
Example 7.2.1

Defining a function, for example the CDF, is simple in R: One can use the
function command followed by specifying the variables the function
evaluates in round brackets (e.g. x) and the function itself in braces (e.g. x / 20).
Functions can be plotted using the curve command:
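A minimal sketch of such a call, using the CDF x/20 of the example and restricting the plot to the interval [0, 20]:

cdf <- function(x){x/20}
curve(cdf, from = 0, to = 20)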

Alternatively, the plot command can be used to plot vectors against each
other; for example, after defining a function, we can define a sequence (
), evaluate this sequence via the specified function
(cdf(x)), and plot them against each other and connect the points from the
sequence with a line (plot(x,cdf(x),type=’l’)).
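A sketch of this alternative; the bounds and step size of the sequence are chosen here to match the example, as the original values are not shown above:

cdf <- function(x){x/20}
x <- seq(0, 20, 0.1)
plot(x, cdf(x), type = 'l')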
This example illustrates how the cumulative distribution function can be
used to obtain probabilities of interest. Most importantly, if we want to calculate
the probability that the random variable X takes values in the interval , we
simply have to look at the difference of the respective CDF values at and .
Figure 7.2a highlights that the interval probability corresponds to the difference
of the CDF values on the y-axis.
We can also use the probability density function to visualize .
We know from Theorem 7.2.1 that , and therefore, the area under
the PDF equals 1. Thus, we can interpret interval probabilities as the area under
the PDF between and . This is presented in Fig. 7.2b.

Fig. 7.2 Graphical representation of the probability a via the CDF and b via the PDF

7.2.2 CDF of Discrete Random Variables


Definition 7.2.3
A random variable X is defined to be discrete if its probability space is either
finite or countable, i.e. if it takes only a finite or countable number of values.
Note that a set V is said to be countable, if its elements can be listed, i.e. there is
a one-to-one correspondence between V and the positive integers.

Example 7.2.2
Consider the example of tossing of a coin where each trial results in either a
head (H) or a tail (T), each occurring with the same probability 0.5. When the
coin is tossed multiple times, we may observe sequences such as
H, T, H, H, T, H, H, T, and so on. The sample space is Ω = {H, T}. Let the random
variable X denote the number of trials required to get the third head, then
X = 4 for the given sequence above. Clearly, the space of X is the set {3, 4, 5, …}. We
can see that X is a discrete random variable because its space is finite and can be
counted. We can also assign certain probabilities to each of these values, e.g.
and .
Definition 7.2.4
Let X be a discrete random variable which takes k different values. The
probability mass function (PMF) of X is given by
P(X = xi) = pi,  i = 1, 2, …, k.    (7.5)
It is required that the probabilities pi satisfy the following conditions:

(1) 0 ≤ pi ≤ 1,

(2) Σ_{i=1}^{k} pi = 1.

Definition 7.2.5
Given (7.5), we can write the CDF of a discrete random variable as
F(x) = Σ_{i=1}^{k} pi I_{{xi ≤ x}},    (7.6)
where I is an indicator function defined as
I_{{xi ≤ x}} = 1 if xi ≤ x, and I_{{xi ≤ x}} = 0 otherwise.
The CDF of a discrete variable is always a step function.


Working with the CDF for Discrete Random variables
We can easily calculate various types of probabilities for discrete random
variables using the CDF. Let a and b be some known constants, then
P(X ≤ a) = F(a),    (7.7)
P(X < a) = F(a) − P(X = a),    (7.8)
P(X > a) = 1 − F(a),    (7.9)
P(X ≥ a) = 1 − F(a) + P(X = a),    (7.10)
P(a < X ≤ b) = F(b) − F(a),    (7.11)
P(a ≤ X ≤ b) = F(b) − F(a) + P(X = a),    (7.12)
P(a < X < b) = F(b) − F(a) − P(X = b),    (7.13)
P(a ≤ X < b) = F(b) − F(a) − P(X = b) + P(X = a).    (7.14)

Remark 7.2.1
The Eqs. (7.7)–(7.14) can also be used for continuous variables, but in this case,
P(X = a) = P(X = b) = 0 (see Theorem 7.2.3), and therefore, Eqs. (7.7)–(7.14) can
be modified accordingly.

Example 7.2.3
Consider the experiment of rolling a die. There are six possible outcomes. If we
define the random variable X as the number of dots observed on the upper
surface of the die, then the six possible outcomes can be described as
x1 = 1, x2 = 2, …, x6 = 6. The respective probabilities are
P(X = xi) = 1/6, i = 1, 2, …, 6. The PMF and CDF are therefore defined as follows:

Both the CDF and the PDF are displayed in Fig. 7.3.

Fig. 7.3 Probability density function and cumulative distribution function for rolling a die in Example
7.2.3. A filled circle relates to an included value and an open circle to an excluded value

We can use the CDF to calculate any desired probability, e.g.
P(X ≤ 5) = F(5) = 5/6. This is shown in Fig. 7.3b where for x = 5, we obtain
F(5) = 5/6 when evaluating the CDF on the y-axis. Similarly,
P(3 < X ≤ 5) = F(5) − F(3) = 5/6 − 3/6 = 1/3 can be interpreted as the difference
of F(5) and F(3) on the y-axis.
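In R, these step-function probabilities can be obtained from the cumulative sums of the PMF:

pmf <- rep(1/6, 6)       # P(X = 1), ..., P(X = 6)
cdf <- cumsum(pmf)       # F(1), ..., F(6)
cdf[5]                   # P(X <= 5) = 5/6
cdf[5] - cdf[3]          # P(3 < X <= 5) = 1/3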

7.3 Expectation and Variance of a Random Variable


We have seen that both the probability density function (or probability mass
function) and the cumulative distribution function are helpful in characterizing
the features of a random variable. Some other features of random variables are
characterized by the concepts of expectation and variance.

7.3.1 Expectation
Definition 7.3.1
The expectation of a continuous random variable X, having the probability
density function f(x) with ∫ |x| f(x) dx < ∞, is defined as
E(X) = ∫_{−∞}^{+∞} x f(x) dx.    (7.15)
For a discrete random variable X, which takes the values x1, x2, … with
respective probabilities p1, p2, …, the expectation of X is defined as
E(X) = Σ_i xi pi = x1 P(X = x1) + x2 P(X = x2) + ⋯    (7.16)

The expectation of X, i.e. E(X), is usually denoted by μ = E(X) and relates to the


arithmetic mean of the distribution of the population. It reflects the central
tendency of the population.

Example 7.3.1
Consider again Example 7.2.1 where the waiting time for a train was described
by the following probability density function:
f(x) = 1/20 for 0 ≤ x ≤ 20, and f(x) = 0 otherwise.
We can calculate the expectation as follows:
E(X) = ∫_{0}^{20} x · (1/20) dx = [x²/40]_{0}^{20} = 400/40 = 10.
The “average” waiting time for the train is therefore 10 min. This means that if a
person has to wait for the train every day, then the person will experience
waiting times varying randomly between 0 and 20 min and, on average, has to
wait for 10 min.
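The integral can also be evaluated numerically in R:

integrate(function(x) x * 1/20, lower = 0, upper = 20)   # E(X) = 10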

Example 7.3.2
Consider again the die roll experiment from Example 7.2.3. The probabilities for
the occurrence of any xi = i, i = 1, 2, …, 6, are P(X = xi) = 1/6. The expectation can
thus be calculated as
E(X) = 1 · (1/6) + 2 · (1/6) + ⋯ + 6 · (1/6) = 21/6 = 3.5.
7.3.2 Variance
The variance describes the variability of a random variable. It gives an idea
about the concentration or dispersion of values around the arithmetic mean of the
distribution.

Definition 7.3.2
The variance of a random variable X is defined as
Var(X) = E[(X − E(X))²].    (7.17)
The variance of a continuous random variable X is
Var(X) = ∫_{−∞}^{+∞} (x − μ)² f(x) dx,    (7.18)
where μ = E(X). Similarly, the variance of a discrete random variable X
is
Var(X) = Σ_i (xi − μ)² pi,    (7.19)
where μ = E(X). The variance is usually denoted by σ² = Var(X).

Definition 7.3.3
The positive square root of the variance is called the standard deviation.

Example 7.3.3
Recall Examples 7.2.1 and 7.3.1. We can calculate the variance of the waiting
time for a train using the probability density function
f(x) = 1/20 for 0 ≤ x ≤ 20, and f(x) = 0 otherwise,
and E(X) = 10 (already calculated in Example 7.3.1). Using (7.18), we obtain:
Var(X) = ∫_{0}^{20} (x − 10)² · (1/20) dx = [(x − 10)³/60]_{0}^{20} = (1000 + 1000)/60 = 100/3 ≈ 33.33.
The standard deviation is √(100/3) ≈ 5.77 min.

Recall that in Chap. 3, we introduced the sample variance and the sample
standard deviation. We already know that the standard deviation has the same
unit of measurement as the variable, whereas the unit of the variance is the
square of the measurement unit. The standard deviation measures how the values
of a random variable are dispersed around the population mean. A low value of
the standard deviation indicates that the values are highly concentrated around
the mean. A high value of the standard deviation indicates lower concentration
of the data values around the mean, and the observed values may be far away
from the mean. These considerations are helpful in making connections between
random variables and samples of data, see Chap. 9 for the construction of
confidence intervals.

Example 7.3.4
Recall Example 7.3.2 where we calculated the expectation of a die roll
experiment as E(X) = 3.5. With xi = i, i = 1, 2, …, 6, and
pi = 1/6, the variance for this example corresponds to
Var(X) = Σ_i (xi − 3.5)² · (1/6) = 2 · (6.25 + 2.25 + 0.25)/6 = 17.5/6 ≈ 2.92.
Theorem 7.3.1
The variance of a random variable X can be expressed as
Var(X) = E(X²) − [E(X)]².    (7.20)
The proof is given in Appendix C.2.

Example 7.3.5
In Examples 7.2.1, 7.3.1, and 7.3.3, we evaluated the waiting time for a train
using the PDF
f(x) = 1/20 for 0 ≤ x ≤ 20, and f(x) = 0 otherwise.
We calculated the expectation and variance in Eqs. (7.15) and (7.17) as 10 min
and 100/3 ≈ 33.33 min², respectively. Theorem 7.3.1 tells us that we can calculate the
variance in a different way as follows:
E(X²) = ∫_{0}^{20} x² · (1/20) dx = 8000/60 = 400/3;  Var(X) = E(X²) − [E(X)]² = 400/3 − 100 = 100/3 ≈ 33.33.
This yields the same result as Eq. (7.18) but is much quicker.
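Numerically in R:

EX2 <- integrate(function(x) x^2 * 1/20, lower = 0, upper = 20)$value   # E(X^2) = 400/3
EX2 - 10^2                                                              # Var(X), approx. 33.33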

7.3.3 Quantiles of a Distribution


We introduced the concept of quantiles in Chap. 3, Sect. 3.1.2. Now, we define
quantiles in terms of the distribution function.

Definition 7.3.4
The value xp for which the cumulative distribution function is
F(xp) = p, 0 < p < 1,    (7.21)
is called the p-quantile .

It follows from Definition 7.3.4 that xp is the value which divides the cumulative
distribution function into two parts: the probability of observing a value left of
xp is p, whereas the probability of observing a value right of xp is 1 − p. For
example, the 0.25-quantile x0.25 describes the x-value for which the probability
of observing x0.25 or any smaller value is 0.25. Figure 7.4 shows the 0.25-
quantile (first quartile), the 0.5-quantile (median), and the 0.75-quantile (third
quartile) in a cumulative distribution function.

Fig. 7.4 First quartile, median, and third quartile

Example 7.3.6
Recall Examples 7.2.1, 7.3.1, 7.3.5 and Fig. 7.1b where we described the waiting
time for a train by using the following CDF:
F(x) = x/20 for 0 ≤ x ≤ 20.
The first quartile x0.25 is 5 because F(5) = 5/20 = 0.25. This means that the


probability of waiting for the train for 5 min or less is 25 % and of waiting for
longer than 5 min is 75 %.

For continuous variables, there is a unique value which describes the p-quantile.
However, for discrete variables, this may not necessarily be true. In this case, the
p-quantile is chosen such that

holds.
Example 7.3.7
The cumulative distribution function for rolling a die is described in Example
7.2.3 and Fig. 7.3b. The first quartile is 2 because F(2) = 2/6 ≥ 0.25 and
F(x) < 0.25 for x < 2.

7.3.4 Standardization
Standardization transforms a random variable in such a way that it has an
expectation of zero and a variance of one. More details on the need for
standardization are discussed in Chap. 10.

Definition 7.3.5
A random variable Y is called standardized when E(Y) = 0 and Var(Y) = 1.

Theorem 7.3.2
Suppose a random variable X has mean E(X) = μ and Var(X) = σ² > 0. Then, it can be
standardized as follows:
Y = (X − μ)/σ = (X − E(X))/√Var(X).    (7.22)

Example 7.3.8
In Examples 7.2.1, 7.3.1, and 7.3.5, we considered the waiting time X for a train.
The random variable X can take values between 0 and 20 min, and we calculated
E(X) = 10 and Var(X) = 100/3. The standardized variable of X is
Y = (X − 10)/√(100/3).
One can show that E(Y) = 0 and Var(Y) = 1, see also Exercise 7.10 for more
details.

7.4 Tschebyschev’s Inequality


If we do not know the distribution of a random variable X, we can still make
statements about the probability that X takes values in a certain interval (which
has to be symmetric around the expectation μ) if the mean μ and the variance σ²
of X are known.

Theorem 7.4.1
(Tschebyschev’s inequality) Let X be a random variable with E(X) = μ and
Var(X) = σ². It holds that
P(|X − μ| ≥ c) ≤ Var(X)/c²  for any c > 0.    (7.23)
This is equivalent to
P(|X − μ| < c) ≥ 1 − Var(X)/c².    (7.24)
The proof is given in Appendix C.2.

Example 7.4.1
In Examples 7.2.1, 7.3.1, and 7.3.5, we have worked with a random variable
which describes the waiting time for a train. We determined E(X) = 10 and
Var(X) = 100/3. We can calculate the probability of waiting between 10 − 7 = 3 and 10 + 7 = 17
min:
P(|X − 10| < 7) ≥ 1 − (100/3)/7² ≈ 0.32.
The probability is therefore at least 0.32. However, if we apply our distributional
knowledge that P(a ≤ X ≤ b) = F(b) − F(a) (for a ≤ b), then we obtain a much more precise
result which is
P(3 ≤ X ≤ 17) = F(17) − F(3) = 17/20 − 3/20 = 0.7.
We can clearly see that Tschebyschev’s inequality gives us the correct answer,
that is P(3 ≤ X ≤ 17) is greater than 0.32. Nevertheless, the approximation to the exact
probability, 0.7, is rather poor. One needs to keep in mind that only the lack of
distributional knowledge makes the inequality useful.

7.5 Bivariate Random Variables


There are many situations in which we are interested in analysing more than one
variable, say two variables. When we have more than one variable, then not only
their individual distributions but also their joint distribution can be of interest.
For example, we know that driving a car after drinking alcohol is not necessarily
safe. If we consider two variables, the blood alcohol content X and number of car
accidents Y, then we may be interested in the probability of having a high blood
alcohol content and a car accident at the same time. If we analyse (X, Y) jointly,
then we are interested in their joint bivariate distribution . This
distribution can either be discrete or continuous.
Discrete Bivariate Random Variables. Suppose we have two categorical
variables X and Y which can take the values x1, x2, …, xI and y1, y2, …, yJ,
respectively. Their joint probability distribution function is characterized by
P(X = xi, Y = yj) = pij
with Σ_{i=1}^{I} Σ_{j=1}^{J} pij = 1. This means that the probability of observing xi and yj
together is pij. We can summarize this information in a contingency table as
follows:
follows:

                    Y
          1      2      …      J      Total
X    1    p11    p12    …      p1J    p1+
     2    p21    p22    …      p2J    p2+
     ⋮
     I    pI1    pI2    …      pIJ    pI+
  Total   p+1    p+2    …      p+J    1

Each cell contains a “piece” of the joint distribution. The entries
p+1, p+2, …, p+J in the bottom row of the table summarize the marginal
distribution of Y, which is the distribution of Y without giving reference to X.
The entries p1+, p2+, …, pI+ in the last column summarize the marginal
distribution of X. The marginal distributions can therefore be expressed as
P(X = xi) = pi+ = Σ_{j=1}^{J} pij  and  P(Y = yj) = p+j = Σ_{i=1}^{I} pij.
The conditional distributions of X given Y = yj and Y given X = xi are
given as follows:
P(X = xi | Y = yj) = pij / p+j  and  P(Y = yj | X = xi) = pij / pi+.

They summarize the distribution of X for a given value of yj (or the
distribution of Y for a given value of xi) and play a crucial role in the
construction of regression models such as the linear regression model introduced
in Chap. 11. Please also recall the definitions of Sect. 4.1 where we introduced
conditional and marginal distributions for data samples rather than random
variables.

Example 7.5.1
Suppose we have a contingency table on smoking behaviour X
(1: never, 2: sometimes, and 3: regularly) and education
level Y (1: primary, 2: secondary, and 3: tertiary):

Y
1 2 3 Total
X 1 0.10 0.20 0.30 0.60
2 0.10 0.10 0.10 0.30
3 0.08 0.01 0.01 0.10
Total 0.28 0.31 0.41 1

The cell entries represent the joint distribution of smoking behaviour and
education level. We can interpret each entry as the probability of observing
X = xi and Y = yj simultaneously. For example, P(X = 2, Y = 3) (“smoking sometimes
and tertiary education”) = 0.10. The marginal distribution of X is contained in the
last column of the table and lists the probabilities of smoking (unconditional on
education level), e.g. the probability of being a non-smoker in this population is
60 %. We can also interpret the conditional distributions: P(X | Y = 3) represents
the distribution of smoking behaviour among those who have tertiary education.
If we are interested in the probability of smoking sometimes given tertiary
education is completed, then we calculate P(X = 2 | Y = 3) = 0.10/0.41 ≈ 0.24.
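The marginal and conditional distributions can be read off in R after storing the joint probabilities as a matrix:

joint <- matrix(c(0.10, 0.20, 0.30,
                  0.10, 0.10, 0.10,
                  0.08, 0.01, 0.01), nrow = 3, byrow = TRUE)
rowSums(joint)                 # marginal distribution of X: 0.60, 0.30, 0.10
joint[, 3] / sum(joint[, 3])   # conditional distribution of X given Y = 3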

Continuous Bivariate Random Variables.

Definition 7.5.1
A bivariate random variable (X, Y) is continuous if there is a function f(x, y)
such that
F(x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(u, v) dv du    (7.25)
holds.

The function F(x, y) is the joint cumulative distribution function of X and Y;
the joint probability density function is denoted by f(x, y), and f(x, y) has to fulfil the
usual conditions of a density function.
function is a bivariate cumulative distribution function are as follows:

and
The last condition is sometimes referred to as the rectangle inequality. As in
the univariate case, we can use the cumulative distribution function to calculate
interval probabilities; similarly, we look at the rectangular area defined by
, and in the bivariate case (instead of an interval
[a, b]), see Fig. 7.5.

Fig. 7.5 Area covering all points of (X, Y) with


We can calculate the desired probabilities as follows:

The marginal distributions of X and Y are
fX(x) = ∫_{−∞}^{+∞} f(x, y) dy  and  fY(y) = ∫_{−∞}^{+∞} f(x, y) dx,
respectively. Similar to the discrete case, fX(x) and fY(y) also describe the
distribution of X unconditional on Y and the distribution of Y unconditional on X.
The cumulative marginal distributions are
FX(x) = ∫_{−∞}^{x} fX(t) dt  and  FY(y) = ∫_{−∞}^{y} fY(t) dt.
The conditional distributions can be obtained by the ratio of the joint and
marginal distributions:
fX|Y(x|y) = f(x, y)/fY(y)  and  fY|X(y|x) = f(x, y)/fX(x).
Example 7.5.2
Consider the function

Suppose X and Y represent the concentrations of two drugs in the human body.
Then, may represent the sum of two drug concentrations in the human
body. Since there are infinite possible realizations of both X and Y, we represent
their joint distribution in a figure rather than a table, see Fig. 7.6a.
Fig. 7.6 Joint and marginal distribution for Example 7.5.2

The marginal distributions for X and Y can be calculated as follows:

Figure 7.6b depicts the marginal distribution for X. The slope of the marginal
distribution is essentially the slope of the surface of the joint distribution shown
in Fig. 7.6a. It is easy to see in this simple example that the marginal distribution
of X is nothing but a cut in the surface of the joint distribution. Note that the
conditional distributions and can be easily calculated; for
example, .

Stochastic Independence.

Definition 7.5.2
Two continuous random variables X and Y are said to be stochastically
independent if
f(x, y) = fX(x) fY(y).    (7.26)
For discrete variables, this is equivalent to
P(X = xi, Y = yj) = P(X = xi) P(Y = yj)    (7.27)
being valid for all (i, j).

Example 7.5.3
In Example 7.5.2, we considered the function

with the marginal distributions of X and Y as and ,


respectively. Since = , it follows that X and Y are not
independent. The interpretation is that the concentrations of the two drugs are
not independent.

7.6 Calculation Rules for Expectation and Variance


Calculation Rules for the Expectation. For any constant values a and b, and
any random variables X and Y, the following rules hold:
(7.28)
(7.29)
(7.30)
(7.31)
The proof of rule (7.30) is given in Appendix C.2.

Example 7.6.1
Consider again Example 7.2.3 where we illustrated how the outcome of a die roll
experiment can be captured by a random variable. There were 6 events, and X
could take the values . The probability of the occurrence
of any number was , and the expectation was calculated as 3.5.
Consider two different situations:

(i) Suppose the die takes the values 10, 20, 30, 40, 50, and 60 instead of the
values 1, 2, 3, 4, 5, and 6. The random variable describes this
suitably, and its expectation is

which follows from (7.29).

(ii) If we are rolling two dice and , then the expectation for the sum of
the two outcomes is

due to (7.31).

Calculation Rules for the Variance. Let a and b be any known constants and X
be a random variable (discrete or continuous). Then, we have the following
rules:
(7.32)
(7.33)
(7.34)
The proof of rule (7.34) is given in Appendix C.2.

Example 7.6.2
In Examples 7.2.1, 7.3.1, 7.3.3, and 7.3.5, we evaluated a random variable
describing the waiting time for a train. Now, suppose that a person first has to
catch a bus to get to the train station. If this bus arrives only every 60 min, then
the PDF of the random variable Y denoting the waiting time for the bus is

We can use Eqs. (7.15) and (7.17) to determine both the expectation and
variance of Y. However, the waiting time for the bus is governed by the relation
where X is the waiting time for the train. Therefore, we can calculate
min by using rule (7.29) and the variance as
using rule (7.33). The total waiting time
is the sum of the two waiting times.

7.6.1 Expectation and Variance of the Arithmetic Mean


Definition 7.6.1
We define the random variables to be i.i.d. (independent and
identically distributed) if they all follow the same distribution and are
stochastically independent of each other.

Let be n i.i.d. random variables with and


. The arithmetic mean of these variables is given by

which is again a random variable that follows a distribution with certain


expectation and variance. A function of random variables is called a statistic. By
using (7.29) and (7.31), we obtain

(7.35)
If we apply (7.34) and recall that the variables are independent of each other, we
can also calculate the variance as
(7.36)

Example 7.6.3
If we toss a coin, we obtain either head or tail, and therefore,
. If we toss the coin n times, we have for each toss

It is straightforward to calculate the expectation and variance for each coin toss:

The arithmetic mean describes the relative frequency of heads when


the coin is tossed n times. We can now apply (7.35) and (7.36) to calculate

and

With this example, the interpretation of formulae (7.35) and (7.36) becomes
clearer: if the probability of a head is 0.5 for a single toss, then the expectation of
the mean of all tosses is also 0.5. The variance of the mean, however, decreases
as n increases. This means that a larger sample size yields a higher precision
for the calculated arithmetic mean. This observation anticipates the basic conclusion
of the next chapter: the larger the sample size, the more confident we can be in our
conclusions.
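A small simulation sketch in R can illustrate (7.35) and (7.36) for a fair coin with p = 0.5: the empirical variance of the mean shrinks in line with p(1 − p)/n as n grows. The number of replications and the seed are chosen arbitrarily.

set.seed(123)
for (n in c(10, 100, 1000)) {
  means <- replicate(5000, mean(rbinom(n, size = 1, prob = 0.5)))  # 5000 simulated means of n tosses
  cat("n =", n, "empirical variance of the mean:", round(var(means), 5),
      "theoretical p(1-p)/n:", round(0.25 / n, 5), "\n")
}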

7.7 Covariance and Correlation


The variance measures the variability of a variable. Similarly, the covariance
measures the covariation or association between X and Y.

7.7.1 Covariance
Definition 7.7.1
The covariance between X and Y is defined as
(7.37)

The covariance is positive if, on average, larger values of X correspond to larger


values of Y; it is negative if, on average, greater values of X correspond to
smaller values of Y.
The probability density function of any bivariate random variable (X, Y) is
characterized by the expectation and variance of both X and Y,

as well as their covariance. We can summarize these features by using the


expectation vector

and the covariance matrix

Important properties of covariance are

(i) ,

(ii) ,

(iii) ,

(iv) where for continuous


variables and for discrete variables,

(v) If X and Y are independent, it follows that , and


therefore, .

Theorem 7.7.1
(Additivity Theorem) The variance of the sum (subtraction) of X and Y is given
by

If X and Y are independent, it follows that and therefore


. We omit the proof of this theorem.

Example 7.7.1
Recall Example 7.6.2 where we considered the waiting time Y for a bus to the
train station and the waiting time X for a train. Suppose their
joint bivariate probability density function can be written as

To calculate the covariance between X and Y, we need to calculate :

We know from Example 7.6.2 that , and


. The covariance is thus

This makes sense as the waiting times for the train and the bus should be
independent of each other. Using rule (7.31), we conclude that the total expected
waiting time is

The respective variance is

due to Theorem 7.7.1.

7.7.2 Correlation Coefficient


Definition 7.7.2
The correlation coefficient of X and Y is defined as
(7.38)

We already know from Chap. 4 that the correlation coefficient is a measure of


the degree of linear relationship between X and Y. It can take values between
and . However, in Chap. 4, we considered the correlation of two
samples, i.e. realizations of random variables; here, we describe the correlation
coefficient of the population. If , then X and Y are said to be
uncorrelated. If there is a perfect linear relationship between X and Y, then
for a positive relationship and for a negative relationship, see Appendix
C.2 for the proof.

Theorem 7.7.2
If X and Y are independent, they are also uncorrelated. However, if they are
uncorrelated then they are not necessarily independent.

Example 7.7.2
In Example 7.7.1, we calculated the covariance between the waiting time for the
bus and the waiting time for the train: . The correlation coefficient is
therefore also 0 indicating no linear relationship between the waiting times for
bus and train.

7.8 Key Points and Further Issues


7.9 Exercises
Exercise 7.1
Consider the following cumulative distribution function of a random variable X:

(a) What is the PDF of X?

(b) Calculate and .


(c) Determine and .

Exercise 7.2
Joey manipulates a die to increase his chances of winning a board game against
his friends. In each round, a die is rolled and larger numbers are generally an
advantage. Consider the random variable X denoting the outcome of the rolled
die and the respective probabilities , and
.

(a) Calculate and interpret the expectation and variance of X.

(b) Imagine that the board game contains an action which makes the players
use 1 / X rather than X. What is the expectation of ? Is
?

Exercise 7.3
An innovative winemaker experiments with new grapes and adds a new wine to
his stock. The percentage sold by the end of the season depends on the weather
and various other factors. It can be modelled using the random variable X with
the CDF as

(a) Plot the cumulative distribution function with R.

(b) Determine f(x).

(c) What is the probability of selling at least one-third of his wine, but not
more than two thirds?
(d) Define the CDF in R and calculate the probability of c) again.

(e) What is the variance of X?

Exercise 7.4
A quality index summarizes different features of a product by means of a score.
Different experts may assign different quality scores depending on their
experience with the product. Let X be the quality index for a tablet. Suppose the
respective probability density function is given as follows:

(a) Determine c such that f(x) is a proper PDF.

(b) Determine the cumulative distribution function.

(c) Calculate the expectation and variance of X.

(d) Use Tschebyschev’s inequality to determine the probability that X does not
deviate more than 0.5 from its expectation.

Exercise 7.5
Consider the joint PDF for the type of customer service X (0 = telephonic
hotline, 1 = email) and the satisfaction score Y (1 = unsatisfied, 2 = satisfied, 3 =
very satisfied):

        Y = 1   Y = 2   Y = 3

X = 0   0       1/2     1/4
X = 1   1/6     1/12    0
(a) Determine and interpret the marginal distributions of both X and Y.

(b) Calculate the 75 % quantile for the marginal distribution of Y.

(c) Determine and interpret the conditional distribution of satisfaction level for
.

(d) Are the two variables independent?

(e) Calculate and interpret the covariance of X and Y.

Exercise 7.6
Consider a continuous random variable X with expectation 15 and variance 4.
Determine the smallest interval which contains at least 90 % of the
values of X.

Exercise 7.7
Let X and Y be two random variables for which only 6 possible events—
—are defined:

(a) What is the joint PDF of X and Y?

(b) Calculate the marginal distributions of X and Y.

(c) Are both variables independent?


(d) Determine the joint PDF for .

(e) Calculate and and compare it with and


, respectively.

Exercise 7.8
Recall the urn model we introduced in Chap. 5. Consider an urn with eight balls:
four of them are white, three are black, and one is red. Now, two balls are drawn
from the urn. The random variables X and Y are defined as follows:

(a) When are X and Y independent—when the two balls are drawn with
replacement or without replacement?

(b) Assume the balls are drawn such that X and Y are dependent. Use the
conditional distribution P(Y|X) to determine the joint PDF of X and Y.

(c) Calculate , and .

Exercise 7.9
If X is the amount of money spent on food and other expenses during a day (in €)
and Y is the daily allowance of a businesswoman, the joint density of these two
variables is given by

(a) Choose c such that is a probability density function.


(b) Find the marginal distribution of X.

(c) Calculate the probability that more than €75 are spent.

(d) Determine the conditional distribution of Y given X.

Exercise 7.10
Consider n i.i.d. random variables with and and the
standardized variable . Show that and .

Source Toutenburg, H., Heumann, C., Induktive Statistik, 4th edition, 2007,
Springer, Heidelberg
© Springer International Publishing Switzerland 2016
Christian Heumann, Michael Schomaker and Shalabh, Introduction to Statistics and Data Analysis ,
DOI 10.1007/978-3-319-46162-5_8

8. Probability Distributions
Christian Heumann1 , Michael Schomaker2 and Shalabh3
(1) Department of Statistics, Ludwig-Maximilians-Universität München,
München, Germany
(2) Centre for Infectious Disease Epidemiology and Research, University of
Cape Town, Cape Town, South Africa
(3) Department of Mathematics and Statistics, Indian Institute of Technology
Kanpur, Kanpur, India

Christian Heumann
Email: Christian.heumann@stat.uni-muenchen.de

We introduced the concept of probability density and probability mass functions


of random variables in the previous chapter. In this chapter, we are introducing
some common standard discrete and continuous probability distributions which
are widely used for either practical applications or constructing statistical
methods described later in this book. Suppose we are interested in determining
the probability of a certain event. The determination of probabilities depends
upon the nature of the study and various prevailing conditions which affect it.
For example, the determination of the probability of a head when tossing a coin
is different from the determination of the probability of rain in the afternoon.
One can speculate that some mathematical functions can be defined which depict
the behaviour of probabilities under different situations. Such functions have
special properties and describe how probabilities are distributed under different
conditions. We have already learned that they are called probability distribution
functions. The form of such functions may be simple or complicated depending
upon the nature and complexity of the phenomenon under consideration. Let us
first recall and extend the definition of independent and identically distributed
random variables:

Definition 8.0.1
The random variables are called independent and identically
distributed (i.i.d.) if they all have the same marginal cumulative
distribution function F(x) and if they are mutually independent.

Example 8.0.1
Suppose a researcher plans a survey on the weight of newborn babies in a
country. The researcher randomly contacts 10 hospitals with a maternity ward
and asks them to randomly select 20 of the newborn babies (no twins) born in
the last 6 months and records their weights. The sample therefore consists of
baby weights. Since the hospitals and the babies are randomly
selected, the babies’ weights are therefore not known beforehand. The 200
weights can be denoted by the random variables . Note that the
weights are random variables because, depending on the size of the
population, different samples consisting of 200 babies can be randomly selected.
Also, the babies’ weights can be seen as stochastically independent (an example
of stochastically dependent weights would be the weights of twins if they are
included in the sample). After collecting the weights of 200 babies, the
researcher has a sample of 200 realized values (i.e. the weights in grams). The
values are now known and denoted by .

8.1 Standard Discrete Distributions


First, we discuss some standard distributions for discrete random variables.

8.1.1 Discrete Uniform Distribution


The discrete uniform distribution assumes that all possible outcomes have equal
probability of occurrence. A more formal definition is given as follows:

Definition 8.1.1
A discrete random variable X with k possible outcomes is said to
follow a discrete uniform distribution if the probability mass function (PMF) of
X is given by
(8.1)
If the outcomes are the natural numbers ( ), the mean and
variance of X are obtained as

(8.2)

(8.3)

Example 8.1.1
If we roll a fair die, the outcomes “1”, “2”, , “6” have equal probability of
occurring, and hence, the random variable X “number of dots observed on the
upper surface of the die” has a uniform discrete distribution with PMF

The mean and variance of X are

Using the function sample() in R, it is easy to generate random numbers from


a discrete uniform distribution. The following command generates a random
sample of size 1000 from a uniform distribution with the four possible outcomes
2, 5, 8, 10 and draws a bar chart of the observed numbers. The use of the
set.seed() function allows us to reproduce the generated random numbers at
any time. It is necessary to use the option replace=TRUE to simulate draws
with replacement, i.e. to guarantee that a value can occur more than once.
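A sketch of such a command sequence is given below; the seed value is arbitrary and only ensures reproducibility.

set.seed(123)                                              # makes the draws reproducible
x <- sample(c(2, 5, 8, 10), size = 1000, replace = TRUE)   # 1000 draws with replacement
barplot(table(x))                                          # bar chart of the observed frequencies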
Fig. 8.1 Frequency distribution of 1000 generated discrete uniform random numbers with possible
outcomes (2, 5, 8, 10)

A bar chart of the frequency distribution of the 1000 sampled numbers with
the possible outcomes (2, 5, 8, 10) using the discrete uniform distribution is
given in Fig. 8.1. We see that the 1000 generated random numbers are not
exactly uniformly distributed, e.g. the numbers 5 and 10 occur more often than
the numbers 2 and 8. In fact, they are only approximately uniform. We expect
that the deviance from a perfect uniform distribution is getting smaller as we
generate more and more random numbers but will probably never be zero for a
finite number of draws. The random numbers reflect the practical situation that a
sample distribution is only an approximation to the theoretical distribution from
which the sample was drawn. More details on how to work with random
variables in R are given in Appendix A.3.

8.1.2 Degenerate Distribution


Definition 8.1.2
A random variable X has a degenerate distribution at a, if a is the only possible
outcome with . The CDF in such a case is given by

Further, and .

The degenerate distribution indicates that there is only one possible fixed
outcome, and therefore, no randomness is involved. It follows that we need at
least two different possible outcomes to have randomness in the observations of
a random variable or random experiment. The Bernoulli distribution is such a
distribution where there are only two outcomes, e.g. success and failure or male
and female. These outcomes are usually denoted by the values “0” and “1”.

8.1.3 Bernoulli Distribution


Definition 8.1.3
A random variable X has a Bernoulli distribution if the PMF of X is given as

The cumulative distribution function (CDF) of X is

The mean (expectation) and variance of a Bernoulli random variable are


calculated as
(8.4)
and
(8.5)
respectively.

A Bernoulli distribution is useful when there are only two possible outcomes,
and our interest lies in any of the two outcomes, e.g. whether a customer buys a
certain product or not, or whether a hurricane hits an island or not. The outcome
of an event A is usually coded as 1 which occurs with probability p. If the event
of interest does not occur, i.e. the complementary event occurs, the outcome is
coded as 0 which occurs with probability . So p is the probability that the
event of interest A occurs.

Example 8.1.2
A company organizes a raffle at an end-of-year function. There are 300 lottery
tickets in total, and 50 of them are marked as winning tickets. The event A of
interest is “ticket wins” (coded as ), and the probability p of having a
winning ticket is a priori (i.e. before any lottery ticket has been drawn)
According to (8.4) and (8.5), the mean (expectation) and variance of X are

8.1.4 Binomial Distribution


Consider n independent trials or repetitions of a Bernoulli experiment. In each
trial or repetition, we may observe either A or . At the end of the experiment,
we have thus observed A between 0 and n times. Suppose we are interested in
the probability of A occurring k times, then the binomial distribution is useful.

Example 8.1.3
Consider a coin tossing experiment where a coin is tossed ten times and the
event of interest is “head”. The random variable X “number of heads in 10
experiments” has the possible outcomes . A question of interest
may be: What is the probability that a head occurs in 7 out of 10 trials; or in 5
out of 10 trials? We assume that the order in which heads (and tails) appear is
not of interest, only the total number of heads is of interest.

Questions of this kind are answered by the binomial distribution. This


distribution can either be motivated as a repetition of n Bernoulli experiments (as
in the above coin tossing example) or by the urn model (see Chap. 5): assume
there are M white and black balls in the urn. Suppose n balls are drawn
randomly from the urn, the colour of the ball is recorded and the ball is placed
back into the urn (sampling with replacement). Let A be the event of interest that
a white ball is drawn from the urn. The probability of A is (the
probability of drawing a black ball is ). Since the balls are
drawn with replacement, these probabilities do not change from draw to draw.
Further, let X be the random variable counting the number of white balls drawn
from the urn in the n experiments. Since the order of the resulting colours of
balls is not of interest in this case, there are combinations where k balls are

white and balls are black. Since the balls are drawn with replacement,
every outcome of the n experiments is independent of all others. The probability
that , can therefore be calculated as
(8.6)
Please note that we can use the product because the draws are

independent. The binomial coefficient is necessary to count the number of

possible orders of the black and white balls.

Definition 8.1.4
A discrete random variable X is said to follow a binomial distribution with
parameters n and p if its PMF is given by (8.6). We also write . The
mean and variance of a binomial random variable X are given by
(8.7)
(8.8)

Remark 8.1.1
A Bernoulli random variable is therefore B(1; p) distributed.

Example 8.1.4
Consider an unfair coin where the probability of observing a tail (T) is
. Let us denote tails by “1” and heads by “0”. Suppose the coin is tossed three
times. In total, there are the following possible outcomes:

Note that the first outcome, viz. (1, 1, 1), leads to x = 3; the next 3 outcomes,
viz. (1, 1, 0), (1, 0, 1), (0, 1, 1), obtained by ( ), lead to x = 2; the next 3
outcomes, viz. (1, 0, 0), (0, 1, 0), (0, 0, 1), obtained by ( ), lead to x = 1;
and the last outcome, viz. (0, 0, 0), obtained by ( ), leads to x = 0. We can,

for example, calculate

Further, the mean and variance of X are

Functions for the binomial distribution, as well as many other distributions, are
implemented in R. For each of these distributions, we can easily determine the
density function (PMF, PDF) for given values and parameters, determine the
CDF, calculate quantiles and draw random numbers. Appendix A.3 gives more
details. Nevertheless, we illustrate the concept of dealing with distributions in R
in the following example.

Example 8.1.5
Suppose we toss an unfair coin 50 times, with the probability of observing a tail being 0.6.
We thus deal with a B(50, 0.6) distribution which can be plotted using the
dbinom command. The prefix d stands for “density”.
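A sketch of such a plot command:

n <- 50; p <- 0.6
plot(0:n, dbinom(0:n, size = n, prob = p), type = "h",
     xlab = "x", ylab = "Probability")   # PMF of B(50, 0.6)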

A plot of the PMF of a binomial distribution with and (i.e.


B(50, 0.6)) is given in Fig. 8.2.
Fig. 8.2 PMF of a B(50, 0.6) distribution

Note that we can also calculate the CDF with R. We can use the
pbinom(x,n,p) command, where the prefix p stands for probability, to
calculate the CDF at any point. For example, suppose we are interested in
, that is the probability of observing thirty or more tails; then
we write
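Noting that P(X ≥ 30) = 1 − P(X ≤ 29), one possible call is the following; the stated result is approximate.

1 - pbinom(29, size = 50, prob = 0.6)   # P(X >= 30), approx. 0.56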

Similarly, we can determine quantiles. For instance, the 80 % quantile q


which describes that can be obtained by the qbinom(q,n,p)
command as follows:
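A sketch of this call; qbinom returns the smallest integer q with P(X ≤ q) ≥ 0.8.

qbinom(0.8, size = 50, prob = 0.6)   # gives 33 for the B(50, 0.6) distribution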

If we want to generate 100 random realizations from a B(50, 0.6) distribution


we can use the rbinom command.
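A sketch of such a call; the seed value is arbitrary and only ensures reproducibility.

set.seed(123)
rbinom(100, size = 50, prob = 0.6)   # 100 random draws from B(50, 0.6)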

The binomial distribution has some nice properties. One of them is described in
the following theorem:

Theorem 8.1.1
Let and and assume that X and Y are (stochastically)
independent. Then
(8.9)

This is intuitively clear since we can interpret this theorem as describing the
additive combination of two independent binomial experiments with n and m
trials, with equal probability p, respectively. Since every binomial experiment is
a series of independent Bernoulli experiments, this is equivalent to a series of
n + m independent Bernoulli trials with constant success probability p, which in
turn is equivalent to a binomial distribution with n + m trials.

8.1.5 Poisson Distribution


Consider a situation in which the number of events is very large and the
probability of success is very small: for example, the number of alpha particles
emitted by a radioactive substance entering a particular region in a given short
time interval. Note that the number of emitted alpha particles is very high but
only a few particles are transmitted through the region in a given short time
interval. Some other examples where Poisson distributions are useful are the
number of flu cases in a country within one year, the number of tropical storms
within a given area in one year, or the number of bacteria found in a biological
investigation.

Definition 8.1.5
A discrete random variable X is said to follow a Poisson distribution with
parameter if its PMF is given by

(8.10)
We also write . The mean and variance of a Poisson random variable
are identical:

Example 8.1.6
Suppose a country experiences tropical storms on average per year. Then
the probability of suffering from only two tropical storms is obtained by using
the Poisson distribution as
If we are interested in the probability that not more than 2 storms are
experienced, then we can apply rules (7.7)–(7.13) from Chap. 7:
. We can calculate
and from (8.10) or using R. Similar to Example 8.1.5, we use
the prefix d to obtain the PMF and the prefix p to work with the CDF, i.e. we
can use and to determine and ,
respectively.
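A sketch of these calls, assuming for illustration a parameter value of 5 (the values quoted in the comments hold only for this assumed parameter):

lambda <- 5               # illustrative value for the Poisson parameter
dpois(2, lambda)          # P(X = 2), approx. 0.084
ppois(2, lambda)          # P(X <= 2), approx. 0.125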

8.1.6 Multinomial Distribution


We now consider random experiments where k distinct or disjoint events
can occur with probabilities , respectively, with the
restriction . For example, if eight parties compete in a political
election, we may be interested in the probability that a person votes for party
. Similarly, one might be interested in whether
tuberculosis is detected in the lungs ( ), in other organs ( ), or in both ( ).
Practically, we often use the multinomial distribution to model the distribution of
categorical variables. This can be interpreted as a generalization of the binomial
distribution (where only two distinct events can occur) to the situation where
more than two events or outcomes can occur. If the experiment is repeated n
times independently, we are interested in the probability that

with . Since several events can occur, the outcome of one (of the n)
experiments is conveniently described by binary indicator variables. Let
, denote the event “ is observed in experiment i”, i.e.

with probabilities ; then, the outcome of one


experiment is a vector of length k,

with “1” being present in only one position, i.e. in position j, if occurs in
experiment i. Now, define (for each ) . Then, is
counting how often event was observed in the n independent experiments (i.e.
how often was 1 in the n experiments).

Definition 8.1.6
The random vector is said to follow a multinomial
distribution if its PMF is given as

(8.11)
with the restrictions and . We also write .
The mean of is the (component-wise) vector

The (i, j)th element of the covariance matrix is

Remark 8.1.2
Due to the restriction that are not stochastically
independent, which is reflected by the negative covariance. This is also
intuitively clear: if one gets higher, another , has to become lower to
satisfy the restrictions.

We use the multinomial distribution to describe the randomness of categorical


variables. Suppose we are interested in the variable “political party”; there might
be eight political parties, and we could thus summarize this variable by eight
binary variables, each of them describing the event of party ,
being voted for. In this sense, follows a multinomial
distribution.

Example 8.1.7
Consider a simple example of the urn model. The urn contains 50 balls of three
colours: 25 red balls, 15 white balls, and 10 black balls. The balls are drawn
from the urn with replacement. The balls are placed back into the urn after every
draw, which means the draws are independent. Therefore, the probability of
drawing a red ball in every draw is 25/50 = 0.5. Analogously, the probability is 15/50 = 0.3 for white
balls and 10/50 = 0.2 for black balls. Consider n = 4 draws. The probability of the
random event of drawing “2 red balls, 1 white ball, and 1 black ball” is:

(8.12)
We would have obtained the same result in R using the dmultinom function:
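A sketch of this call, with the probabilities 0.5, 0.3, and 0.2 from above:

dmultinom(x = c(2, 1, 1), prob = c(0.5, 0.3, 0.2))   # = 0.18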

This example demonstrates that the multinomial distribution relates to an


experiment with replacement and without considering the order of the draws.
Instead of the urn model, consider another example where we may interpret
these three probabilities as probabilities of voting for candidate , in
an election. Now, suppose we ask four voters about their choice, then the
probability of candidate receiving 2 votes, candidate receiving 1 vote, and
candidate receiving 1 vote is 18 % as calculated in (8.12).

Remark 8.1.3
In contrast to most of the distributions, the CDF of the multinomial distribution,
i.e. the function calculating , is not contained in the
base R-distribution. Please note that for , the multinomial distribution
reduces to the binomial distribution.

8.1.7 Geometric Distribution


Consider a situation in which we are interested in determining how many
independent Bernoulli trials are needed until the event of interest occurs for the
first time. For instance, we may be interested in how many tickets to buy in a
raffle until we win for the first time, or how many different drugs to try to
successfully tackle a severe migraine, etc. The geometric distribution can be
used to determine the probability that the event of interest happens at the kth trial
for the first time.
Definition 8.1.7
A discrete random variable X is said to follow a geometric distribution with
parameter p if its PMF is given by
(8.13)
The mean (expectation) and variance are given by and
, respectively.

Example 8.1.8
Let us consider an experiment where a coin is tossed until “head” is obtained for
the first time. The probability of getting a head is for each toss. Using
(8.13), we can determine the following probabilities:

Using the command structure for obtaining PMF’s in R (Appendix A as well as


Examples 8.1.5 and 8.1.6), we can determine the latter probability of as
follows:
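As an illustration, assuming a fair coin (p = 0.5), the probability that the first head occurs on the third toss can be computed as:

dgeom(3 - 1, prob = 0.5)   # P(X = 3) = 0.5^3 = 0.125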

Note that the definition of X in R slightly differs from our definition. In R, k


is the number of failures before the first success. This means we need to specify
in the dgeom function rather than k. The mean and variance for this setting
are

8.1.8 Hypergeometric Distribution


We can again use the urn model to motivate another distribution, the
hypergeometric distribution. Consider an urn with

M white balls,
N − M black balls,
N total balls.

We randomly draw n balls without replacement,
i.e. we do not place a ball back into the urn once it is drawn. The order in
which the balls are drawn is assumed to be of no interest; only the number of
drawn white balls is of relevance. We define the following random variable

To be more precise, among the n drawn balls, x are white and n − x are black.

There are possibilities to choose x white balls from the total of M white

balls, and analogously, there are possibilities to choose black

balls from the total of black balls. In total, we draw n out of N balls.
Recall the probability definition of Laplace as the number of simple favourable
events divided by all possible events. The number of combinations for all
possible events is ; the number of favourable events is

because we draw, independent of each other, x out of M balls and out of


balls. Hence, the PMF of the hypergeometric distribution is

(8.14)

for .

Definition 8.1.8
A random variable X is said to follow a hypergeometric distribution with
parameters n, M, N, i.e. , if its PMF is given by (8.14).

Example 8.1.9
The German national lottery draws 6 out of 49 balls from a rotating bowl. Each
ball is associated with a number between 1 and 49. A simple bet is to choose 6
numbers between 1 and 49. If 3 or more chosen numbers correspond to the
numbers drawn in the lottery, then one wins a certain amount of money. What is
the probability of choosing 4 correct numbers? We can utilize the
hypergeometric distribution with , and to calculate such
probabilities. The interpretation is that we “draw” (i.e. bet on) 4 out of the 6
winning balls and “draw” (i.e. bet on) another 2 out of the remaining 43 ( )
losing balls. In total, we draw 6 out of 49 balls. Calculating the number of the
favourable combinations and all possible combinations leads to the application
of the hypergeometric distribution as follows:

We would have obtained the same results using the dhyper command. Its
arguments are x, M, N − M, and n, and thus, we specify
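In R's notation, m is the number of white (winning) balls, n the number of black (losing) balls, and k the number of draws, so a sketch of the call is:

dhyper(x = 4, m = 6, n = 43, k = 6)   # approx. 0.00097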

The H(6, 43, 6) distribution is also visualized in Fig. 8.3. It is evident that the
cumulative probability of choosing 2 or fewer correct numbers is greater than
0.9 (or 90 %), but it is very unlikely to have 3 or more numbers right. This may
explain why the national lottery pays out money only for 3 or more correct
numbers.

Fig. 8.3 The H(6, 43, 6) distribution

8.2 Standard Continuous Distributions


Now, we discuss some standard probability distributions of (absolute)
continuous random variables. Characteristics of continuous random variables are
that the number of possible outcomes is uncountably infinite and that they have a
continuous distribution function F(x). It follows that the point probabilities are
zero, i.e. . Further, we assume a unique density function f exists,
such that .

8.2.1 Continuous Uniform Distribution


A continuous analogue to the discrete uniform distribution is the continuous
uniform distribution on a closed interval in .

Definition 8.2.1
A continuous random variable X is said to follow a (continuous) uniform
distribution in the interval [a, b], i.e. , if its probability density
function (PDF) is given by

The mean and variance of are

respectively.

Example 8.2.1
Suppose a train arrives at a subway station regularly every 10 min. If a passenger
arrives at the station without knowing the timetable, then the waiting time to
catch the train is uniformly distributed with density

The “average” waiting time is (0 + 10)/2 = 5 min. The probability of


waiting for the train for less than 3 min is obviously 0.3 (or 30 %) which can be
calculated in R using the punif(x,a,b) command (see also Appendix A.3):
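A sketch of the call:

punif(3, min = 0, max = 10)   # = 0.3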
8.2.2 Normal Distribution
The normal distribution is one of the most important distributions used in
statistics. The name was given by Carl Friedrich Gauss (1777–1855), a German
mathematician, astronomer, geodesist, and physicist who observed that
measurements in geodesy and astronomy randomly deviate in a symmetric way
from their true values. The normal distribution is therefore also often called a
Gaussian distribution.

Definition 8.2.2
A random variable X is said to follow a normal distribution with parameters
and if its PDF is given by

(8.15)
We write . The mean and variance of X are

respectively. If and , then X is said to follow a standard normal


distribution , . The PDF of a standard normal distribution is given by

The density of a normal distribution has its maximum (see Fig. 8.4) at x = μ. The
density is also symmetric around μ. The inflexion points of the density are at
μ − σ and μ + σ (Fig. 8.4). A lower σ indicates a higher concentration around
the mean μ. A higher σ indicates a flatter density (Fig. 8.5).
Fig. 8.4 PDF of a normal distribution

Fig. 8.5 PDF of N(0, 2), N(0, 1) and N(0, 0.5) distributions

The cumulative distribution function of is

(8.16)
which is often denoted as . The value of for various values of x can
be obtained in R following the rules introduced in Appendix A.3. For example, the
command shown below calculates Φ(1.96) as approximately 0.975. This means that, for a standard normal
distribution, the probability P(X ≤ 1.96) is approximately 0.975.
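In R, this can be checked as follows:

pnorm(1.96)   # approx. 0.975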

Remark 8.2.1
There is no explicit formula to solve the integral in Eq. (8.16). It has to be solved
by numerical (or computational) methods. This is the reason why CDF tables are
presented in almost all statistical textbooks, see Table C.1 in Appendix C.

Example 8.2.2
An orange farmer sells his oranges in wooden boxes. The weights of the boxes
vary and are assumed to be normally distributed with kg and kg .
The farmer wants to avoid customers being unsatisfied because the boxes are too
low in weight. He therefore asks the following question: What is the probability
that a box with a weight of less than 13 kg is sold? Using the
command in R, we get

To calculate the probability in Example 8.2.2 manually, we first have to


introduce some theoretical results.
Calculation rules for normal random variables.
Let . Using the transformation

(8.17)
every normally distributed random variable can be transformed into a
standard normal random variable. We call this transformation the Z-
transformation. We can use this transformation to derive convenient calculation
rules. The probability for is

(8.18)
Consequently, the probability for is

(8.19)
The probability that X realizes a value in the interval [a, b] is

(8.20)
Because of the symmetry of the probability density function around its
mean 0, the following equation holds for the distribution function of a
standard normal random variable for any value a:
(8.21)
It follows that , see also Fig. 8.6.
Fig. 8.6 Distribution function of the standard normal distribution

Example 8.2.3
Recall Example 8.2.2 where a farmer sold his oranges. He was interested in
for . Using (8.17), we get

To obtain , we could either use R (pnorm(4/3)) or use Table


C.1.

Distribution of the Arithmetic Mean.


Assume that . Consider a random sample of
independent and identically distributed random variables with .
Then, the arithmetic mean follows a normal distribution with mean

and variance
(8.22)
where . In summary, we get

Remark 8.2.2
In fact, in Eq. (8.22), we have used the fact that the sum of normal random
variables also follows a normal distribution, i.e.

This result can be generalized to n (not necessarily identically distributed but


independent) normal random variables. In fact, it holds that if are
independent normal variables with means and variances
, then for any real numbers , it holds that

In general, it cannot be taken for granted that the sum of two random variables
follows the same distribution as the two variables themselves. As an example,
consider the sum of two independent uniform distributions with and
. It holds that and
, but is obviously not uniformly distributed.
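A short simulation sketch, with illustrative interval bounds of [0, 10] for both variables, makes this visible:

set.seed(123)
x <- runif(10000, min = 0, max = 10)
y <- runif(10000, min = 0, max = 10)
hist(x + y, breaks = 50)   # triangular shape, not flat: the sum of two uniforms is not uniform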

8.2.3 Exponential Distribution


The exponential distribution is useful in many situations, for example when one
is interested in the waiting time, or lifetime, until an event of interest occurs. If
we assume that the future lifetime is independent of the lifetime that has already
taken place (i.e. no “ageing” process is working), the waiting times can be
considered to be exponentially distributed.

Definition 8.2.3
A random variable X is said to follow an exponential distribution with parameter
if its PDF is given by
(8.23)
We write . The mean and variance of an exponentially distributed
random variable X are

respectively. The CDF of the exponential distribution is given as

(8.24)

Note that ( ). An interesting property of the


exponential distribution is its memorylessness : if time t has already been
reached, the probability of reaching a time greater than does not depend on
t. This can be written as

The result can be derived using basic probability rules as follows:

For example, suppose someone stands in a supermarket queue for t minutes. Say
the person forgot to buy milk, so she leaves the queue, gets the milk, and stands
in the queue again. If we use the exponential distribution to model the waiting
time, we say that it does not matter what time it is: the random variable “waiting
time from standing in the queue until paying the bill” is not influenced by how
much time has elapsed already; it does not matter if we queued before or not.
Please note that the memorylessness property is shared by the geometric and the
exponential distributions.
There is also a relationship between the Poisson and the exponential
distribution:

Theorem 8.2.1
The number of events Y occurring within a continuum of time is Poisson
distributed with parameter if and only if the time between two events is
exponentially distributed with parameter .
The continuum of time depends on the problem at hand. It may be a second, a
minute, 3 months, a year, or any other time period.

Example 8.2.4
Let Y be the random variable which counts the “number of accesses per second
for a search engine”. Assume that Y is Poisson distributed with parameter
( ). The random variable X, “waiting time until the next
access”, is then exponentially distributed with parameter . We therefore
get

In this example, the continuum is 1 s. The expected number of accesses per


second is therefore , and the expected waiting time between two
accesses is s. The probability of experiencing a waiting time of less
than 0.1 s is

In R, we can obtain the same result as
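A sketch, assuming for illustration a rate of 10 accesses per second:

pexp(0.1, rate = 10)   # P(X < 0.1) = 1 - exp(-1), approx. 0.63 for this assumed rate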

8.3 Sampling Distributions


All the distributions introduced in this chapter up to now are motivated by
practical applications. However, there are theoretical distributions which play an
important role in the construction and development of various statistical tools
such as those introduced in Chaps. 9–11. We call these distributions “sampling
distributions”. Now, we discuss the -, t-, and F-distributions.

8.3.1 -Distribution
Definition 8.3.1
Let be n independent and identically N (0, 1)-distributed random
variables. The sum of their squares, , is then -distributed with n
degrees of freedom and is denoted as . The PDF of the -distribution is given
in Eq. (C.7) in Appendix C.3.

The -distribution is not symmetric. A -distributed random variable can only


realize values greater than or equal to zero. Figure 8.7a shows the -, -, and
-distributions. It can be seen that the “degrees of freedom” specify the shape
of the distribution. Their interpretation and meaning will nevertheless become
clearer in the following chapters. The quantiles of the CDF of different -
distributions can be obtained in R using the qchisq(p,df) command. They
are also listed in Table C.3 for different values of n.

Theorem 8.3.1
Consider two independent random variables which are - and -distributed,
respectively. The sum of these two random variables is -distributed.

An important example of a -distributed random variable is the sample


variance of an i.i.d. sample of size n from a normally distributed population,
i.e.

(8.25)

8.3.2 t-Distribution
Definition 8.3.2
Let X and Y be two independent random variables where and .
The ratio

follows a t -distribution (Student’s t-distribution) with n degrees of freedom.


The PDF of the t-distribution is given in Eq. (C.8) in Appendix C.3.

Figure 8.7b visualizes the -, -, and -distributions. The quantiles of different


t-distributions can be obtained in R using the qt(p,df) command. They are
also listed in Table C.2 for different values of n.
Fig. 8.7 Probability density functions of and t distributions

An application of the t-distribution is the following: if we draw a sample of


size n from a normal population and calculate the arithmetic mean
and the sample variance , then the following theorem holds:

Theorem 8.3.2
(Student’s theorem) Let with . The ratio

(8.26)

is then t-distributed with degrees of freedom.

Fig. 8.8 Probability density functions for different F-distributions


8.3.3 F-Distribution
Definition 8.3.3
Let X and Y be independent and -distributed random variables, then the
distribution of the ratio

(8.27)
follows the Fisher F -distribution with (m, n) degrees of freedom. The PDF of
the F-distribution is given in Eq. (C.9) in Appendix C.3.

If X is a -distributed random variable, then the ratio (8.27) is -distributed.


The square root of this ratio is -distributed since the square root of a -
distributed random variable is N(0, 1)-distributed. If W is F-distributed, ,
then 1 / W is -distributed. Figure 8.8 visualizes the and
distributions. The quantiles of different F-distributions can be obtained in R
using the qf(p,df1,df2) command.
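For example, with arbitrarily chosen degrees of freedom:

qchisq(0.95, df = 5)          # 95 % quantile of the chi-square distribution with 5 df, approx. 11.07
qt(0.975, df = 10)            # 97.5 % quantile of the t-distribution with 10 df, approx. 2.23
qf(0.95, df1 = 5, df2 = 10)   # 95 % quantile of the F(5, 10)-distribution, approx. 3.33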
One application of the F-distribution relates to the ratio of two sample
variances of two independent samples of size m and n, where each sample is an
i.i.d. sample from a normal population, i.e. and . For the
sample variances and from the two
populations, the ratio

is F-distributed with degrees of freedom in the numerator and


degrees of freedom in the denominator.

8.4 Key Points and Further Issues


8.5 Exercises
Exercise 8.1
A company producing cereals offers a toy in every sixth cereal package in
celebration of their 50th anniversary. A father immediately buys 20 packages.

(a) What is the probability of finding 4 toys in the 20 packages?

(b) What is the probability of finding no toy at all?

(c) Suppose the 20 packages contain three toys in total. What is the probability that, among the 5
packages that are given to the family’s youngest daughter, she finds two
toys?

Exercise 8.2
A study on breeding birds collects information such as the length of their eggs
(in mm). Assume that the length is normally distributed with mm and
. What is the probability of

(a) finding an egg with a length greater than 50 mm?

(b) finding an egg between 30 and 40 mm in length?


Calculate the results both manually and by using R.

Exercise 8.3
A dodecahedron is a die with 12 sides. Suppose the numbers on the die are 1–12.
Consider the random variable X which describes which number is shown after
rolling the die once. What is the distribution of X? Determine and .

Exercise 8.4
Felix states that he is able to distinguish a freshly ground coffee blend from an
ordinary supermarket coffee. One of his friends asks him to taste 10 cups of
coffee and tell him which coffee he has tasted. Suppose that Felix has actually no
clue about coffee and simply guesses the brand. What is the probability of at
least 8 correct guesses?
Exercise 8.5
An advertising board is illuminated by several hundred bulbs. Some of the bulbs
are fused or smashed regularly. If there are more than 5 fused bulbs on a day, the
owner of the board replaces them, otherwise not. Consider the following data
collected over a month which captures the number of days ( ) on which i bulbs
were broken:

Fused bulbs      0   1   2   3   4   5
Number of days   6   8   8   5   2   1

(a) Suggest an appropriate distribution for X: “number of broken bulbs per


day”.

(b) What is the average number of broken bulbs per day? What is the variance?

(c) Determine the probabilities using the distribution you chose in (a)
and using the average number of broken bulbs you calculated in (b).
Compare the probabilities with the proportions obtained from the data.

(d) Calculate the probability that at least 6 bulbs are fused, which means they
need to be replaced.

(e) Consider the random variable Y: “time until next bulb breaks”. What is the
distribution of Y?

(f) Calculate and interpret .

Exercise 8.6
Marco’s company organizes a raffle at an end-of-year function. There are 4000
raffle tickets to be sold, of which 500 win a prize. The price of each ticket is
€1.50. The value of the prizes, which are mostly electrical appliances produced
by the company, varies between €80 and €250, with an average value of €142.

(a) Marco wants to have a 99 % guarantee of receiving three prizes. How


much money does he need to spend? Use R to solve the question.
(b) Use R to plot the function which describes the relationship between the
number of tickets bought and the probability of winning at least three
prizes.

(c) Given the value of the prizes and the costs of the tickets, is it worth taking
part in the raffle?

Exercise 8.7
A country has a ratio between male and female births of 1.05 which means that
51.22 % of babies born are male.

(a) What is the probability for a mother that the first girl is born during the first
three births?

(b) What is the probability of getting 2 girls among 4 babies?

Exercise 8.8
A fisherman catches, on average, three fish per hour. Let Y be a random
variable denoting the number of fish caught in one hour and let X be the time
interval between catching two fish. We assume that X follows an exponential
distribution.

(a) What is the distribution of Y?

(b) Determine and .

(c) Calculate and .

Exercise 8.9
A restaurant sells three different types of dessert: chocolate brownies, yogurt
with seasonal fruits, and lemon tart. Years of experience have shown that the
probabilities with which the desserts are chosen are 0.2, 0.3, and 0.5,
respectively.
(a) What is the probability that out of 5 guests, 2 guests choose brownies, 1
guest chooses yogurt, and the remaining 2 guests choose lemon tart?

(b) Suppose two out of the five guests are known to always choose lemon tart.
What is the probability of the others choosing lemon tart as well?

(c) Determine the expectation and variance assuming a group of 20 guests.

Exercise 8.10
A reinsurance company works on a premium policy for natural disasters. Based
on experience, it is known that “number of natural disasters from October
to March” (winter) is Poisson distributed with . Similarly, the random
variable “number of natural disasters from April to September” (summer) is
Poisson distributed with . Determine the probability that there is at least 1
disaster during both summer and winter based on the assumption that the two
random variables are independent.

Exercise 8.11
Read Appendix C.3 to learn about the Theorem of Large Numbers and the
Central Limit Theorem.

(a) Draw 1000 realizations from a standard normal distribution using R and
calculate the arithmetic mean. Repeat this process 1000 times. Evaluate the
distribution of the arithmetic mean by drawing a kernel density plot and by
calculating the mean and variance of it.

(b) Repeat the procedure in (a) with an exponential distribution with .


Interpret your findings in the light of the Central Limit Theorem.

(c) Repeat the procedure in (b) using 10,000 rather than 1000 realizations.
How do the results change and why?
Source Toutenburg, H., Heumann, C., Induktive Statistik, 4th edition, 2007,
Springer, Heidelberg
Part III
Inductive Statistics
© Springer International Publishing Switzerland 2016
Christian Heumann, Michael Schomaker and Shalabh, Introduction to Statistics and Data Analysis ,
DOI 10.1007/978-3-319-46162-5_9

9. Inference
Christian Heumann1 , Michael Schomaker2 and Shalabh3
(1) Department of Statistics, Ludwig-Maximilians-Universität München,
München, Germany
(2) Centre for Infectious Disease Epidemiology and Research, University of
Cape Town, Cape Town, South Africa
(3) Department of Mathematics and Statistics, Indian Institute of Technology
Kanpur, Kanpur, India

Christian Heumann
Email: Christian.heumann@stat.uni-muenchen.de

9.1 Introduction
The first four chapters of this book illustrated how one can summarize a data set
both numerically and graphically. The validity of interpretations made from such
a descriptive analysis is valid only for the data set under consideration and
cannot necessarily be generalized to other data. However, it is desirable to make
conclusions about the entire population of interest and not only about the sample
data. In this chapter, we describe the framework of statistical inference, which
allows us to draw conclusions from the sample data about the population of interest,
at a given prespecified uncertainty level, using knowledge about the random process
generating the data.
Consider an example where the objective is to forecast an election outcome.
This requires us to determine the proportion of votes that each of the k
participating parties is going to receive, i.e. to calculate or estimate .
If it is possible to ask every voter about their party preference, then one can
simply calculate the proportions for each party. However, it is
logistically impossible to ask all eligible voters (which form the population in
this case) about their preferred party. It seems more realistic to ask only a small
fraction of voters and infer from their responses to the responses of the whole
population. It is evident that there might be differences in responses between the
sample and the population—but the more voters are asked, the closer we are to
the population’s preference, i.e. the higher the precision of our estimates for
(the meaning of “precision” will become clearer later in this
chapter). Also, it is intuitively clear that the sample must be a representative
sample of the voters’ population to avoid any discrepancy or bias in the
forecasting. When we speak of a representative sample, we mean that all the
characteristics present in the population are contained in the sample too. There
are many ways to get representative random samples. In fact, there is a branch of
statistics, called sampling theory, which studies this subject [see, e.g. Groves et
al. (2009) or Kauermann and Küchenhoff (2011) for more details]. A simple
random sample is one where each voter has an equal probability of being
selected in the sample and each voter is independently chosen from the same
population. In the following, we will assume that all samples are simple random
samples. To further formalize the election forecast problem, assume that we are
interested in the true proportions which each party receives on the election day.
It is practically impossible to make a perfect prediction of these proportions
because there are too many voters to interview, and moreover, a voter may
make their final decision only when casting the vote and not
before. The voter may change his/her opinion at any moment, and the actual vote may differ
from what he/she claimed earlier. In statistics, we call these true proportions
parameters of the population. The task is then to estimate these parameters on
the basis of a sample. In the election example, the intuitive estimates for the
proportions in the population are the proportions in the sample and we call them
sample estimates. How to find good and precise estimates are some of the
challenges that are addressed by the concept of statistical inference. Now, it is
possible to describe the election forecast problem in a statistical and operational
framework: estimate the parameters of a population by calculating the sample
estimates. An important property of every good statistical inference procedure is
that it provides not only estimates for the population parameters but also
information about the precision of these estimates.
Consider another example in which we would like to study the distribution of
weight of children in different age categories and get an understanding of the
“normal” weight. Again, it is not possible to measure the weight of all the
children of a specific age in the entire population of children in a particular
country. Instead, we draw a random sample and use methods of statistical
inference to estimate the weight of children in each age group. More specifically,
we have several populations in this problem. We could consider all boys of a
specific age and all girls of a specific age as two different populations. For
example, all 3-year-old boys will form one possible population. Then, a random
sample is drawn from this population. It is reasonable to assume that the
distribution of the weight of k-year-old boys follows a normal distribution with
some unknown parameters and . Similarly, another population of k-year-
old girls is assumed to follow a normal distribution with some unknown
parameters and . The indices kb and kg are used to emphasize that the
parameters may vary by age and gender. The task is now to calculate the
estimates of the unknown parameters (in the population) of the normal
distributions from the samples. Using quantiles, a range of “normal” weights can
then be specified, e.g. the interval from the 1 % quantile to the 99 % quantile of
the estimated normal distribution or, alternatively, all weights which are not
more than twice the standard deviation away from the mean. Children with
weights outside this interval may be categorized as underweight or overweight.
Note that we make a specific assumption for the distribution class; i.e. we
assume a normal distribution for the weights and estimate its parameters. We call
this a parametric estimation problem because it is based on distributional
assumptions. Otherwise, if no distributional assumptions are made, we speak of
a nonparametric estimation problem.

9.2 Properties of Point Estimators


As we discussed in the introduction, the primary goal in statistical inference is to
find a good estimate of (a) population parameter(s). The parameters are
associated with the probability distribution which is believed to characterize the
population; e.g. and are the parameters in a normal distribution . If
these parameters are known, then one can characterize the entire population. In
practice, these parameters are unknown, so the objective is to estimate them.
One can attempt to obtain them based on a function of the sample values. But
what does this function look like; and if there is more than one such function,
then which is the best one? What is the best approach to estimate the population
parameters on the basis of a given sample of data? The answer is given by
various statistical concepts such as bias, variability, consistency, efficiency,
sufficiency, and completeness of the estimates. We are going to introduce them
now.
Assume are the observations of a random sample from a
population of interest. The random sample represents the realized values of a
random variable X. It can be said that are the n observations collected
on the random variable X. Any function of random variables is called a statistic.
For example, , etc. are functions of ,
so each of them is a statistic. It follows that a statistic is also a random variable.
Consider a statistic T(X) which is used to estimate a population parameter
(which may be either a scalar or a vector). We say T(X) is an estimator of . To
indicate that we estimate using T(X), we use the “hat” symbol, i.e. we
write . When T is calculated from the sample values , we
write T(x) and call it an estimate of . It becomes clear that T(X) is a random
variable but T(x) is its observed value (dependent on the actual sample). For
example, is an estimator and a statistic, but
is its estimated value from the realized sample values . Since the
sample values are realizations from a random variable, each sample leads to a
different value of the estimate of the population parameter. The population
parameter is assumed to be a fixed value. Parameters can also be assumed to be
random, for example in Bayesian statistics, but this is beyond the scope of this
book.

9.2.1 Unbiasedness and Efficiency


Definition 9.2.1
An estimator T(X) is called an unbiased estimator of θ if

E_θ[T(X)] = θ.   (9.1)
The index θ denotes that the expectation is calculated with respect to the
distribution whose parameter is θ. The bias of an estimator T(X) is defined as

Bias_θ(T(X)) = E_θ[T(X)] − θ.   (9.2)
It follows that an estimator is said to be unbiased if its bias is zero.

Definition 9.2.2
The variance of T(X) is defined as
Var_θ(T(X)) = E_θ[(T(X) − E_θ[T(X)])²].   (9.3)

Both bias and variance are measures which characterize the properties of an
estimator. In statistical theory, we search for “good” estimators in the sense that
the bias and the variance are as small as possible and therefore the accuracy is as
high as possible. Readers interested in a practical example may consult
Examples 9.2.1 and 9.2.2, or the explanations for Fig. 9.1.
It turns out that we cannot minimize both measures simultaneously as there
is always a so-called bias–variance tradeoff. A measure which combines bias and
variance into one measure is the mean squared error.

Definition 9.2.3
The mean squared error (MSE) of T(X) is defined as

MSE_θ(T(X)) = E_θ[(T(X) − θ)²].   (9.4)
The expression (9.4) can be partitioned into two parts: the variance and the
squared bias, i.e.
MSE_θ(T(X)) = Var_θ(T(X)) + [Bias_θ(T(X))]².   (9.5)
This can be proven as follows:

Note that the calculation is based on the result that the cross product term is zero.
The mean squared error can be used to compare different biased estimators.

Definition 9.2.4
An estimator is said to be MSE-better than another estimator for
estimating if

where θ ∈ Θ and Θ is the parameter space, i.e. the set of all possible values of θ.
Often, Θ is ℝ or the set of all positive real values ℝ⁺. For example, for a normal
distribution N(μ, σ²), μ can be any real value and σ² has to be a number greater
than zero.
Unfortunately, we cannot find an MSE-optimal estimator in the sense that an
estimator is MSE-better than all other possible estimators for all possible values
of θ. This becomes clear if we define the constant estimator T(X) = c
(independent of the actual sample): if c = θ, i.e. if the constant value equals the
true population parameter we want to estimate, then the MSE of this constant
estimator is zero (but it will be greater than zero for all other values of θ, and the
bias increases the further we move c away from the true θ). Usually, we can
only find estimators which are locally best (in a certain subset of ). This is why
classical statistical inference restricts the search for best estimators to the class of
unbiased estimators. For unbiased estimators, the MSE is equal to the variance
of an estimator. In this context, the following definition is used for comparing
two (unbiased) estimators.

Definition 9.2.5
An unbiased estimator is said to be more efficient than another unbiased
estimator for estimating if

and

for at least one . It turns out that restricting our search of best estimators to
unbiased estimators is sometimes a successful strategy; i.e. for many problems, a
best or most efficient estimate can be found. If such an estimator exists, it is said
to be UMVU (uniformly minimum variance unbiased). Uniformly means that it
has the lowest variance among all other unbiased estimators for estimating the
population parameter(s) .

Consider the illustration in Fig. 9.1 to better understand the introduced concepts.
Suppose we throw three darts at a target and the goal is to hit the centre of the
target, i.e. the innermost circle of the dart board. The centre represents the
population parameter . The three darts play the role of three estimates
(based on different realizations of the sample) of the population parameter .
Four possible situations are illustrated in Fig. 9.1. For example, in Fig. 9.1b, we
illustrate the case of an estimator which is biased but has low variance: all three
darts are “far” away from the centre of the target, but they are “close” together. If
we look at Fig. 9.1a, c, we see that all three darts are symmetrically grouped
around the centre of the target, meaning that there is no bias; however, in
Fig. 9.1a there is much higher precision than in Fig. 9.1c. It is obvious that
Fig. 9.1a presents an ideal situation: an estimator which is unbiased and has
minimum variance.

Fig. 9.1 Illustration of bias and variance

Theorem 9.2.1
Let X_1, X_2, ..., X_n be an i.i.d. (random) sample of a random variable X with
population mean E(X_i) = μ and population variance Var(X_i) = σ², for all
i = 1, 2, ..., n. Then the arithmetic mean X̄ = (1/n) Σ_i X_i is an unbiased estimator of μ
and the sample variance S² = 1/(n−1) Σ_i (X_i − X̄)² is an unbiased estimator of σ².

Note that the theorem holds, in general, for i.i.d. samples, irrespective of the
choice of the distribution of the ’s. Note again that we are looking at the
situation before we have any observations on X. Therefore, we again use capital
letters to denote that the ’s are random variables which are not known
beforehand (i.e. before we actually record the observations on our selected
sampling units).

Remark 9.2.1
The empirical variance S̃² = (1/n) Σ_i (X_i − X̄)² is a biased estimate of σ² and its bias
is −(1/n)σ².
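
This bias can be illustrated with a small simulation; the sketch below uses hypothetical normally distributed data with σ² = 4 and n = 10, both chosen only for illustration:

set.seed(123)
n <- 10
sigma2 <- 4
reps <- 10000
s2.unbiased <- numeric(reps)   # sample variance, denominator n - 1
s2.biased   <- numeric(reps)   # empirical variance, denominator n
for (i in 1:reps) {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  s2.unbiased[i] <- var(x)
  s2.biased[i]   <- sum((x - mean(x))^2) / n
}
mean(s2.unbiased)          # close to sigma2 = 4 (unbiased)
mean(s2.biased)            # close to sigma2 * (n - 1) / n = 3.6 (biased)
mean(s2.biased) - sigma2   # close to the bias -sigma2 / n = -0.4

Repeating the sketch with a larger n shows the simulated bias shrinking towards zero.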

Example 9.2.1
Let be identically and independently distributed variables whose
population mean is and population variance is . Then is an
unbiased estimator of . This can be shown as follows:

The variance of can be calculated as follows:

We conclude that is an unbiased estimator of and its variance is


irrespective of the choice of the distribution of X. We have learned about the
distribution of already in Chap. 8, see also Appendix C.3 for the Theorem of
Large Numbers and the Central Limit Theorem; however, we would like to
highlight the property of “unbiasedness” in the current context.

Now, we consider another example to illustrate that estimators may not always
be unbiased but may have the same variance.

Example 9.2.2
Let be identically and independently distributed variables whose
population mean is and population variance is . Then
is a biased estimator of . This can be shown as
follows:

However, the variance of is

If we compare the two estimators and , we see that


both have the same variance but the former is biased. The efficiency of both
estimators is thus the same. It further follows that the mean squared error of is
smaller than the mean squared error of because the MSE consists of the sum
of the variance and the squared bias. Therefore is MSE-better than . The
comparison of bias, variance and MSE tells us that we should prefer over
when estimating the population mean. This is intuitive, but the argument we
make is a purely statistical one.

Theorem 9.2.1 contains the following special cases:


The sample mean based on an i.i.d. random sample
from a normally distributed population is an unbiased
point estimator of .

The sample variance based on an i.i.d. random sample


from a normally distributed population N(μ, σ²) is an unbiased
point estimator of σ². The empirical variance S̃² = (1/n) Σ_i (X_i − X̄)², in contrast, is a biased
estimator for σ², but it is asymptotically unbiased in the sense that its bias
tends to zero as the sample size n tends to infinity.
The sample mean based on an i.i.d. random sample
from a Bernoulli distributed population B(1, p) is an unbiased
point estimator of the probability p.
For illustration, we show the validity of the third statement. Let us consider
an i.i.d. random sample , , from a Bernoulli distribution, where
if an event occurs and otherwise. Here, p is the probability of
occurrence of an event in the population, i.e. . Note that p is also the
population mean: , . The arithmetic mean
(relative frequency) is an unbiased estimator of p because

and thus, we can write the estimate of p as

(9.6)

Example 9.2.3
Suppose a random sample of size n = 20 of the weight of 10-year-old children in
a particular city is drawn. Let us assume that the children’s weight in the
population follows a normal distribution . The sample provides the
following values of weights (in kg):
40.2, 32.8, 38.2, 43.5, 47.6, 36.6, 38.4, 45.5, 44.4, 40.3
34.6, 55.6, 50.9, 38.9, 37.8, 46.8, 43.6, 39.5, 49.9, 34.2
To obtain an estimate of the population mean , we calculate the arithmetic
mean of the observations as

because it is an unbiased estimator of . Similarly, we use to estimate


because it is unbiased in comparison to . Using as an estimate for for
the given observations, we get

The square root of 36.85 is approximately 6.07 which is the standard


deviation. Note that the standard deviation based on the sample values divided
by the square root of the sample size, i.e. , is called the standard error of
the mean (SEM). As already introduced in Chap. 3, we obtain these results in
R using the mean and var commands.
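
These calculations can be reproduced in R for the 20 sampled weights as follows (the standard error of the mean is computed manually since base R does not provide a dedicated function for it):

weight <- c(40.2, 32.8, 38.2, 43.5, 47.6, 36.6, 38.4, 45.5, 44.4, 40.3,
            34.6, 55.6, 50.9, 38.9, 37.8, 46.8, 43.6, 39.5, 49.9, 34.2)
mean(weight)                       # estimate of mu: approx. 41.97
var(weight)                        # unbiased estimate of sigma^2: approx. 36.85
sd(weight)                         # standard deviation: approx. 6.07
sd(weight) / sqrt(length(weight))  # standard error of the mean (SEM)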

Example 9.2.4
A library draws a random sample of size members from the members’
database to see how many members have to pay a penalty for returning books
late, i.e. . It turns out that 39 members in the sample have to pay a penalty.
Therefore, an unbiased estimator of the population proportion of all members of
the library who return books late is

Remark 9.2.2
Unbiasedness and efficiency can also be defined asymptotically: we say, for
example, that an estimator is asymptotically unbiased, if the bias approaches
zero when the sample size tends to infinity. The concept of asymptotic efficiency
involves some mathematical knowledge which is beyond the intended scope of
this book. Loosely speaking, an asymptotic efficient estimator is an estimator
which achieves the lowest possible (asymptotic) variance under given
distributional assumptions. The estimators introduced in Sect. 9.3.1, which are
based on the maximum likelihood principle, have these properties (under certain
mathematically defined regularity conditions).

Next, we illustrate the properties of consistency and sufficiency of an estimator.

9.2.2 Consistency of Estimators


For a good estimator, as the sample size increases, the values of the estimator
should get closer to the parameter being estimated. This property of estimators is
referred to as consistency.

Definition 9.2.6
Let be a sequence of estimators for the parameter where
is a function of . The sequence is a
consistent sequence of estimators for if for every ,
or equivalently

This definition says that as the sample size n increases, the probability that is
getting closer to is approaching 1. This means that the estimator is getting
closer to the parameter as n grows larger. Note that there is no information on
how fast is converging to in the sense of convergence defined above.

Example 9.2.5
Let be identically and independently distributed variables with
expectation and variance . Then for , we have and
. For any , we can write the following:

where . Using Tschebyschev’s inequality (Theorem 7.4.1, p. 139), we


get , and therefore

and

provided is finite. Hence converges to and therefore is a


consistent estimator of .

Remark 9.2.3
We call this type of consistency weak consistency. Another definition is MSE
consistency, which says that an estimator is MSE consistent if as
. If the estimator is unbiased, it is sufficient that as . If
is MSE consistent, it is also weakly consistent. Therefore, it follows that
an unbiased estimator with its variance approaching zero as the sample size
approaches infinity is both MSE consistent and weakly consistent.

In Example 9.2.5, the variance of X̄ is σ²/n, which goes to zero as n goes
to ∞, and therefore X̄ is both weakly consistent and MSE consistent.
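
Weak consistency can also be observed empirically by simulating samples of increasing size and watching the arithmetic mean settle around the true expectation; the following minimal sketch uses hypothetical parameters (μ = 5, σ = 2) chosen only for illustration:

set.seed(1)
mu <- 5
sigma <- 2
for (n in c(10, 100, 1000, 100000)) {
  xbar <- mean(rnorm(n, mean = mu, sd = sigma))
  cat("n =", n, " xbar =", round(xbar, 4), "\n")
}
# the deviation |xbar - mu| tends to shrink as n grows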

9.2.3 Sufficiency of Estimators


Sufficiency is another criterion to judge the quality of an estimator. Before
delving deeper into the subject matter, we first try to understand some basic
concepts.
Consider two independent random variables X and Y, each following a
distribution. We conclude that both X and Y contain information about .
Consider two estimators of as and . Suppose we want to
know whether to use or to estimate . We notice that
, ,
and . Using
the additivity property of the normal distribution, which was introduced in
Remark 8.2.2, we can say that and . So contains
information about , whereas does not contain any information about . In
other words, loses the information about . We call this property “loss of
information”.
If we want to make conclusions about using both X and Y, we need to
acknowledge that the dimension of them is 2. On the other hand, if we use or
equivalently , then we need to concentrate only on one variable and
we say that it has dimension 1. It follows that and provide the same
information about as provided by the entire sample on both X and Y. So we
can say that either or is sufficient to provide the same information about
that can be obtained on the basis of the entire sample. This is the idea behind
the concept of sufficiency and it results in the reduction of dimension. In general,
we can say that if all the information about contained in the sample of size n
can be obtained, for example, through the sample mean then it is sufficient to use
this one-dimensional summary statistic to make inference about .

Definition 9.2.7
Let be a random sample from a probability density function (or
probability mass function) . A statistic T is said to be sufficient for if the
conditional distribution of given is independent of .

The Neyman–Fisher Factorization Theorem provides a practical way to find


sufficient statistics.

Theorem 9.2.2
(Neyman–Fisher Factorization Theorem (NFFT)) Let be a random
sample from a probability density function (or probability mass function) .
A statistic is said to be sufficient for if and only if the joint
density of can be factorized as

where is nonnegative and does not involve ; and is a non-


negative function of which depends on only through t, which is a
particular value of T.
This theorem holds for discrete random variables too. Any one-to-one
function of a sufficient statistic is also sufficient. A function f is called one-to-
one if f(x) = f(y) implies x = y.

Example 9.2.6
Let be a random sample from where is unknown. We
attempt to find a sufficient statistic for . Consider the following function as the
joint distribution of (whose interpretation will become clearer in the
next section):

Here
Using the Neyman–Fisher Factorization Theorem, we conclude that
is a sufficient statistic for . Also,
is sufficient for as it is a one-to-one statistic of .
On the other hand, is not sufficient for as it is not a one-to-one function
of . The important point here is that is a function of the sufficient
statistic and hence a good estimator for . It is thus summarizing the sample
information about the parameter of interest in a complete yet parsimonious way.
Another, multivariate, example of sufficiency is given in Appendix C.4.

9.3 Point Estimation


In the previous section, we introduced and discussed various properties of
estimators. In this section, we want to show how one can find estimators with
good properties. In the general case, properties such as unbiasedness and
efficiency cannot be guaranteed for a finite sample. But often, the properties can
be shown to hold asymptotically.

9.3.1 Maximum Likelihood Estimation


We have used several estimators throughout the book without stating explicitly
that they are estimators. For example, we used the sample mean ( ) to estimate
in a distribution; we also used the sample proportion (relative
frequency) to estimate p in a B(1, p) distribution, etc. The obvious question is
how to obtain a good statistic to estimate an unknown parameter, for example
how to determine that the sample mean can be used to estimate . We need a
general framework for parameter estimation. The method of maximum
likelihood provides such an approach. For the purpose of illustration, we
introduce the method of maximum likelihood estimation with an example using
the Bernoulli distribution.
Example 9.3.1
Consider an i.i.d. random sample from a Bernoulli population
with and . The joint probability mass function for
a given set of realizations (i.e. the data) is

(9.7)

This is a function of given the parameter p. The product results


from the fact that the draws are independent and the fact that P(X_i = 1) = p if x_i = 1
and P(X_i = 0) = 1 − p if x_i = 0. That is, the term p^{x_i}(1 − p)^{1 − x_i} covers
both possible outcomes.
values are known, for example . Then, (9.7)
can be seen as a function of p because is known. In this case, after
obtaining a sample of data, the function is called the likelihood function and can
be written as

(9.8)
The joint density function of is called the likelihood function. For
better understanding, consider a sample of size 5 with
. The likelihood (function) is

(9.9)
The maximum likelihood estimation principle now says that the estimator of p
is the value of p which maximizes the likelihood (9.8) or (9.9). In other words,
the maximum likelihood estimate is the value which maximizes the probability
of observing the realized sample from the likelihood function. In general, i.e. for
any sample, we have to maximize the likelihood function (9.9) with respect to p.
We use the well-known principle of maxima–minima to maximize the likelihood
function in this case. In principle, any other optimization procedure can also be
used, for example numerical algorithms such as the Newton–Raphson algorithm.
If the likelihood is differentiable, the first-order condition for the maximum is
that the first derivative with respect to p is zero. For maximization, we can
transform the likelihood by a strictly monotone increasing function. This
guarantees that the potential maximum is taken at the same point as in the
original likelihood. A good and highly common choice is the natural logarithm
since it transforms products in sums and sums are easy to differentiate by
differentiating each term in the sum. The log-likelihood in our example is
therefore

(9.10)
(9.11)
where ln denotes the natural logarithm function and we use the rules
ln(a · b) = ln(a) + ln(b) and ln(a^b) = b · ln(a).
Taking the first derivative of (9.10) with respect to p results in

(9.12)
Setting (9.12) to zero and solving for p leads to

The value of the second-order partial derivative of (9.9) with respect to p at


is negative which ensures that maximizes the likelihood function. It
follows from this example that the maximum likelihood estimate for p leads to
the well-known arithmetic mean. Figure 9.2 shows the likelihood function and
the log-likelihood function as functions of p, where . The figures show
that the likelihood function and the log-likelihood function have the same
maxima at .
Fig. 9.2 Illustration of the likelihood and log-likelihood function of a binomial distribution
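
The maximization can also be carried out numerically in R. The sketch below uses a hypothetical Bernoulli sample of size 5 with two successes (not necessarily the sample used in the example) and compares the numerical maximum with the arithmetic mean:

x <- c(1, 0, 1, 0, 0)       # hypothetical Bernoulli sample
loglik <- function(p) sum(x * log(p) + (1 - x) * log(1 - p))
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum
mean(x)                     # analytical ML estimate: the arithmetic mean (0.4)
# plot the log-likelihood over a grid of p values
p.grid <- seq(0.01, 0.99, by = 0.01)
plot(p.grid, sapply(p.grid, loglik), type = "l",
     xlab = "p", ylab = "log-likelihood")
abline(v = mean(x), lty = 2)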

Maximum likelihood estimators have some important properties: they are


usually consistent, asymptotically unbiased, asymptotically normally distributed,
asymptotically efficient, and sufficient. Even if they are not, a function of a
sufficient statistic can always be found which has such properties. This is the
reason why maximum likelihood estimation is popular. By “asymptotically” we
mean that the properties hold as n tends to infinity, i.e. as the sample size
increases. There might be other good estimators in a particular context, for
example estimators that are efficient and not only asymptotically efficient;
however, in general, the ML principle is a great choice in many circumstances.
We are going to use it in the following sections and chapters, for instance for
general point estimation and in the linear regression model (Chap. 11).

Remark 9.3.1
More examples of maximum likelihood estimators are given in Exercises
9.1–9.3.

9.3.2 Method of Moments


The method of moments is another well-known method to derive the estimators
for population parameters. Below, we outline this principle briefly by way of
example.
The idea is that the population parameters of interest can be related to the
moments (e.g. expectation, variance) of the distribution of the considered
random variables.
A simple case is the estimator for the expected value of a population
using an i.i.d. random sample . In this case, is the natural
moment estimator of . Further, since , an estimator of is
. Using as an estimator for , this results in the biased, but
asymptotically unbiased estimate

An extension of this method is the generalized method of moments (GMM) .


GMM estimators have interesting properties: under relatively weak conditions
(not further discussed here), they are consistent and asymptotically normal, as
well as efficient in the class of those estimators that do not use any additional
information besides the information included in the moment conditions. Usually,
they require a two-step estimating approach or an iterative estimating procedure.
The least squares estimator for a linear regression model with i.i.d. random
errors, discussed in detail in Chap. 11, can be seen as a special case of a GMM
estimator.
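
As a brief illustration of the plain method of moments (not GMM), the following sketch matches the first two empirical moments of a hypothetical normal sample to μ and σ²; the sample and its parameters are assumptions made only for illustration:

set.seed(42)
x <- rnorm(50, mean = 3, sd = 2)       # hypothetical i.i.d. sample
mu.hat     <- mean(x)                  # moment estimator of mu
sigma2.hat <- mean((x - mean(x))^2)    # moment estimator of sigma^2 (biased, asymptotically unbiased)
c(mu.hat, sigma2.hat)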

9.4 Interval Estimation


9.4.1 Introduction
Let us first consider an example to understand what we mean by interval
estimation. Consider a situation in which a lady wants to know the time taken to
travel from her home to the train station. Suppose she makes 20 trips and notes
down the time taken. To get an estimate of the expected time, one can use the
arithmetic mean. Let us say min. This is the point estimate for the
expected travelling time. It may not be appropriate to say that she will always
take exactly 25 min to reach the train station. Rather the time may vary by a few
minutes each time. To take this into account, the time can be estimated in the
form of an interval: it may then be found that the time varies mostly between 20
and 30 min. Such a statement is more informative. Both expectation and
variation of the data are taken into account. The interval (20, 30 min) provides a
range in which most of the values are expected to lie. We call this concept
interval estimation.
A point estimate on its own does not take into account the precision of the
estimate. The deviation between the point estimate and the true parameter (e.g.
) can be substantial, especially when the sample size is small. To
incorporate the information about the precision of an estimate in the estimated
value, a confidence interval can be constructed. It is a random interval with
lower and upper bounds, and , such that the unknown parameter is
covered by a prespecified probability of at least :

(9.13)
The probability is called the confidence level or confidence
coefficient, is called the lower confidence bound or lower confidence
limit and is called the upper confidence bound or upper confidence
limit. It is important to note that the bounds are random and the parameter is a
fixed value. This is the reason why we say that the true parameter is covered by
the interval with probability and not that the probability that the interval
contains the parameter is . Please note that some software packages use the
term “error bar” when referring to confidence intervals.
Frequency interpretation of the confidence interval: Suppose N independent
samples , , of size n are sampled from the same population and
N confidence intervals of the form are calculated. If N is large
enough, then on an average of the intervals (9.13) cover the true
parameter.

Example 9.4.1
Let a random variable follow a normal distribution with and .
Suppose we draw a sample of observations repeatedly. The sample will
differ in each draw, and hence, the mean and the confidence interval will also
differ. The data sets are realizations from random variables. Have a look at
Fig. 9.3 which illustrates the mean and the 95 % confidence intervals for 6
random samples. They vary with respect to the mean and the confidence interval
width. Most of the means are close to , but not all. Similarly, most
confidence intervals, but not all, include . This is the idea of the frequency
interpretation of the confidence interval: different samples will yield different
point and interval estimates. Most of the times the interval will cover , but not
always. The coverage probability is specified by , and the frequency
interpretation means that we expect that (approximately) of the
intervals to cover the true parameter . In that sense, the location of the interval
will give us some idea about where the true but unknown population parameter
lies, while the length of the interval reflects our uncertainty about : the wider
the interval is, the higher is our uncertainty about the location of .

Fig. 9.3 Frequency interpretation of confidence intervals
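
The frequency interpretation can be checked by a small simulation. The following sketch uses hypothetical values (μ = 0, σ = 1, n = 100, N = 1000 repetitions) together with the known-variance interval of Sect. 9.4.2 and counts how often the interval covers the true mean:

set.seed(1)
mu <- 0; sigma <- 1; n <- 100; N <- 1000
covered <- logical(N)
for (i in 1:N) {
  x <- rnorm(n, mean = mu, sd = sigma)
  lower <- mean(x) - qnorm(0.975) * sigma / sqrt(n)
  upper <- mean(x) + qnorm(0.975) * sigma / sqrt(n)
  covered[i] <- (lower <= mu) & (mu <= upper)
}
mean(covered)   # proportion of intervals covering mu, close to 0.95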

We now introduce the following confidence intervals:


Confidence interval for the mean of a normal distribution.
Confidence interval for the probability p of a binomial random variable.
Confidence interval for the odds ratio.

9.4.2 Confidence Interval for the Mean of a Normal


Distribution
Confidence Interval for μ When σ² is Known.
Let be an i.i.d. sample from a distribution where is
assumed to be known. We use the point estimate to estimate and
construct a confidence interval around the mean . Using the Central Limit
Theorem (Appendix C.3, p. 426), it follows that follows a
distribution. Therefore , and it follows that
(9.14)
where z_{1−α/2} denotes the (1 − α/2) quantile of the standard normal distribution
N(0, 1). We solve this inequality for the unknown μ and get the desired
confidence interval as follows:

or

The confidence interval for μ is thus obtained as

[X̄ − z_{1−α/2} · σ/√n,  X̄ + z_{1−α/2} · σ/√n].   (9.15)
This is known as the (1 − α) confidence interval for μ or the confidence
interval for μ with confidence coefficient 1 − α.
We can use the R function qnorm or Table C.1 to obtain z_{1−α/2}, see also
Sects. 8.4, A.3, and C.7. For example, for α = 0.05 and α = 0.01 we get z_{0.975}
= 1.96 and z_{0.995} = 2.576 using qnorm(0.975) and
qnorm(0.995). This gives us the quantiles we need to determine a 95 % and
99 % confidence interval, respectively.

Example 9.4.2
We consider again Example 9.2.3 where we evaluated the weight of 10-year-old
children. Assume that the variance is known to be 36; then the upper and lower
limits of a 95 % confidence interval for the expected weight can be calculated
as follows:

We get the confidence interval = [39.34, 44.59]. With


confidence, the true parameter is covered by the interval [39.34, 44.59].
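
The limits can be reproduced in R with the weight data from Example 9.2.3, again treating σ² = 36 as known:

weight <- c(40.2, 32.8, 38.2, 43.5, 47.6, 36.6, 38.4, 45.5, 44.4, 40.3,
            34.6, 55.6, 50.9, 38.9, 37.8, 46.8, 43.6, 39.5, 49.9, 34.2)
n <- length(weight)
mean(weight) + c(-1, 1) * qnorm(0.975) * sqrt(36) / sqrt(n)   # approx. [39.34, 44.59]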
Confidence Interval for μ When σ² is Unknown.
Let be an i.i.d. sample from where is assumed to be
unknown and is being estimated by the sample variance . We know from
Sect. 8.3.1 that

It can be shown that and are stochastically independent. Thus, we


know that

follows a t-distribution with degrees of freedom. We can use this result


to determine the confidence interval for as

or

The confidence interval for μ is thus obtained as

[X̄ − t_{n−1; 1−α/2} · S/√n,  X̄ + t_{n−1; 1−α/2} · S/√n],   (9.16)
which is the (1 − α) confidence interval for μ or the confidence interval
for μ with confidence coefficient 1 − α.
The interval (9.16) is, in general, wider than the interval (9.15) for identical
and identical sample size n, since the unknown parameter is estimated by
which induces additional uncertainty. The quantiles for the t-distribution can
be obtained using the R command qt or Table C.2.

Example 9.4.3
Consider Example 9.4.2 where we evaluated the weight of 10-year-old children.
We have already calculated the point estimate of as . With =
2.093, obtained via qt(0.975,19) or Table C.2, the upper and lower limits of
a 95 % confidence interval for are obtained as

Therefore, the confidence interval is = [39.13, 44.81]. In R, we can


use the conf.int value of the t.test command to get a confidence interval
for the mean (see also Example 10.3.3 for more details on t.test). The default
is a 95 % confidence interval, but it can be changed easily if desired:
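
A minimal sketch using the weight vector defined above (the exact print layout of the output may differ between R versions):

t.test(weight)$conf.int                      # 95 % CI: approx. [39.13, 44.81]
t.test(weight, conf.level = 0.99)$conf.int   # 99 % CI, if desired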

There is no unique best way to draw the calculated confidence intervals in R.


Among many other options, one can simply work with the plot functionality or
use geom_errorbar in conjunction with a ggplot object created with the
library ggplot2 , or use the plotCI command in the library plotrix.

9.4.3 Confidence Interval for a Binomial Probability


Let be an i.i.d. sample from a Bernoulli distribution B(1, p). Then
has a binomial distribution B(n, p).
We have already introduced as an estimator for p:

From (8.8), we know that . Applying rule (7.33), the


variance of the estimator is

and it can be estimated by

Nowadays, the exact confidence intervals of the binomial distribution


function can be easily calculated using computer implementations. Nevertheless,
(i) for a sufficiently large sample size n, (ii) if p is not extremely low or high,
and (iii) if the condition is fulfilled, we can use an approximation
based on the normal distribution to calculate confidence intervals. To be more
specific, one can show that

(9.17)
This gives us

(9.18)
and we get a confidence interval for p as

[p̂ − z_{1−α/2} · √(p̂(1 − p̂)/n),  p̂ + z_{1−α/2} · √(p̂(1 − p̂)/n)].   (9.19)

Example 9.4.4
We look again at Example 9.2.4 where we evaluated the proportion of members
who had to pay a penalty. Out of all borrowers, 39 % brought back their books
late and thus had to pay a fee. A 95 % confidence interval for the probability p of
bringing back a book late can be constructed using the normal approximation,
since . With and ,
we get the 95 % confidence interval as

In R, an exact confidence interval can be found using the function


binom.test:
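
A minimal sketch, assuming (as the reported 39 % suggests) a sample of n = 100 members of whom 39 paid a penalty; the first lines reproduce the normal approximation, the last line the exact interval:

p.hat <- 39 / 100
p.hat + c(-1, 1) * qnorm(0.975) * sqrt(p.hat * (1 - p.hat) / 100)  # normal approximation
binom.test(x = 39, n = 100)$conf.int                               # exact 95 % CI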

One can see that the exact and approximate confidence limits differ slightly
due to the normal approximation which approximates the exact binomial
probabilities.

9.4.4 Confidence Interval for the Odds Ratio


In Chap. 4, we introduced the odds ratio to determine the strength of association
between two binary variables. One may be interested in the dispersion of the
odds ratio and hence calculate a confidence interval for it. Recall the notation for
contingency tables:

                   Y                      Total (row)
 X                 a          b           a + b
                   c          d           c + d
 Total (column)    a + c      b + d       n

In the spirit of the preceding sections, we can interpret the entries in this
contingency table as population parameters. For example, a describes the
absolute frequency of observations in the population for which and
. If we have a sample then we can estimate a by the number of observed
observations for which and . We can thus view to be an
estimator for a, to be an estimator for b, to be an estimator for c, and
to be an estimator for d. It follows that

(9.20)
serves as the point estimate for the population odds ratio . To
construct a confidence interval for the odds ratio, we need to work on a log-
scale. The log odds ratio,
(9.21)
takes the natural logarithm of the odds ratio. It is evident that it can be
estimated using the observed absolute frequencies of the joint frequency
distribution of X and Y:

(9.22)
It can be shown that follows approximately a normal distribution with
expectation and standard deviation

(9.23)
Following the reasoning explained in the earlier section on confidence
intervals for binomial probabilities, we can calculate the confidence
interval for under a normal approximation as follows:

(9.24)
Since we are interested in the confidence interval of the odds ratio, and not
the log odds ratio, we need to transform back the lower and upper bound of the
confidence interval as
(9.25)

Example 9.4.5
Recall Example 4.2.5 from Chap. 4 where we were interested in the association
of smoking with a particular disease. The data is summarized in the following
contingency table:

                         Smoking              Total (row)
                         Yes       No
 Disease      Yes        34        66          100
              No         22        118         140
 Total (column)          56        184         240

The odds ratio was estimated to be 2.76, and we therefore concluded that the
chances of having the particular disease are 2.76 times higher for smokers
compared with non-smokers. To calculate a 95 % confidence interval, we need
, and

The confidence interval for the log odds ratio is

Exponentiation of the confidence interval bounds yields the 95 % confidence


interval for the odds ratio as

There are many ways to obtain the same results in R. One option is to use the
oddsratio function of the library epitools . Note that we need to specify
“wald” under the methods option to get confidence intervals which use the
normal approximation as we did in this case.
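
A minimal base-R sketch of the manual calculation (the variable names n11, ..., n22 are simply labels for the four cell counts; the epitools call mentioned above should give essentially the same interval):

n11 <- 34; n12 <- 66; n21 <- 22; n22 <- 118
or.hat <- (n11 * n22) / (n12 * n21)                    # approx. 2.76
se.log <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)          # standard deviation of ln(OR)
ci.log <- log(or.hat) + c(-1, 1) * qnorm(0.975) * se.log
exp(ci.log)                                            # approx. [1.49, 5.11]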

9.5 Sample Size Determinations


Confidence intervals help us estimating the precision of point estimates. What if
we are required to adhere to a prespecified precision level? We know that the
variance decreases as the sample size increases. In turn, confidence intervals
become narrower. On the other hand, increasing the sample size has its own
consequences. For example, the cost and time involved in setting up
experiments, or conducting a survey, increases. In these situations it is important
to find a balance between the variability of the estimates and the sample size. We
cannot control the variability in the data in most of the situations, but it is
possible to control the sample size and therefore the precision of our estimates.
For example, we can control the number of people to be interviewed in a survey
—given the resources which are available. We discuss how to determine the
number of observations needed to get a particular precision (length) of the
confidence interval. We find the answers to such questions using the formulae
for confidence intervals.
Sample Size Calculation for μ.
Let us consider the situation where we are interested in estimating the
population mean . The length of the confidence interval (9.15) for the point
estimate is

(9.26)
We would now like to fix the width of the confidence interval and come up
with a sample size which is required to achieve this width. Let us fix the length
of the confidence interval as

(9.27)
Assume we have knowledge of . The knowledge about can be
obtained, for example, through a pilot study or past experience with the
experiment. We are interested in obtaining the value of n for which a confidence
interval has a fixed confidence width of or less. Rearranging (9.27) gives us

(9.28)
This means a minimum or optimum sample size is

(9.29)
The sample size ensures that the confidence interval for has at
most length . But note that we have assumed that is known. If we do not
know (which is more likely in practice), we have to make an assumption
about it, e.g. by using an estimate from a former study, a pilot study, or other
external information. Practically, (9.28) is used in the case of known and
unknown .

Example 9.5.1
A call centre is interested in determining the expected length of a telephone call
as precisely as possible. The requirements are that the 95 % confidence interval
for should have a width of 1 min. Suppose that the call centre has developed a
pilot study in which was estimated to be 5 min. The sample size n that is
needed to estimate the expected length of the phone calls with the desired
precision is:

This means that at least 384 calls are required to get the desired confidence
interval width.

Sample Size Calculation for p.


We can follow the earlier reasoning and determine the optimum sample size
for a specific confidence interval width using the confidence interval definition
(9.19). Since the width of the confidence interval is

we get

(9.30)
Example 9.5.2
A factory may be interested in the probability of an error in an operating process.
The length of the confidence interval should be , i.e. . Suppose it is
speculated that the error probability is 10 %; we may then use as our
prior judgment for the true value of p. This yields

(9.31)
This means we need a sample size of at least 865 to obtain the desired width of
the confidence interval for p.
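
A minimal R sketch of this calculation; the prespecified total interval width of 0.04 is an assumption made here for illustration (it is consistent with the reported result), while p = 0.1 is the speculated error probability from the text:

z     <- qnorm(0.975)   # 1.96 for a 95 % confidence interval
p0    <- 0.10           # speculated error probability
delta <- 0.04           # assumed total width of the confidence interval
n.min <- (2 * z / delta)^2 * p0 * (1 - p0)
ceiling(n.min)          # 865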

The above examples for both and p have shown us that without external
knowledge about the research question of interest, it is difficult to come up with
an appropriate sample size. Results may vary considerably depending on what
type of information is assumed to be known. With limited knowledge, it can be
useful to report results for different widths of confidence intervals and
hypothesized values of p or .
Sample size calculations can be highly complex in many practical situations
and may not remain as simple as in the examples considered here. For example,
Chap. 10 uses additional concepts in the context of hypothesis testing, such as
the power, which can be taken into consideration when estimating sample sizes.
However, in this case, calculations and interpretations become more difficult and
complex. A detailed overview of sample size calculations can be found in Chow
et al. (2007) and Bock (1997).

9.6 Key Points and Further Issues


9.7 Exercises
Exercise 9.1
Consider an i.i.d. sample of size n from a Po distributed random variable X.

(a) Determine the maximum likelihood estimate for .

(b) What does the log-likelihood function look like for the following
realizations: Plot the function using R.
Hint: The curve command can be used to plot functions.
(c) Use the Neyman–Fisher Factorization Theorem to argue that the maximum
likelihood estimate obtained in (a) is a sufficient statistic for .

Exercise 9.2
Consider an i.i.d. sample of size n from a distributed random variable X.

(a) Determine the maximum likelihood estimator for under the assumption
that .

(b) Now determine the maximum likelihood estimator for for an arbitrary
.

(c) What is the maximum likelihood estimate for ?

Exercise 9.3
Let be n i.i.d. random variables which follow a uniform
distribution, . Write down the likelihood function and argue, without
differentiating the function, what the maximum likelihood estimate of is.

Exercise 9.4
Let be n i.i.d. random variables which follow an exponential
distribution. An intelligent statistician proposes to use the following two
estimators to estimate :

(i) with and ,

(ii) .

(a) Are both and (asymptotically) unbiased for ?


(b) Calculate the mean squared error of both estimators. Which estimator is
more efficient?

(c) Is MSE consistent, weakly consistent, both, or not consistent at all?

Exercise 9.5
A national park in Namibia determines the weight (in kg) of a sample of
common eland antelopes:
450 730 700 600 620 660 850 520 490 670 700 820
910 770 760 620 550 520 590 490 620 660 940 790
Calculate

(a) the point estimate of and and

(b) the confidence interval for ( ).


under the assumption that the weight is normally distributed.

(c) Use R to reproduce the results from (b).

Exercise 9.6
We are interested in the heights of the players of the two basketball teams “Brose
Baskets Bamberg” and “Bayer Giants Leverkusen” as well as the football team
“SV Werder Bremen”. The following summary statistics are given:

             N    Minimum   Maximum   Mean     Std. dev.
 Bamberg    16    185       211       199.06   7.047
 Leverkusen 14    175       210       196.00   9.782
 Bremen     23    178       195       187.52   5.239

Calculate a 95 % confidence interval for for all three teams and interpret the
results.
Exercise 9.7
A married couple tosses a coin after each dinner to determine who has to wash
the dishes. If the coin shows “head”, then the husband has to wash the dishes,
and if the coin shows “tails”, then the wife has to wash the dishes. After 98
dinners, the wife notes that the coin has shown head 59 times.

(a) Estimate the probability that the wife has to wash the dishes.

(b) Calculate and interpret the 95 % confidence interval for p.

(c) How many dinners are needed to estimate the true probability for the coin
showing “head” with a precision of ±0.5 % under the assumption that the
coin is fair?

Exercise 9.8
Suppose 93 out of 104 pupils have passed the final examination at a certain
school.

(a) Calculate a 95 % confidence interval for the probability of failing the


examination both by manual calculations and by using R, and compare the
results.

(b) At county level 3.2 % of pupils failed the examination. Are the school’s
pupils worse than those in the whole county?

Exercise 9.9
To estimate the audience rate for several TV stations, 3000 households are asked
to allow a device, which records which TV station is watched, to be installed on
their TVs. 2500 agreed to participate. Assume it is of interest to estimate the
probability of someone switching on the TV and watching the show “Germany’s
next top model”.

(a) What is the precision with which the probability can be estimated?

(b) What source of bias could potentially influence the estimates?


Exercise 9.10
An Olympic decathlon athlete is interested in his performance compared with
the performance of other athletes. He is a good runner and interested in his
100 m results compared with those of other athletes.

(a) He uses the decathlon data from this book (Appendix A.2) to come up with
. What sample size does he need to calculate a 95 %
confidence interval for the mean running time which is precise to s?

(b) Calculate a 95 % confidence interval for the mean running time ( )


of the 30 athletes captured in the data set in Chap. A.2. Interpret the width
of this interval compared with the width determined in a).

(c) The runner’s own best time is 10.86 s. He wants to be among the best 10 %
of all athletes. Calculate an appropriate confidence interval to compare his
time with the 10 % best times.

Exercise 9.11
Consider the pizza delivery data described in Chap. A.4. We distinguish between
pizzas delivered on time (i.e. in less than 30 min) and not delivered on time (i.e.
in more than 30 min). The contingency table for delivery time and operator looks
as follows:

                 Operator              Total
                 Laura      Melissa
 < 30 min        163        151        314
 > 30 min        475        477        952
 Total           638        628        1266

(a) Calculate and interpret the odds ratio and its 95 % confidence interval.

(b) Reproduce the results from (a) using R.


© Springer International Publishing Switzerland 2016
Christian Heumann, Michael Schomaker and Shalabh, Introduction to Statistics and Data Analysis ,
DOI 10.1007/978-3-319-46162-5_10

10. Hypothesis Testing


Christian Heumann1 , Michael Schomaker2 and Shalabh3
(1) Department of Statistics, Ludwig-Maximilians-Universität München,
München, Germany
(2) Centre for Infectious Disease Epidemiology and Research, University of
Cape Town, Cape Town, South Africa
(3) Department of Mathematics and Statistics, Indian Institute of Technology
Kanpur, Kanpur, India

Christian Heumann
Email: Christian.heumann@stat.uni-muenchen.de

10.1 Introduction
We introduced point and interval estimation of parameters in the previous
chapter. Sometimes, the research question is less ambitious in the sense that we
are not interested in precise estimates of a parameter, but we only want to
examine whether a statement about a parameter of interest or the research
hypothesis is true or not (although we will see later in this chapter that there is a
connection between confidence intervals and statistical tests, called duality).
Another related issue is that once an analyst estimates the parameters on the
basis of a random sample, (s)he would like to infer something about the value of
the parameter in the population. Statistical hypothesis tests facilitate the
comparison of estimated values with hypothetical values.

Example 10.1.1
As a simple example, consider the case where we want to find out whether the
proportion of votes for a party P in an election will exceed 30 % or not.
Typically, before the election, we will try to get representative data about the
election proportions for different parties (e.g. by telephone interviews) and then
make a statement like “yes”, we expect that P will get more than 30 % of the
votes or “no”, we do not have enough evidence that P will get more than 30 % of
the votes. In such a case, we will only know after the election whether our
statement was right or wrong. Note that the term representative data only means
that the sample is similar to the population with respect to the distributions of
some key variables, e.g. age, gender, and education. Since we use one sample to
compare it with a fixed value (30 %), we call it a one-sample problem .

Example 10.1.2
Consider another example in which a clinical study is conducted to compare the
effectiveness of a new drug (B) to an established standard drug (A) for a specific
disease, for example too high blood pressure. Assume that, as a first step, we
want to find out whether the new drug causes a higher reduction in blood
pressure than the already established older drug. A frequently used study design
for this question is a randomized (i.e. patients are randomly allocated to one of
the two treatments) controlled clinical trial (double blinded, i.e. neither the
patient nor the doctor know which of the drugs a patient is receiving during the
trial), conducted in a fixed time interval, say 3 months. A possible hypothesis is
that the average change in the blood pressure in group B is higher than in group
A, i.e. where and is the average blood pressure
at baseline before measuring the blood pressure again after 3 months ( ). Note
that we expect both the differences and to be positive, since otherwise we
would have some doubt that either drug is effective at all. As a second step (after
statistically proving our hypothesis), we are interested in whether the
improvement of B compared to A is relevant in a medical or biological sense and
is valid for the entire population or not. This will lead us again to the estimation
problems of the previous chapter, i.e. quantifying an effect using point and
interval estimation. Since we are comparing two drugs, we need to have two
samples from each of the drugs; hence, we have a two-sample problem . Since
the patients receiving A are different from those receiving B in this example, we
refer to it as a “two-independent-samples problem”.

Example 10.1.3
In another example, we consider an experiment in which a group of students
receives extra mathematical tuition. Their ability to solve mathematical problems
is evaluated before and after the extra tuition. We are interested in knowing
whether the ability to solve mathematical problems increases after the tuition, or
not. Since the same group of students is used in a pre–post experiment, this is
called a “two-dependent-samples problem” or a “paired data problem”.

10.2 Basic Definitions


10.2.1 One- and Two-Sample Problems
In one-sample problems, the data is usually assumed to arise as one sample from
a defined population. In two-sample problems, the data originates in the form of
two samples possibly from two different populations. The heterogeneity is often
modelled by assuming that the two populations only differ in some parameters or
key quantities such as expectation (i.e. mean), median, or variance. As in our
introductory example, the samples can either be independent (as in the drug
Example 10.1.2) or dependent (as in the evaluation Example 10.1.3).

10.2.2 Hypotheses
A researcher may have a research question for which the truth about the
population of interest is unknown. Suppose data can be obtained using a survey,
observation, or an experiment: if, given a prespecified uncertainty level, a
statistical test based on the data supports the hypothesis about the population, we
say that this hypothesis is statistically proven. Note that the research question
has to be operationalized before it can be tested by a statistical test. Consider the
drug Example 10.1.2: we want to examine whether the new drug B has a greater
blood pressure lowering effect than the standard drug A. We have several options
to operationalize this research question into a statistical set-up. One is to test
whether the average reduction (from baseline to 3 months) of the blood pressure
is higher (and positive) for drug B than drug A. We then state our hypotheses in
terms of expected values (i.e. ). Why do we have to use the expected values
and not simply compare the arithmetic means ? The reason is that the
superiority of B shown in the sample will only be valid for this sample and not
necessarily for another sample. We need to show the superiority of B in the
entire population, and hence, our hypothesis needs to reflect this. Another option
would be, for example, to use median changes in blood pressure values instead
of mean changes in blood pressure values. An important point is that the
research hypothesis which we want to prove has to be formulated as the
statistical alternative hypothesis , often denoted by . The reason for this will
become clearer later in this chapter. The opposite of the research hypothesis has
to be formulated as the statistical null hypothesis , denoted by . In the drug
example, the alternative and null hypotheses are, respectively,

and

We note that the two hypotheses are disjoint and the union of them covers all
possible differences of and . There is a boundary value ( ) which
separates the two hypotheses. Since we want to show the superiority of B, the
hypothesis was formulated as a one-sided hypothesis. Note that there are
different ways to formulate two-sample hypotheses; for example, is
equivalent to . In fact, it is very common to formulate two-sample
hypotheses as differences, which we will see later in this chapter.

10.2.3 One- and Two-Sided Tests


We distinguish between one-sided and two-sided hypotheses and tests. In the
previous section, we gave an example of a one-sided test.
For an unknown population parameter (e.g. ) and a fixed value (e.g.
5), the following three cases have to be distinguished:

Case Null hypothesis Alternative hypothesis


(a) Two-sided test problem

(b) One-sided test problem

(c) One-sided test problem

Example 10.2.1
One-sample problems often test whether a target value is achieved or not. For
example, consider the null hypothesis as
average filling weight of packages of flour = 1 kg
average body height (men) = 178 cm.
The alternative hypothesis is formulated as deviation from the target
value. If deviations in both directions are interesting, then is formulated as a
two-sided hypothesis,
average body height (men) 178 cm.
If deviations in a specific direction are the subject of interest, then is
formulated as a one-sided hypothesis, for example,
average filling weight of flour packages is lower than 1 kg.
average filling weight of flour packages is greater than 1 kg.
Two-sample problems often examine differences of two samples. Suppose
the null hypothesis is related to the average weight of flour packages filled by
two machines, say 1 and 2. Then, the null hypothesis is
average weight of flour packages filled by machine 1 = average weight
of flour packages filled by machine 2.
Then, can be formulated as a one-sided or two-sided hypothesis. If we
want to prove that machine 1 and machine 2 have different filling weights, then
would be formulated as a two-sided hypothesis
average filling weight of machine 1 average filling weight of
machine 2.
If we want to prove that machine 1 has lower average filling weight than
machine 2, would be formulated as a one-sided hypothesis
average filling weight of machine 1 < average filling weight of
machine 2.
If we want to prove that machine 2 has lower filling weight than machine 1,
would be formulated as a one-sided hypothesis
average filling weight of machine 1 > average filling weight of
machine 2.

Remark 10.2.1
Note that we have not considered the following situation: ,
. In general, with the tests described in this chapter, we cannot prove the equality
of a parameter to a predefined value and neither can we prove the equality of
two parameters, as in , . We can, for example, not prove
(statistically) that machines 1 and 2 in the previous example provide equal filling
weight. This would lead to the more complex class of equivalence tests , which
is a topic beyond the scope of this book.

10.2.4 Type I and Type II Error


If we undertake a statistical test, two types of error can occur.
The hypothesis is true but is rejected; this error is called type I error .
The hypothesis is not rejected although it is wrong; this is called type II
error.
When a hypothesis is tested, then the following four situations are possible:

is true is not true

is not rejected Correct decision Type II error

is rejected Type I error Correct decision

The significance level is the probability of type I error, , which is


the probability of rejecting (accepting ) if is true. If we construct a test,
the significance level is prespecified, e.g. . A significance test is
constructed such that the probability of a type I error does not exceed while
the probability of a type II error depends on the true but unknown parameter
values in the population(s) and the sample size. Therefore, the two errors are not
symmetrically treated in a significance test. In fact, the type II error ,
is not controlled by the construction of the test and can become very
high, sometimes up to 1 − α. This is the reason why a test not rejecting is not
a (statistical) proof of . In mathematical statistics, one searches for the best
test which maintains and minimizes . Minimization of both and
simultaneously is not possible. The reason is that when increases then
decreases and vice versa. So one of the errors needs to be fixed and the other
error is minimized. Consequently, the error which is considered more serious is
fixed and then the other error is minimized. The tests discussed in the below
sections are obtained based on the assumption that the type I error is more
serious than the type II error. So the test statistics are obtained by fixing and
then minimizing . In fact, the null hypothesis is framed in such a way that it
implies that the type I error is more serious than the type II error. The probability
is called the power of the test. It is the probability of making a
decision in favour of the research hypothesis , if it is true, i.e. the probability
of detecting a correct research hypothesis.

10.2.5 How to Conduct a Statistical Test


In general, we can follow the steps described below to test a hypothesis about a
population parameter based on a sample of data.

(1) Define the distributional assumptions for the random variables of interest,
and specify them in terms of population parameters (e.g. or and ).
This is necessary for parametric tests. There are other types of tests, so-
called nonparametric tests, where the assumptions can be relaxed in the
sense that we do not have to specify a particular distribution, see
Sect. 10.6ff. Moreover, for some tests the distributional assumptions can be
relaxed if the sample size is large.

(2) Formulate the null hypothesis and the alternative hypothesis as described in
Sects. 10.2.2 and 10.2.3.

(3) Fix a significance value (often called type I error) , for example ,
see also Sect. 10.2.4.

(4) Construct a test statistic . The distribution of T has to


be known under the null hypothesis . We note again that
refers to the random variables before drawing the actual sample and
are the realized values (observations) in the sample.

(5) Construct a critical region K for the statistic T, i.e. a region where—if T
falls in this region— is rejected, such that

The notation means that this inequality must hold for all parameter
values that belong to the null hypothesis . Since we assume that we
know the distribution of under , the critical region is defined by
those values of which are unlikely (i.e. with probability of less than
) to be observed under the null hypothesis. Note that although T(X) is a
random variable, K is a well-defined region, see Fig. 10.1 for an example.

(6) Calculate based on the realized sample values


.

(7) Decision rule: if t(x) falls into the critical region K, the null hypothesis
is rejected. The alternative hypothesis is then statistically proven. If t(x)
falls outside the critical region, is not rejected.

The next two paragraphs show how to arrive at the test decisions from step
7 in a different way. Readers interested in an example of a statistical test
may jump to Sect. 10.3.1 and possibly also Example 10.3.1.

10.2.6 Test Decisions Using the p-Value


Statistical software usually does not show us all the steps of hypothesis testing as
outlined in Sect. 10.2.5. It is common that instead of calculating and reporting
the critical values, the test statistic is printed together with the so-called p-value .
It is possible to use the p-value instead of critical regions for making test
decisions. The p-value of the test statistic is defined as follows:

It can be interpreted as the probability of observing results equal to, or more


extreme than those actually observed if the null hypothesis was true. Then, the
decision rule is
is rejected if the p-value is smaller than the prespecified significance
level . Otherwise, cannot be rejected.

Example 10.2.2
Assume that we are dealing with a two-sided test and assume further that the test
statistic T(x) is N(0, 1)-distributed under . The significance level is . If
we observe, for example, , then the p-value is . This can be
calculated in R as

because pnorm() is used to calculate P(Z ≤ z), and therefore, 1-pnorm()
can be used to calculate P(Z > z). We have to multiply by two because we are
dealing with a two-sided hypothesis. The result is . Therefore,
is rejected. The one-sided p-value is half of the two-sided p-value, i.e.
, and is not necessarily reported by R. It is
therefore important to look carefully at the R output when dealing with one-sided
hypotheses.

The p-value is sometimes also called the significance, although we prefer the
term p-value. We use the term significance only in the context of a test result: a
test is (statistically) significant if (and only if) H0 can be rejected.
Unfortunately, the p-value is often over-interpreted: both a test and the p-
value can only provide a yes/no decision: either H0 is rejected or not.
Interpreting the p-value as the probability that the null hypothesis is true is
wrong! It is also incorrect to say that the p-value is the probability of making an
error during the test decision. In our (frequentist) context, hypotheses are true or
false and no probability is assigned to them. It can also be misleading to speak of
“highly significant” results if the p-value is very small. A last remark: the p-
value itself is a random variable: under the null hypothesis, it follows a uniform
distribution, i.e. p ~ U(0, 1).

10.2.7 Test Decisions Using Confidence Intervals


There is an interesting and useful relationship between confidence intervals and
hypothesis tests. If the null hypothesis H0 is rejected at the significance level α,
then there exists a (1 − α) · 100 % confidence interval which yields the same
conclusion as the test: if the appropriate confidence interval does not contain the
value targeted in the hypothesis, then H0 is rejected. We call this duality. For
example, recall Example 10.1.2 where we were interested in whether the average
change in blood pressure for drug B is higher than for drug A, i.e. H1: μB > μA.
This hypothesis is equivalent to H1: μB − μA > 0. In the following section, we
develop tests to decide whether the difference μB − μA is statistically significant or not.
Alternatively, we could construct a (1 − α) · 100 % confidence interval for the
difference μB − μA and evaluate whether the interval contains 0 or not; if yes,
we accept H0; otherwise, we reject it. For some of the tests introduced in the
following section, we refer to the confidence intervals which lead to the same
results as the corresponding test.

10.3 Parametric Tests for Location Parameters


10.3.1 Test for the Mean When the Variance is Known
(One-Sample Gauss Test)
We develop a hypothesis test to test whether the unknown mean (expectation) μ
of a N(μ, σ²)-distributed random variable X either differs from a specific value
μ0 or is smaller (or greater) than μ0. We assume that the variance σ² is
known. We apply the scheme of Sect. 10.2.5 step by step to develop the test
procedure and then give an illustrative example.
1. Distributional assumption: The random variable X follows a N(μ, σ²)-
distribution with known variance σ². We assume that an i.i.d. random sample
X1, X2, …, Xn is drawn, where the Xi's follow the same distribution as X
(i = 1, 2, …, n).
2. Define any of the following sets of hypotheses H0 and H1:
(a) H0: μ = μ0 versus H1: μ ≠ μ0 (two-sided),
(b) H0: μ ≤ μ0 versus H1: μ > μ0 (one-sided),
(c) H0: μ ≥ μ0 versus H1: μ < μ0 (one-sided).
3. Specify the probability of a type I error α: often α = 0.05 is chosen.


4. Construct a test statistic: The unknown mean, i.e. the expectation μ, is
usually estimated by the sample mean X̄. We already know that if the Xi's are
i.i.d., then the sample mean is normally distributed. Under the assumption that
H0: μ = μ0 is true,
X̄ ~ N(μ0, σ²/n) under H0,
where "under H0" means the “distribution under H0”. If we standardize the mean
under H0, we get a N(0, 1)-distributed test statistic
T(X) = (X̄ − μ0)/σ · √n ~ N(0, 1) under H0,
see also Theorem 7.3.2. Note that T(X) follows a normal distribution even if
the Xi's are not normally distributed, provided n is large enough, which follows from
the Central Limit Theorem (Appendix C.3). One can conclude that the
distributional assumption from step 1 is thus particularly important for small
samples, but not necessarily important for large samples. As a rule of thumb,
n ≥ 30 is considered to be a large sample. This rule is based on the knowledge
that a t-distribution with more than 30 degrees of freedom gets very close to a
N(0, 1)-distribution.
5. Critical region: Since the test statistic is N(0, 1)-distributed, we get
the following critical regions, depending on the hypothesis:

Case Critical region K

(a) K = (−∞, −z1−α/2) ∪ (z1−α/2, ∞)

(b) K = (z1−α, ∞)

(c) K = (−∞, −z1−α)

Fig. 10.1 Critical region of a two-sided one-sample Gauss test, H0: μ = μ0 versus H1: μ ≠ μ0. The

critical region has probability mass α if H0 is true


For case (a) with H0: μ = μ0 and H1: μ ≠ μ0, we are interested in extreme
values of the test statistic on both tails: very small values and very large values
of the test statistic give us evidence that H0 is wrong (because the statistic is
mainly driven by the difference of the sample mean and the test value μ0 for a
fixed variance), see Fig. 10.1. In such a two-sided test, when the distribution of
the test statistic is symmetric, we divide the critical region into two equal parts
and assign each region of size α/2 to the left and right tails of the distribution.
For α = 0.05, 2.5 % of the most extreme values towards the right end of the
distribution and 2.5 % of the most extreme values towards the left end of the
distribution give us enough evidence that H0 is wrong and can be rejected and
that H1 is accepted. It is also clear why α is the probability of a type I error: the
most extreme values in the two tails together have 5 % probability and are just
the probability that the test statistic falls into the critical region although H0 is
true. Also, these areas are those which have the least probability of occurring if
H0 is true. For α = 0.05, we get z0.975 = 1.96.
For case (b), only one direction is of interest. The critical region lies on the
right tail of the distribution of the test statistic. A very large value of the test
statistic has a low probability of occurrence if H0 is true. An illustration is given
in Fig. 10.2: for α = 0.05, we get z0.95 = 1.64 and any values greater than 1.64 are
unlikely to be observed under H0. Analogously, the critical region for case (c) is
constructed. Here, the shaded area (critical region) is on the left-hand side. In
this case, for α = 0.05, we get −z0.95 = −1.64.

Fig. 10.2 Critical region of a one-sided one-sample Gauss test, H0: μ ≤ μ0 versus H1: μ > μ0. The

critical region has probability mass α if H0 is true


6. Realization of the test statistic: For an observed sample x1, x2, …, xn, the
arithmetic mean
x̄ = (1/n) Σ xi
is used to calculate the realized (observed) test statistic as
t(x) = (x̄ − μ0)/σ · √n.
7. Test decision: If the realized test statistic from step 6 falls into the critical
region, H0 is rejected (and therefore, H1 is statistically proven). Table 10.1
summarizes the test decisions depending on t(x) and the quantiles defining the
appropriate critical regions.
Table 10.1 Rules to make test decisions for the one-sample Gauss test (and the two-sample Gauss test, the
one-sample approximate binomial test, and the two-sample approximate binomial test—which are all
discussed later in this chapter)

Case Reject H0, if

(a) |t(x)| > z1−α/2

(b) t(x) > z1−α

(c) t(x) < −z1−α

Example 10.3.1
A bakery supplies loaves of bread to supermarkets. The stated selling weight
(and therefore the required minimum expected weight) is μ0 = 2 kg. However, not
every package weighs exactly 2 kg because there is variability in the weights. It
is therefore important to find out if the average weight of the loaves is
significantly smaller than 2 kg. The weight X (measured in kg) of the loaves is
assumed to be normally distributed. We assume that the variance σ² is
known from experience. A supermarket draws a sample of n loaves and
weighs them. The average weight is calculated as x̄ = 1.97 kg. Since the
supermarket wants to be sure that the weights are, on average, not lower than
2 kg, a one-sided hypothesis is appropriate and is formulated as H0: μ ≥ 2 kg
versus H1: μ < 2 kg. The significance level is specified as α = 0.05, and
therefore, the critical value is −z0.95 = −1.64. The test statistic is calculated as
described in step 6 above.
The null hypothesis is not rejected, since the realized test statistic is greater than −1.64.

Interpretation: The sample average x̄ = 1.97 kg is below the target value of
μ0 = 2 kg. But there is not enough evidence to reject the hypothesis that the sample
comes from a N(2, σ²)-distributed population. The probability of observing a
sample of size n with an average of at most 1.97 in a N(2, σ²)-distributed
population is greater than α = 0.05. The difference between x̄ = 1.97 kg and
the target value μ0 = 2 kg is not statistically significant.
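A minimal R sketch of this test is given below. Since the sample size and the known standard deviation are not stated explicitly above, the values n = 20 and sigma = 0.1 are assumed purely for illustration; only the sample mean 1.97 and the hypothesized value 2 are taken from the example.

xbar <- 1.97; mu0 <- 2
sigma <- 0.1; n <- 20                   # assumed values (illustration only)
t.obs <- (xbar - mu0)/sigma * sqrt(n)   # realized Gauss test statistic
t.obs                                   # approximately -1.34
qnorm(0.05)                             # critical value, approximately -1.64

With these (assumed) numbers, t.obs is larger than the critical value, so H0 would not be rejected, in line with the conclusion above.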

Remark 10.3.1
The Gauss test assumes the variance to be known, which is often not the case in
practice. The t-test (Sect. 10.3.2) does not require the variance to be known;
instead, the variance is estimated from the sample. The t-test is therefore commonly employed when testing hypotheses
about the mean. Its usage is outlined below. In R, the command Gauss.test
from the library compositions offers an implementation of the Gauss test.

10.3.2 Test for the Mean When the Variance is Unknown


(One-Sample t-Test)
If the variance σ² is unknown, hypotheses about the mean μ of a normal random
variable X ~ N(μ, σ²) can be tested in a similar way to the one-sample Gauss test.
The difference is that the unknown variance is estimated from the sample. An
unbiased estimator of σ² is the sample variance
S²X = 1/(n − 1) · Σ (Xi − X̄)².
The test statistic is therefore
T(X) = (X̄ − μ0)/SX · √n,
which follows a t-distribution with n − 1 degrees of freedom if H0 is true, as we
know from Theorem 8.3.2.
Critical regions and test decisions
Since T(X) follows a t-distribution under H0, the critical regions refer to the
regions of the t-distribution which are unlikely to be observed under H0:

Case Critical region K

(a) K = (−∞, −t1−α/2;n−1) ∪ (t1−α/2;n−1, ∞)

(b) K = (t1−α;n−1, ∞)

(c) K = (−∞, −t1−α;n−1)

The hypothesis is rejected if the realized test statistic, i.e.
t(x) = (x̄ − μ0)/sX · √n,
falls into the critical region. The critical regions are based on the appropriate
quantiles of the t-distribution with degrees of freedom, as outlined in
Table 10.2.
Table 10.2 Rules to make test decisions for the one-sample t-test (and the two-sample t-test, and the paired
t-test, both explained below)
Case Reject H0, if

(a) |t(x)| > t1−α/2;n−1

(b) t(x) > t1−α;n−1

(c) t(x) < −t1−α;n−1

Example 10.3.2
We again consider Example 10.3.1. Now we assume that the variance of the
loaves is unknown. Suppose a random sample of size n = 20 yields an arithmetic
mean x̄ and a sample variance s²X. We want to test
whether this result contradicts the two-sided hypothesis H0: μ = 2, that is case
(a). The significance level is fixed at α = 0.05. For the realized test statistic t(x),
we calculate
t(x) = (x̄ − 2)/sX · √20.
H0 is not rejected since |t(x)| < 2.093 = t0.975;19, where the quantiles ±2.093 are
defining the critical region (see Table C.2 or use R: qt(0.975,19)). The
same results can be obtained in R using the t.test() function, see Example
10.3.3 for more details. Or, we can directly calculate the (two-sided) p-value as
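A sketch of this calculation is given below; the value of the realized test statistic t.obs is assumed here purely for illustration and has to be replaced by the value obtained above:

n <- 20
t.obs <- -1.60                       # assumed realized test statistic (illustration only)
2*(1 - pt(abs(t.obs), df = n - 1))   # two-sided p-value of the one-sample t-test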

This yields a p-value of 0.1260951, which is not smaller than α = 0.05, and
therefore, H0 is not rejected.

10.3.3 Comparing the Means of Two Independent


Samples
In a two-sample problem, we may be interested in comparing the means of two
independent samples. Assume that we have two samples of two normally
distributed variables X ~ N(μX, σ²X) and Y ~ N(μY, σ²Y) of size n1 and n2, i.e.
X1, X2, …, Xn1 are i.i.d. with the same distribution as X and Y1, Y2, …, Yn2 are i.i.d.
with the same distribution as Y. We can specify the following hypotheses:

Case Null hypothesis Alternative hypothesis

(a) μX = μY μX ≠ μY Two-sided test problem

(b) μX ≤ μY μX > μY One-sided test problem

(c) μX ≥ μY μX < μY One-sided test problem

We distinguish another three cases:

1. σ²X and σ²Y are known.

2. σ²X and σ²Y are unknown, but they are assumed to be equal, i.e. σ²X = σ²Y.

3. Both σ²X and σ²Y are unknown and unequal (σ²X ≠ σ²Y).

Case 1: The variances are known (two-sample Gauss test).


If the null hypothesis H0: μX = μY is true, then, using the usual rules for the
normal distribution and the independence of the samples,
X̄ ~ N(μX, σ²X/n1)
and
Ȳ ~ N(μY, σ²Y/n2).
It follows that the test statistic
T(X, Y) = (X̄ − Ȳ)/√(σ²X/n1 + σ²Y/n2)   (10.1)
follows a standard normal distribution, T(X, Y) ~ N(0, 1). The realized test
statistic is
t(x, y) = (x̄ − ȳ)/√(σ²X/n1 + σ²Y/n2).   (10.2)

The test procedure is identical to the procedure of the one-sample Gauss test
introduced in Sect. 10.3.1; that is, the test decision is based on Table 10.1.
Case 2: The variances are unknown, but equal (two-sample t -test).
We denote the unknown variance of both distributions as σ² (i.e. both the
populations are assumed to have variance σ²). We estimate σ² by using the
pooled sample variance where each sample is assigned weights relative to the
sample size:
S² = [(n1 − 1)S²X + (n2 − 1)S²Y] / (n1 + n2 − 2).   (10.3)
The test statistic
T(X, Y) = (X̄ − Ȳ)/S · √[n1·n2/(n1 + n2)]   (10.4)
with S as in (10.3) follows a t-distribution with n1 + n2 − 2 degrees of freedom
if H0 is true. The realized test statistic is
t(x, y) = (x̄ − ȳ)/s · √[n1·n2/(n1 + n2)].   (10.5)
The test procedure is identical to the procedure of the one-sample t-test; that
is, the test decision is based on Table 10.2.
Case 3: The variances are unknown and unequal (Welch test).
We test H0: μX = μY versus H1: μX ≠ μY given σ²X ≠ σ²Y and both σ²X and σ²Y
are unknown. This problem is also known as the Behrens–Fisher problem and is
the most frequently used test when comparing two means in practice. The test
statistic can be written as
T(X, Y) = (X̄ − Ȳ)/√(S²X/n1 + S²Y/n2),   (10.6)
which is approximately t-distributed with v degrees of freedom:
v = (s²x/n1 + s²y/n2)² / [ (s²x/n1)²/(n1 − 1) + (s²y/n2)²/(n2 − 1) ],   (10.7)
where s²x and s²y are the estimated values of σ²X = Var(X) and σ²Y = Var(Y),
respectively. The test procedure, using the observed test
statistic
t(x, y) = (x̄ − ȳ)/√(s²x/n1 + s²y/n2),   (10.8)

is identical to the procedure of the one-sample t-test; that is, the test decision
is based on Table 10.2 except that the degrees of freedom are not but v. If v
is not an integer, it can be rounded off to an integer value.

Example 10.3.3
A small bakery sells cookies in packages of 500 g. The cookies are handmade
and the packaging is either done by the baker himself or his wife. Some
customers conjecture that the wife is more generous than the baker. One
customer does an experiment: he buys packages of cookies packed by the baker
and his wife on 16 different days and weighs the packages. He gets the following
two samples (one for the baker, one for his wife).

Weight (wife) (X) 512 530 498 540 521 528 505 523
Weight (baker) (Y) 499 500 510 495 515 503 490 511

We want to test whether the complaint of the customers is justified. Let us


start with the following simple hypotheses:

i.e. we only want to test whether the weights are different, not that the wife is
making heavier cookie packages. Since the variances are unknown, we assume
that case 3 is the right choice. We calculate and obtain x̄ = 519.625, ȳ = 502.875,
s²x ≈ 192.27, and s²y ≈ 73.55. The test statistic is:
t(x, y) = (519.625 − 502.875)/√(192.27/8 + 73.55/8) ≈ 2.91.
The degrees of freedom are:
v = (192.27/8 + 73.55/8)² / [(192.27/8)²/7 + (73.55/8)²/7] ≈ 11.67.
Since |t(x, y)| = 2.91 > 2.18 = t0.975;11.67, it follows that H0 is rejected. Therefore,
the difference between the two means is statistically significant. This means that the mean weight of the wife’s
packages is different from the mean weight of the baker’s packages. Let us refine
the hypothesis and try to find out whether the wife’s packages have a higher
mean weight. The hypotheses are now:
H0: μX ≤ μY versus H1: μX > μY.
The test statistic remains the same but the critical region and the degrees of
freedom change. Thus, H0 is rejected if t(x, y) > t0.95;v. Using t0.95;11.67 ≈ 1.78
and t(x, y) = 2.91, it follows that the null hypothesis can be rejected. The mean
weight of the wife’s packages is greater than the mean weight of the baker’s
packages.
In R, we would have obtained the same result using the t.test command:
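A possible call is sketched below; the weights are taken from the table above and the one-sided alternative corresponds to the refined hypothesis:

yw <- c(512, 530, 498, 540, 521, 528, 505, 523)   # weights of the wife's packages (x)
yb <- c(499, 500, 510, 495, 515, 503, 490, 511)   # weights of the baker's packages (y)
t.test(yw, yb, alternative = "greater")           # Welch test; var.equal = FALSE is the default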
Note that we have to specify the alternative hypothesis under the option
alternative. The output shows us the test statistic (2.9058), the degrees of
freedom (11.672), the alternative hypothesis—but not the decision rule. We
know that H0 is rejected if t(x, y) > t0.95;v, so the decision is easy in this case: we
simply have to calculate the quantile using qt(0.95,12) in R. A simpler way to
arrive at the same decision is to use the p-value. We know that H0 is rejected if
the p-value is smaller than α = 0.05, which is the case in this example. It is also worthwhile mentioning that R
displays the hypotheses slightly differently from ours: our alternative hypothesis
is μX > μY, which is identical to the statement μX − μY > 0, as shown by R, see also
Sect. 10.2.2.
If we specify two.sided as an alternative (which is the default), a
confidence interval for the mean difference is also part of the output:
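A sketch of this call, using the vectors yw and yb defined in the previous code snippet:

t.test(yw, yb, alternative = "two.sided")   # the output includes a 95 % confidence interval for the mean difference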

It can be seen that the confidence interval of the difference does not cover
the “0”. Therefore, the null hypothesis is rejected. This is the duality property
referred to earlier in this chapter: the test decision is the same, no matter whether
one evaluates (i) the confidence interval, (ii) the test statistic, or (iii) the p-value.
Any kind of t-test can be calculated with the t.test command: for
example, the two-sample t-test requires specifying the option
var.equal=TRUE, while the Welch test is calculated when the (default) option
var.equal=FALSE is set. We can also conduct a one-sample t-test. Suppose
we are interested in whether the mean weight of the wife’s packages of cookies
is greater than 500 g; then, we could test the hypotheses
H0: μ ≤ 500 versus H1: μ > 500.
In R, we simply have to specify the reference value (mu = 500) and the one-sided alternative:
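One possible call, again using the vector yw from above, is:

t.test(yw, mu = 500, alternative = "greater")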


which gives us the test statistic, the degrees of freedom, and the p-value on which the test decision can be based.

10.3.4 Test for Comparing the Means of Two Dependent


Samples (Paired t-Test)
Suppose there are two dependent continuous random variables X and Y with
E(X) = μX and E(Y) = μY. They could be dependent because we measure the same
variable twice on the same subjects at different times. Typically, this is the case
in pre–post experiments, for example when we measure the weight of a person
before starting a special diet and after finishing the diet; or when evaluating
household expenditures on electronic appliances in two consecutive years. We
then say that the samples are paired, or dependent. Since the same variable is
measured twice on the same subject, it makes sense to calculate a difference
between the two respective values. Let D = X − Y denote the random variable
“difference of X and Y”. If H0: μX = μY is true, then the expected difference is
zero, and we get E(D) = μD = 0. This means testing H0: μX = μY is identical to
testing H0: μD = 0. We further assume that D is normally distributed if H0:
μX = μY is true (or equivalently if H0: μD = 0 is true), i.e. D ~ N(0, σ²D). For a
random sample D1, D2, …, Dn of the differences, the test statistic
T(X, Y) = T(D) = D̄/SD · √n   (10.9)
is t-distributed with n − 1 degrees of freedom. The sample mean is D̄ = (1/n) Σ Di
and the sample variance is
S²D = 1/(n − 1) · Σ (Di − D̄)²,
which is an estimator of σ²D. The realized test statistic is thus
t(d) = d̄/sd · √n,   (10.10)
where d̄ = (1/n) Σ di and s²d = 1/(n − 1) · Σ (di − d̄)².
The two-sided test H0: μD = 0 versus H1: μD ≠ 0 and the one-sided tests H0: μD ≤ 0
versus H1: μD > 0 or H0: μD ≥ 0 versus H1: μD < 0 can be derived as in
Sect. 10.3.2; that is, the test decision is based on Table 10.2. In fact, the paired t-
test is a one-sample t-test on the differences of X and Y.

Example 10.3.4
In an experiment, students have to solve different tasks before and after
drinking a cup of coffee. Let Y and X denote the random variables “number of
points before/after drinking a cup of coffee”. Assume that a higher number of
points means that the student is performing better. Since the test is repeated on
the same students, we have a paired sample. The data is given in the following
table:

i yi (before) xi (after) di = xi − yi (di − d̄)²

1 4 5 1 0
2 3 4 1 0
3 5 6 1 0
4 6 7 1 0
5 7 8 1 0
6 6 7 1 0
7 4 5 1 0
8 7 8 1 0
9 6 5 −1 4
10 2 5 3 4
Total 10 8

We calculate d̄ = 10/10 = 1 and s²d = 8/9 ≈ 0.89, respectively. For the realized
test statistic t(d), using α = 0.05, we get
t(d) = d̄/sd · √10 ≈ 3.35 > 1.83 = t0.95;9,
such that H0: μD ≤ 0 is rejected and H1: μD > 0 is accepted. We can


conclude (for this example) that drinking coffee significantly increased the
problem-solving capacity of the students.
In R, we would have obtained the same results using the t.test function
and specifying the option paired=TRUE:
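A sketch of this call is given below; the scores are taken from the table above, and whether the option alternative = "greater" is added depends on the (one- or two-sided) hypothesis of interest:

before <- c(4, 3, 5, 6, 7, 6, 4, 7, 6, 2)   # points before drinking coffee (y)
after  <- c(5, 4, 6, 7, 8, 7, 5, 8, 5, 5)   # points after drinking coffee (x)
t.test(after, before, paired = TRUE)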

We can make the test decision using the R output in three different ways:

(i) We compare the test statistic (3.35) with the critical value (1.83,
obtained via qt(0.95,9)).

(ii) We evaluate whether the p-value (0.008468) is smaller than the
significance level α = 0.05.

(iii) We evaluate whether the confidence interval for the mean difference
covers “0” or not.

10.4 Parametric Tests for Probabilities


10.4.1 One-Sample Binomial Test for the Probability
Test construction and hypotheses.
Let X be a Bernoulli B(1; p) random variable with the two possible outcomes
1 and 0, which indicate occurrence and non-occurrence of an event of interest A.
The probability for A in the population is p. From the sample X1, X2, …, Xn
of independent B(1; p)-distributed random variables, we calculate the mean
(relative frequency) as p̂ = (1/n) Σ Xi, which is an unbiased estimate of p. The
following hypotheses may thus be of interest:

Case Null hypothesis Alternative hypothesis


(a) Two-sided problem

(b) One-sided problem

(c) One-sided problem

In the following, we describe two possible solutions, one exact approach and
an approximate solution. The approximate solution is based on the
approximation of the binomial distribution by the normal distribution, which is
appropriate if n is sufficiently large and the condition holds (i.e. p is
neither too small nor too large). First, we present the approximate solution and
then the exact one.
Test statistic and test decisions.

(a) Approximate binomial test. We define the standardized test statistic as
T(X) = (p̂ − p0)/√[p0(1 − p0)] · √n.   (10.11)
It holds approximately that T(X) ~ N(0, 1), given that the conditions that (i)
n is sufficiently large and (ii) p0 is not too close to 0 or 1 are satisfied. The test can then
be conducted along the lines of the Gauss test in Sect. 10.3.1; that is, the
test decision is based on Table 10.1.

Example 10.4.1
We return to Example 10.1.1. Let us assume that a representative sample of size
n = 2000 has been drawn from the population of eligible voters, from which 700
(35 %) have voted for the party of interest P. The research hypothesis (which has
to be stated as H1) is that more than 30 % (i.e. p > 0.3) of the eligible voters cast
their votes for party P. The sample is in favour of H1 because p̂ = 0.35 > 0.3, but to
draw conclusions for the proportion of voters of party P in the population, we
have to conduct a binomial test. Since n is large and p0 = 0.3 is neither too small
nor too large, the assumptions for the use of the test
statistic (10.11) are satisfied. We can write down the realized test statistic as

Using α = 0.05, it follows that the realized test statistic is larger than the critical
value z0.95 = 1.64, and thus, the null
hypothesis can be rejected. Therefore, the result is statistically
significant; that is, the proportion of votes for party P is greater than 30 %.
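A short R sketch of this calculation (the sample size n = 2000 follows from 700 voters corresponding to 35 %):

n <- 2000; p.hat <- 700/n; p0 <- 0.3
t.obs <- (p.hat - p0)/sqrt(p0*(1 - p0)/n)   # realized test statistic of (10.11)
t.obs                                       # approximately 4.88
qnorm(0.95)                                 # critical value, approximately 1.64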

(b) The exact binomial test can be constructed using the knowledge that under
, (i.e. the number of successes) follows a binomial
distribution. In fact, we can use Y directly as the test statistic:

The observed test statistic is y = Σ xi. For the two-sided case (a), the two
critical numbers cl and cr (cl < cr) which define the critical region have to
be found such that
P(Y ≤ cl) ≤ α/2 and P(Y ≥ cr) ≤ α/2 under H0.
The null hypothesis is rejected if the test statistic, i.e. Y, is greater than or
equal to cr or less than or equal to cl. For the one-sided case, a critical
number c has to be found such that
P(Y ≤ c) ≤ α under H0
for hypotheses of type (b) and
P(Y ≥ c) ≤ α under H0
for hypotheses of type (c). If Y is less than the critical value c (for case (b))
or greater than the critical value (for case (c)), the null hypothesis is
rejected.

Example 10.4.2
We consider again Example 10.1.1 where we looked at the population of eligible
voters, from which 700 (35 %) have voted for the party of interest P. The
observed test statistic is y = 700 and the alternative hypothesis is H1: p > 0.3,
as in case (c). There are at least two ways in which we can obtain
the results:
(i) Long way: We can calculate the test statistic and compare it to the critical
region. To get the critical region, we search for the smallest c such that
P(Y ≥ c) ≤ 0.05 under H0: p = 0.3,
which equates to finding the 95 % quantile of the B(2000, 0.3) distribution
and can be calculated in R as:
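One possible call (the sample size n = 2000 follows from 700 voters corresponding to 35 %):

qbinom(p = 0.95, size = 2000, prob = 0.3)   # approximately 634; y = 700 clearly exceeds it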

Since y = 700 is larger than this critical value, we reject the null hypothesis. As in Example


10.4.1, we conclude that there is enough evidence that the proportion of
votes for party P is greater than 30 %.

(ii) Short way: The above result can be easily obtained in R using the
binom.test() command. We need to specify the number of “successes” (here:
700), the number of “failures” (here: 1300), and the alternative
hypothesis:
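One possible call is:

binom.test(c(700, 1300), p = 0.3, alternative = "greater")   # 700 successes, 1300 failures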

Both the p-value (which is smaller than α = 0.05) and the confidence
interval (for which we do not show the calculation) confirm the rejection of
the null hypothesis. Note that specifying the number of successes and the total
sample size directly, i.e. binom.test(x = 700, n = 2000, p = 0.3, alternative = "greater"),
returns the same result.

10.4.2 Two-Sample Binomial Test


Test construction and hypotheses.
We consider now the case of two independent i.i.d. samples X11, …, X1n1 and
X21, …, X2n2 from Bernoulli distributions with parameters p1 and p2, respectively.
The sums
Y1 = Σ X1i ~ B(n1; p1) and Y2 = Σ X2i ~ B(n2; p2)
follow binomial distributions. One of the following hypotheses may be of
interest:
interest:

Case Null hypothesis Alternative hypothesis


(a) Two-sided problem

(b) One-sided problem

(c) One-sided problem

Similar to the one-sample case, both exact and approximate tests exist. Here,
we only present the approximate test. The exact test of Fisher is presented in
Appendix C.5, p. 428. Let n1 and n2 denote the sample sizes. Then, Y1/n1 and Y2/n2
are approximately normally distributed:
Y1/n1 ≈ N(p1, p1(1 − p1)/n1) and Y2/n2 ≈ N(p2, p2(1 − p2)/n2).
Their difference D = Y1/n1 − Y2/n2
is normally distributed too under H0 (given H0: p1 = p2 holds). Since the
probabilities p1 and p2 are identical under H0, we can pool the two samples and
estimate p by
p̂ = (Y1 + Y2)/(n1 + n2).   (10.12)
Test statistic and test decision.
The test statistic
T(X, Y) = D / √[ p̂(1 − p̂)(1/n1 + 1/n2) ]   (10.13)
follows a N(0, 1)-distribution if n1 and n2 are sufficiently large and p is not
near the boundaries 0 and 1 (one could use, for example, again the condition
from the one-sample case, with p replaced by its pooled estimate p̂). The realized
test statistic can be calculated using
the observed difference p̂1 − p̂2. The test can be conducted for the one-sided
and the two-sided case as the Gauss test introduced in Sect. 10.3.1; that is, the
decision rules from Table 10.1 can be applied.

Example 10.4.3
Two competing lotteries claim that every fourth lottery ticket wins. Suppose we
want to test whether the probabilities of winning are different for the two
lotteries, i.e. H0: pA = pB and H1: pA ≠ pB. We have the following data

n Winning Not winning


Lottery A 63 14 49
Lottery B 45 13 32

We can estimate the probabilities of a winning ticket for each lottery, as well
as the respective difference, as
p̂A = 14/63 ≈ 0.222, p̂B = 13/45 ≈ 0.289, p̂A − p̂B ≈ −0.067.
Under H0, an estimate for p following (10.12) is
p̂ = (14 + 13)/(63 + 45) = 27/108 = 0.25.
The test statistic can be calculated as
t(x, y) = −0.067 / √[0.25 · 0.75 · (1/63 + 1/45)] ≈ −0.79.
H0 is not rejected since |t(x, y)| = 0.79 < 1.96 = z0.975. Thus, there is no


statistical evidence for different winning probabilities for the two lotteries. These
hypotheses can be tested in R using the Test of Fisher, see Appendix C.5, p. 428,
for more details.
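A short R sketch of the manual calculation above, using the counts from the table:

x1 <- 14; n1 <- 63                        # lottery A
x2 <- 13; n2 <- 45                        # lottery B
p1 <- x1/n1; p2 <- x2/n2
p  <- (x1 + x2)/(n1 + n2)                 # pooled estimate, cf. (10.12)
(p1 - p2)/sqrt(p*(1 - p)*(1/n1 + 1/n2))   # realized test statistic, cf. (10.13): approx. -0.79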

10.5 Tests for Scale Parameters


There are various tests available to test hypotheses about scale parameters. Such
tests are useful when one is interested in the dispersion of a variable, for
example in quality control where the variability of a process may be of interest.
One-sample tests of hypotheses for the variance of a normal distribution, e.g.
hypotheses such as H0: σ² = σ²0, can be tested by the χ²-test for the variance,
see Appendix C.5, p. 430. Two-sample problems can be addressed by the F-test
(which is explained in Appendix C.5, p. 431); or by other tests such as the
Levene test or Bartlett’s test, which are also available in R (leveneTest in the
package car, bartlett.test in the base distribution of R).
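As an illustration (a sketch only, reusing the cookie weights from Example 10.3.3), the F-test and Bartlett's test can be carried out as follows:

yw <- c(512, 530, 498, 540, 521, 528, 505, 523)
yb <- c(499, 500, 510, 495, 515, 503, 490, 511)
var.test(yw, yb)              # F-test for the ratio of two variances
bartlett.test(list(yw, yb))   # Bartlett's test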

10.6 Wilcoxon–Mann–Whitney (WMW) U-Test


Test construction and hypotheses.
The WMW U-test is often proposed as an alternative to the t-test because it
also focuses on location but not on the expected value μ. It is a nonparametric
test and useful in situations where skewed distributions are compared with each
other. We consider two independent random samples X1, X2, …, Xn1 and
Y1, Y2, …, Yn2 from two populations with observed values x1, x2, …, xn1
and y1, y2, …, yn2, respectively. In this case, the null hypothesis considering
the location can be formulated as
H0: P(X > Y) = P(Y > X) = 1/2.
The null hypothesis can be interpreted in the following way: the probability
that a randomly drawn observation from the first population has a value x that is
greater (or lower) than the value y of a randomly drawn subject from the second
population is 1/2. The alternative hypothesis is then
H1: P(X > Y) ≠ 1/2.
This means we are comparing the entire distribution of two variables. If there
is a location shift in the sense that one distribution is shifted left (or right)
compared with the other distribution, the null hypothesis will be rejected because
this shift can be seen as part of the alternative hypothesis . In
fact, under some assumptions, the hypothesis can even be interpreted as
comparing two medians, and this is what is often done in practice.
Observed test statistic.
To construct the test statistic, it is necessary to merge x1, x2, …, xn1 and
y1, y2, …, yn2 into one sorted sample, usually in ascending order, while keeping
the information which value belongs to which sample. For now, we assume that
all values of the two samples are distinct; that is, no ties are present. Then, each
observation has a rank between 1 and n1 + n2. Let R1+ be the sum of ranks of
the x-sample and let R2+ be the sum of ranks of the y-sample. The test statistic is
defined as U, where U is the minimum of the two values U1, U2,
with
U1 = n1·n2 + n1(n1 + 1)/2 − R1+,   (10.14)
U2 = n1·n2 + n2(n2 + 1)/2 − R2+.   (10.15)
Test decision.
H0 is rejected if U < uα. Here, uα is the critical value derived from the
distribution of U under the null hypothesis. The exact (complex) distribution
can, for example, be derived computationally (in R). We are presenting an
approximate solution together with its implementation in R.
Since U1 + U2 = n1·n2, it is sufficient to compute only one of them, say U1,
and obtain the other as U2 = n1·n2 − U1 (U1 or U2 is chosen such that it is
calculated for the sample with the lower sample size). For sufficiently large
samples, one can use the approximation
t(x, y) = (U − n1·n2/2) / √[ n1·n2·(n1 + n2 + 1)/12 ]   (10.16)
as the test statistic, which is approximately N(0, 1)-distributed under H0. For
two-sided hypotheses, H0 is rejected if |t(x, y)| > z1−α/2;
for one-sided hypotheses H0 is rejected if |t(x, y)| > z1−α. In the case of ties, the
denominator of the test statistic in (10.16) has to be replaced by a tie-corrected version,
where G is the number of different (groups of) ties and tj denotes the number
of tied ranks in tie group j.

Example 10.6.1
In a study, the reaction times (in seconds) to a stimulus were measured for two
groups. One group drank a strong coffee before the stimulus and the other group
drank only the same amount of water. There were 9 study participants in the
coffee group and 10 participants in the water group. The following reaction
times were recorded:

Reaction time 1 2 3 4 5 6 7 8 9 10
Coffee group (C) 3.7 4.9 5.2 6.3 7.4 4.4 5.3 1.7 2.9
Water group (W) 4.5 5.1 6.2 7.3 8.7 4.2 3.3 8.9 2.6 4.8

We test with the U-test whether there is a location difference between the
two groups. First, the ranks of the combined sample are calculated as:

1 2 3 4 5 6 7 8 9 10 Total
Value (C) 3.7 4.9 5.2 6.3 7.4 4.4 5.3 1.7 2.9
Rank (C) 5 10 12 15 17 7 13 1 3 83
Value (W) 4.5 5.1 6.2 7.3 8.7 4.2 3.3 8.9 2.6 4.8
Rank (W) 8 11 14 16 18 6 4 19 2 9 107

With R1+ = 83 and R2+ = 107, we get
U1 = 9 · 10 + 9 · 10/2 − 83 = 52 and U2 = 9 · 10 + 10 · 11/2 − 107 = 38,
and therefore U = 38. With n1 = 9 and n2 = 10,
t(x, y) = (38 − 45)/√(90 · 20/12) ≈ −0.57.
Since |t(x, y)| = 0.57 < 1.96 = z0.975, the null hypothesis cannot be rejected;


that is, there is no statistical evidence that the two groups have different reaction
times.
In R, one can use the wilcox.test command to obtain the results:
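A possible call, with the reaction times taken from the table above:

coffee <- c(3.7, 4.9, 5.2, 6.3, 7.4, 4.4, 5.3, 1.7, 2.9)
water  <- c(4.5, 5.1, 6.2, 7.3, 8.7, 4.2, 3.3, 8.9, 2.6, 4.8)
wilcox.test(coffee, water)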
From the output, we can see that the null hypothesis is not rejected because
the p-value is larger than α = 0.05. The displayed test statistic is W, which equates to our
statistic U. The alternative hypothesis in R is framed as a location shift, an
interpretation which has already been given earlier in the chapter. Note that the
test also suggests that the medians of the two samples are not statistically
different.

10.7 χ²-Goodness-of-Fit Test


Test construction.
The χ²-goodness-of-fit test is one of the most popular tests for testing the
goodness of fit of the observed data to a distribution. The construction principle
is very general and can be used for variables of any scale. The test statistic is
derived such that the observed absolute frequencies are compared with the
expected absolute frequencies under the null hypothesis H0.

Example 10.7.1
Consider an experiment where a die is rolled n times. Under the null
hypothesis H0, we assume that the die is fair, i.e. pi = 1/6, where
i = 1, 2, …, 6. We could have also said that H0 is the hypothesis that the rolls are
following a discrete uniform distribution. Thus, the expected absolute
frequencies under H0 are npi = n/6, while the observed frequencies in the
sample are N1, N2, …, N6. The Ni generally deviate from npi. The χ²-statistic is
based on the squared differences, Σ (Ni − npi)², and becomes large as the
differences between the observed and the expected frequencies become larger.
The χ²-test statistic is a modification of this sum by scaling each squared
difference by the expected frequencies, npi, and is explained below.

With a nominal variable, we can proceed as in Example 10.7.1. If the scale of the
variable is ordinal or continuous, the number of different values can be large.
Note that in the most extreme case, we can have as many different values as
observations (n), leading to Ni = 1 for all i. Then, it is necessary to
group the data into k intervals before applying the χ²-test. The reason is that the
general theory of the χ²-test assumes that the number k (which was 6 in Example
10.7.1 above) is fixed and does not grow with the number of observations n; that
is, the theory says that the χ²-test only works properly if k is fixed and n is large.
For this reason, we group the sample x1, x2, …, xn into k classes as shown
in Sect. 2.1.

Class 1 2 … k Total
Number of observations n1 n2 … nk n

The choice of the class intervals is somewhat arbitrary. As a rule of thumb,
npi ≥ 5 should hold for most class intervals. The general hypotheses can be
formulated in the form of distribution functions:
H0: F(x) = F0(x) versus H1: F(x) ≠ F0(x).
Test statistic.
The test statistic is defined as
t(x) = Σ (Ni − npi)²/(npi), summed over i = 1, 2, …, k.   (10.17)
Here,
Ni (i = 1, 2, …, k) are the absolute frequencies of observations of the sample
in class i; Ni is a random variable with realization ni in the observed
sample;
pi (i = 1, 2, …, k) are calculated from the distribution under H0, F0(x), and
are the (hypothetical) probabilities that an observation of X falls in class i;
npi are the expected absolute frequencies in class i under H0.
Test decision.
For a significance level α, H0 is rejected if t(x) is greater than the (1 − α)-
quantile of the χ²-distribution with k − 1 − r degrees of freedom, i.e. if
t(x) > ck−1−r;1−α.
Note that r is the number of parameters of F0(x), if these parameters are
estimated from the sample. The χ²-test statistic is only asymptotically χ²-
distributed under H0.

Example 10.7.2
Let F0(x) be the distribution function of the test distribution. If one specifies a
normal distribution with completely specified parameters, or a discrete uniform distribution
with fixed probabilities pi, then r = 0, since no parameters have to be estimated
from the data. Otherwise, if we simply want to test whether the data is generated
from some normal distribution, i.e. the data follows a normal distribution
N(μ, σ²), then μ and σ² may be estimated from the sample by x̄ and s². Then, r = 2
and the number of degrees of freedom is reduced.

Example 10.7.3
Gregor Mendel (1822–1884) conducted crossing experiments with pea plants of
different shape and colour. Let us look at the outcome of a pea crossing
experiment with the following results:

Crossing result Round Round Edged Edged


Yellow Green Yellow Green
Observations 315 108 101 32

Mendel had the hypothesis that the four different types occur in proportions
of 9:3:3:1, that is
p1 = 9/16, p2 = 3/16, p3 = 3/16, p4 = 1/16.
The hypotheses are
H0: p1 = 9/16, p2 = 3/16, p3 = 3/16, p4 = 1/16 versus H1: at least one pi deviates from these values.
With n = 556 observations, the test statistic can be calculated from the
following observed and expected frequencies:

i  Ni  npi

1 315 312.75

2 108 104.25

3 101 104.25

4 32 34.75

The χ²-test statistic is calculated as
t(x) = (315 − 312.75)²/312.75 + (108 − 104.25)²/104.25 + (101 − 104.25)²/104.25 + (32 − 34.75)²/34.75 ≈ 0.47.
Since t(x) = 0.47 < 7.815 = c3;0.95, the null hypothesis is not rejected.


Statistically, there is no evidence that Mendel was wrong with his 9:3:3:1
assumption. In R, the test can be conducted by applying the chisq.test
command:
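A possible call, using the observed frequencies and the hypothesized proportions 9:3:3:1:

chisq.test(x = c(315, 108, 101, 32), p = c(9, 3, 3, 1)/16)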

The output reports the test statistic and the corresponding p-value, and the
critical value can be obtained via qchisq(p = 0.95, df = 3), which yields 7.815.

Remark 10.7.1
In this example, the data was already summarized in a frequency table. For raw
data, the table command can be used to preprocess the data, i.e. we can use
chisq.test(table(var1,var2)).

Another popular goodness-of-fit test is the test of Kolmogorov–Smirnov . There


are two different versions of this test, one for the one-sample scenario and one
for the two-sample scenario. The null hypothesis for the latter is that the two
independent samples come from the same distribution. In R, the command
ks.test() can be used to perform Kolmogorov–Smirnov tests.

10.8 χ²-Independence Test and Other χ²-Tests


In Chap. 4, we introduced different methods to describe the association between
two variables. Several association measures are possibly suitable if the variables
are categorical, for example Cramer’s V, Goodman’s and Kruskal’s γ,
Spearman’s rank correlation coefficient, and the odds ratio. If we are not
interested in the strength of association but rather in finding out whether there is
an association at all, one can use the χ²-independence test.
Test construction.
In the following, we assume that we observe a sample from a bivariate
discrete distribution of two variables X and Y which can be summarized in a
contingency table with absolute frequencies nij (i = 1, 2, …, I; j = 1, 2, …, J),
where the rows refer to the categories of X and the columns to the categories of Y.
Remember that
ni+ = Σj nij is the row sum,
n+j = Σi nij is the column sum, and
n is the total number of observations.
The hypotheses are H0: X and Y are independent versus H1: X and Y are not
independent. If X and Y are independent, then the expected frequencies mij are
m̂ij = ni+ · n+j / n.   (10.18)
Test statistic.
Pearson’s χ²-test statistic was introduced in Chap. 4, Eq. (4.6). It is
T(X, Y) = Σi Σj (Nij − mij)²/mij,
where mij = E(Nij) (expected absolute cell frequencies under H0).
Strictly speaking, mij are the true, unknown expected frequencies under H0 and
are estimated by m̂ij = ni+ n+j/n, such that the realized test statistic equates to
t(x, y) = Σi Σj (nij − m̂ij)²/m̂ij.   (10.19)
Test decision.
The number of degrees of freedom under H0 is (I − 1)(J − 1), where I − 1 are
the parameters which have to be estimated for the marginal distribution of X, and
J − 1 are the number of parameters for the marginal distribution of Y. The test
decision is:
Reject H0 if t(x, y) > c(I−1)(J−1);1−α.
Note that the alternative hypothesis H1 is very general. If H0 is rejected,
nothing can be said about the structure of the dependence of X and Y from the
χ²-value itself.

Example 10.8.1
Consider the following contingency table. Here, X describes the educational
level (1: primary, 2: secondary, 3: tertiary) and Y the preference for a specific
political party (1: Party A, 2: Party B, 3: Party C). Our null hypothesis is that the
two variables are independent, and we want to show the alternative hypothesis
which says that there is a relationship between them.

Y Total
1 2 3
X 1 100 200 300 600
2 100 100 100 300
3 80 10 10 100
Total 280 310 410 1000

For the (estimated) expected frequencies , we get

Y
1 2 3
X 1 168 186 246
2 84 93 123
3 28 31 41

For example, m̂11 = 600 · 280/1000 = 168. The test statistic is
t(x, y) = (100 − 168)²/168 + (200 − 186)²/186 + … + (10 − 41)²/41 ≈ 182.5.
Since t(x, y) > 9.49 = c4;0.95, H0 is rejected.
In R, either the summarized data (as shown below) can be used to calculate
the test statistic or the raw data (summarized in a contingency table via
table(var1,var2)):
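A sketch using the summarized data from the table above:

ct <- matrix(c(100, 200, 300,
               100, 100, 100,
                80,  10,  10), nrow = 3, byrow = TRUE)   # rows: education; columns: party
chisq.test(ct)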

The output reports the test statistic, the degrees of freedom, and the p-value;
the critical value can be obtained via qchisq(p = 0.95, df = 4),
which confirms our earlier manual calculations. The p-value is smaller than
α = 0.05, which further confirms that the null hypothesis has to be rejected.

For a binary outcome, the χ²-test of independence can be formulated as a test for
the null hypothesis that the proportions of the binary variable are equal in several
(K ≥ 2) groups, i.e. for a K × 2 (or 2 × K) table. This test is called the χ²-test of
homogeneity.

Example 10.8.2
Consider two variables X and Y, where X is describing the rating of a coffee
brand with the categories “bad taste” and “good taste” and Y denotes three age
subgroups, e.g. “18–25”, “25–35”, and “35–45”. The observed data is

Y
18–25 25–35 35–45 Total
X Bad 10 30 65 105
Good 90 70 35 195
Total 100 100 100 300

Assume H0 is the hypothesis that the probabilities of a bad rating in the three age
groups, P(X = bad | Y = 18–25), P(X = bad | Y = 25–35), and P(X = bad | Y = 35–45),
are all equal. Then, we can use either the prop.test or the chisq.test function in R
to test this hypothesis:
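A sketch of both calls, with the counts taken from the table above:

counts <- matrix(c(10, 90,
                   30, 70,
                   65, 35), ncol = 2, byrow = TRUE)   # rows: age groups; columns: bad, good
prop.test(counts)       # test of equal proportions of bad ratings in the three age groups
chisq.test(t(counts))   # chi-square test on the corresponding 2 x 3 table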

Both commands produce output containing the test statistic and the p-value.
The results (test statistic, p-value) are identical and H0 is rejected. Note that
prop.test strictly expects a K × 2 table (i.e. exactly 2 columns).

Remark 10.8.1
For 2 × 2 tables with small sample sizes and therefore small cell frequencies, it
is recommended to use the exact test of Fisher as described in Appendix C.5.

Remark 10.8.2
The test described in Example 10.8.2 is a special case (since one variable is
binary) of the general χ²-test of homogeneity. The χ²-test of homogeneity is
valid for any K × C table, where K is the number of subgroups of a variable Y
and C is the number of values of the outcome X of interest. The null hypothesis
assumes that the conditional distributions of X given Y are identical in all
subgroups, i.e.
P(X = xc | Y = yk) = P(X = xc | Y = yk′)
for all c = 1, 2, …, C and all k, k′ = 1, 2, …, K. Again, the usual χ²-test statistic can
be used.

10.9 Key Points and Further Issues


10.10 Exercises
Exercise 10.1
Two people, A and B, are suspects for having committed a crime together. Both
of them are interrogated in separate rooms. The jail sentence depends on who
confesses to have committed the crime, and who does not:

B does not confess B does confess


A does not confess Each serves 1 year A: 3 years; B: goes free
A does confess A: goes free; B: 3 years Each serves 2 years

A has two hypotheses:

Given the possible sentences he decides to not confess if is true and to


confess otherwise. Explain the concepts of type I error and type II error for this
situation. Comment on the consequences if these errors are made.

Exercise 10.2
A producer of chocolate bars hypothesizes that his production does not adhere to
the weight standard of 100 g. As a measure of quality control, he weighs 15 bars
and obtains the following results in grams:

96.40, 97.64, 98.48, 97.67, 100.11, 95.29, 99.80, 98.80, 100.53, 99.41, 97.64,
101.11, 93.43, 96.99, 97.92
It is assumed that the production process is standardized in the sense that the
variation is controlled to be .

(a) What are the hypotheses regarding the expected weight for a two-sided
test?

(b) Which test should be used to test these hypotheses?

(c) Conduct the test that was suggested to be used in (b). Use .

(d) The producer wants to show that the expected weight is smaller than 100 g.
What are the appropriate hypotheses to use?

(e) Conduct the test for the hypothesis in (d). Again use .

Exercise 10.3
Christian decides to purchase the new CD by Bruce Springsteen. His first
thought is to buy it online, via an online auction. He discovers that he can also
buy the CD immediately, without bidding at an auction, from the same online
store. He also looks at the price at an internet book store which was
recommended to him by a friend. He notes down the following prices (in €):
Internet book store 16.95
Online store, no auction 18.19, 16.98, 19.97, 16.98, 18.19, 15.99, 13.79,
15.90, 15.90, 15.90, 15.90, 15.90, 19.97, 17.72
Online store, auction 10.50, 12.00, 9.54, 10.55, 11.99, 9.30, 10.59, 10.50,
10.01, 11.89, 11.03, 9.52, 15.49, 11.02

(a) Calculate and interpret the arithmetic mean, variance, standard deviation,
and coefficient of variation for the online store, both for the auction and
non-auction offers.
(b) Test the hypothesis that the mean price at the online store (no auction) is
unequal to €16.95 ( ).

(c) Calculate a confidence interval for the mean price at the online store (no
auction) and interpret your findings in the light of the hypothesis in (b).

(d) Test the hypothesis that the mean price at the online store (auction) is less
than €16.95 ( ).

(e) Test the hypothesis that the mean non-auction price is higher than the mean
auction price. Assume that (i) the variances are equal in both samples and
(ii) the variances are unequal ( ).

(f) Test the hypothesis that the variance of the non-auction price is unequal to
the variance of the auction price ( ).

(g) Use the U-test to compare the location of the auction and non-auction
prices. Compare the results with those of (e).

(h) Calculate the results of (a)–(g) with R.

Exercise 10.4
Ten of Leonard’s best friends try a new diet: the “Banting” diet. Each of them
weighs him/herself before and after the diet. The data is as follows:

Person (i) 1 2 3 4 5 6 7 8 9 10
Before diet ( ) 80 95 70 82 71 70 120 105 111 90

After diet ( ) 78 94 69 83 65 69 118 103 112 88

Choose a test and a confidence interval to test whether there is a difference


between the mean weight before and after the diet ( ).

Exercise 10.5
A company producing clothing often finds deficient T-shirts among its
production.

(a) The company’s controlling department decides that the production is no


longer profitable when there are more than 10 % deficient shirts. A sample
of 230 shirts yields 30 shirts which contain deficiencies. Use the
approximate binomial test to decide whether the T-shirt production is
profitable or not ( ).

(b) Test the same hypothesis as in (a) using the exact binomial test. You can
use R to determine the quantiles needed for the calculation.

(c) The company is offered a new cutting machine. To test whether the change
of machine helps to improve the production quality, 115 sample shirts are
evaluated, 7 of which have deficiencies. Use the two-sample binomial test
to decide whether the new machine yields improvement or not ( ).

(d) Test the same hypothesis as in (c) using the test of Fisher in R.

Exercise 10.6
Two friends play a computer game and each of them repeats the same level 10
times. The scores obtained are:

1 2 3 4 5 6 7 8 9 10
Player 1 91 101 112 99 108 88 99 105 111 104
Player 2 261 47 40 29 64 6 87 47 98 351

(a) Player 2 insists that he is the better player and suggests to compare their
mean performance. Use an appropriate test ( ) to test this
hypothesis.

(b) Player 1 insists that he is the better player. He proposes to not focus on the
mean and to use the U-test for comparison. What are the advantages and
disadvantages of using this test compared with (a)? What are the results (
)?
Exercise 10.7
Otto loves gummy bears and buys 10 packets at a factory store. He opens all
packets and sorts them by their colour. He counts 222 white gummy bears, 279
red gummy bears, 251 orange gummy bears, 232 yellow gummy bears, and 266
green ones. He is disappointed since white (pineapple flavour) is his favourite
flavour. He hypothesizes that the producer of the bears does not uniformly
distribute the bears into the packets. Choose an appropriate test to find out
whether Otto’s speculation could be true.

Exercise 10.8
We consider Exercise 4.4 where we evaluated which of the passengers from the
Titanic were rescued. The data was summarized as follows:

1. Class 2. Class 3. Class Staff Total


Rescued 202 125 180 211 718
Not rescued 135 160 541 674 1510

(a) The hypothesis derived from the descriptive analysis was that travel class
and rescue status are not independent. Test this hypothesis.

(b) Interpret the following R output:

(c) Summarize the data in a 2 × 2 table: passengers from the first and second
class should be grouped together, and third class passengers and staff
should be grouped together as well. Is the probability of being rescued
higher in the first and second class? Provide an answer using the following
three tests: exact test of Fisher, χ²-independence test, and χ²-homogeneity
test. You can use R to conduct the test of Fisher.
Exercise 10.9
We are interested in understanding how well the t-test can detect differences with
respect to the mean. We use R to draw 3 samples each of 20 observations from
three different normal distributions: , , and .
The summary statistics of this experiment are as follows:
, ,

, ,

, .

(a) Use the t-test to compare the means of X and .

(b) Use the t-test to compare the means of X and .

(c) Interpret the results from (a) and (b).

Exercise 10.10
Access the theatre data described in Appendix A.4. The data summarizes a
survey conducted on visitors of a local Swiss theatre in terms of age, sex, annual
income, general expenditure on cultural activities, expenditure on theatre visits,
and the estimated expenditure on theatre visits in the year before the survey was
done.

(a) Compare the mean expenditure on cultural activities for men and women
using the Welch test ( ).

(b) Would the conclusions change if the two-sample t-test or the U-test were
used for comparison?

(c) Test the hypothesis that women spend on average more money on theatre
visits than men ( ).

(d) Compare the mean expenditure on theatre visits in the year of the survey
and the preceding year ( ).

Exercise 10.11
Use R to read in and analyse the pizza data described in Appendix A.4 (assume
).

(a) The manager’s aim is to deliver pizzas in less than 30 min and with a
temperature of greater than 65 °C. Use an appropriate test to evaluate
whether these aims have been reached on average.

(b) If it takes longer than 40 min to deliver the pizza, then the customers are
promised a free bottle of wine. This offer is only profitable if less than
15 % of deliveries are too late. Test the hypothesis .

(c) The manager wonders whether there is any relationship between the
operator taking the phone call and the pizza temperature. Assume that a hot
pizza is defined to be one with a temperature greater than 65 °C. Use the test of
Fisher, the χ²-independence test, and the χ²-test of homogeneity to test his
hypothesis.

(d) Each branch employs the same number of staff. It would thus be desirable
if each branch receives the same number of orders. Use an appropriate test
to investigate this hypothesis.

(e) Is the proportion of calls taken by each operator the same in each branch?

(f) Test whether there is a relationship between drivers and branches.

Exercise 10.12
The authors of this book went to visit historical sites in India. None of them has
a particularly strong interest in photography, and they speculated that each of
them would take about the same number of pictures on their trip. After returning
home, they counted 110, 118, and 105 pictures, respectively. Use an appropriate
test to find out whether their speculation was correct ( ).
Source Toutenburg, H., Heumann, C., Induktive Statistik, 4th edition, 2007,
Springer, Heidelberg
© Springer International Publishing Switzerland 2016
Christian Heumann, Michael Schomaker and Shalabh, Introduction to Statistics and Data Analysis ,
DOI 10.1007/978-3-319-46162-5_11

11. Linear Regression


Christian Heumann1 , Michael Schomaker2 and Shalabh3
(1) Department of Statistics, Ludwig-Maximilians-Universität München,
München, Germany
(2) Centre for Infectious Disease Epidemiology and Research, University of
Cape Town, Cape Town, South Africa
(3) Department of Mathematics and Statistics, Indian Institute of Technology
Kanpur, Kanpur, India

Christian Heumann
Email: Christian.heumann@stat.uni-muenchen.de

We learnt about various measures of association in Chap. 4. Such measures are


used to understand the degree of relationship or association between two
variables. The correlation coefficient of Bravais–Pearson, for example, can help
us to quantify the degree of a linear relationship, but it does not tell us about the
functional form of a relationship. Moreover, any association of interest may
depend on more than one variable, and we would therefore like to explore
multivariate relationships. For example, consider a situation where we measured
the body weights and the long jump results (distances) of school children taking
part in a competition. The correlation coefficient of Bravais–Pearson between
the body weight and distance jumped can tell us how strong or weak the
association between the two variables is. It can also tell us whether the distance
jumped increases or decreases with the body weight in the sense that a negative
correlation coefficient indicates shorter jumps for higher body weights, whereas
a positive coefficient implies longer jumps for higher body weights. Suppose we
want to know how far a child with known body weight, say 50 kg, will jump.
Such questions cannot be answered from the value and sign of the correlation
coefficient. Another question of interest may be to explore whether the
relationship between the body weight and distance jumped is linear or not. One
may also be interested in the joint effect of age and body weight on the distance
jumped. Could it be that older children, who have on average a higher weight
than younger children, perform better? What would be the association of weight
and the long jump results of children of the same age? Such questions are
addressed by linear regression analysis which we introduce in this chapter.
In many situations, the outcome does not depend on one variable but on
several variables. For example, the recovery time of a patient after an operation
depends on several factors such as weight, haemoglobin level, blood pressure,
body temperature, diet control, rehabilitation, and others. Similarly, the weather
depends on many factors such as temperature, pressure, humidity, speed of
winds, and others. Multiple linear regression, introduced in Sect. 11.6, addresses
the issue where the outcome depends on more than one variable.
To begin with, we consider only two quantitative variables X and Y in which
the outcome Y depends on X and we explore the quantification of their
relationship.

Examples
Examples of associations in which we might be interested in are:
body height (X) and body weight (Y) of persons,
speed (X) and braking distance (Y) measured on cars,
money invested (in €) in the marketing of a product (X) and sales figures for
this product (in €) (Y) measured in various branches,
amount of fertilizer used (X) and yield of rice (Y) measured on different
acres, and
temperature (X) and hotel occupation (Y) measured in cities.

11.1 The Linear Model


Consider the scatter plots from Fig. 4.2 on p. 80. Plotting X-values against Y-
values enables us to visualize the relationship between two variables.
Figure 11.1a reviews what a positive (linear) association between X and Y looks
like: the higher the X values, the higher the Y-values (and vice versa). The plot
indicates that there may be a linear relationship between X and Y. On the other
hand, Fig. 11.1b displays a scatter plot which shows no clear relationship
between X and Y. The R² measure, shown in the two figures and explained in
more detail later, equates to the squared correlation coefficient of Bravais–
Pearson.

Fig. 11.1 Scatter plots

To summarize the observed association between X and Y, we can postulate


the following linear relationship between them:
Y = α + βX.   (11.1)
This Eq. (11.1) represents a straight line where α is the intercept and β
represents the slope of the line. The slope indicates the change in the Y-value
when the X-value changes by one unit. If the sign of β is positive, it indicates
that the value of Y increases as the value of X increases. If the sign of β is
negative, it indicates that the value of Y decreases as the value of X increases.
When X = 0, then Y = α. If α = 0, then Y = βX represents a line passing through
the origin. Suppose the height and body weights in the example of school
children are represented in Fig. 11.1a. This has the following interpretation:
when the height of a child increases by 1 cm, then the body weight increases by
β kilograms. The slope β in Fig. 11.1a would certainly be positive because we
have a positive linear relationship between X and Y. Note that in this example,
the intercept term has no particular meaning because when the height is X = 0, the
body weight is Y = α. The scatter diagram in Fig. 11.1b does not exhibit any
clear relationship between X and Y. Still, the slope of a possible line would likely
be somewhere around 0.
It is obvious from Fig. 11.1a that by assuming a linear relationship between
X and Y, any straight line will not exactly match the data points in the sense that
it cannot pass through all the observations: the observations will lie above and
below the line. The line represents a model to describe the process generating the
data.
In our context, a model is a mathematical representation of the relationship
between two or more variables. A model has two components—variables (e.g.
X, Y) and parameters (e.g. α, β). A model is said to be linear if it is linear in its
parameters. A model is said to be nonlinear if it is nonlinear in its parameters
(e.g. β² instead of β). Now assume that each observation potentially deviates by
e from the line in Fig. 11.1a. The linear model in Eq. (11.1) can then be written
as follows to take this into account:
Y = α + βX + e.   (11.2)
Suppose we have n observations (x1, y1), (x2, y2), …, (xn, yn); then each
observation satisfies
yi = α + βxi + ei.   (11.3)
Each deviation ei is called an error. It represents the deviation of the data
points (xi, yi) from the regression line. The line in Fig. 11.1a is the fitted
(regression) line which we will discuss in detail later. We assume that the errors
ei are identically and independently distributed with mean 0 and constant
variance σ², i.e. E(ei) = 0 and Var(ei) = σ² for all i = 1, 2, …, n. We will discuss these
assumptions in more detail in Sect. 11.7.
In the model (11.2), Y is called the response, response variable,
dependent variable or outcome; X is called the covariate, regressor or
independent variable. The scalars α and β are the parameters of the model and
are described as regression coefficients or parameters of the linear model. In
particular, α is called the intercept term and β is called the slope parameter.
It may be noted that if the regression parameters α and β are known, then the
linear model is completely known. An important objective in identifying the
model is to determine the values of the regression parameters using the available
observations for X and Y. It can be understood from Fig. 11.1a that an ideal
situation would be when all the data points lie exactly on the line or, in other
words, the error ei is zero for each observation. It seems meaningful to
determine the values of the parameters in such a way that the errors are
minimized.
There are several methods to estimate the values of the regression
parameters. In the remainder of this chapter, we describe the methods of least
squares and maximum likelihood estimation.

11.2 Method of Least Squares


Suppose n sets of observations (xi, yi), i = 1, 2, …, n, are obtained on two
variables (X, Y) and are plotted in a scatter plot. An example of four
observations, (x1, y1), (x2, y2), (x3, y3), and (x4, y4), is given in Fig. 11.2.

Fig. 11.2 Regression line, observations, and errors

The method of least squares says that a line can be fitted to the given data
set such that the errors are minimized. This implies that one can determine α and
β such that the sum of the squared distances between the data points and the line
Y = α + βX is minimized. For example, in Fig. 11.2, the first data point (x1, y1)
does not lie on the plotted line and the deviation is e1 = y1 − (α + βx1). Similarly,
we obtain the differences of the other three data points from the line: e2, e3, e4. The
error is zero if the point lies exactly on the line. The problem we would like to
solve in this example is to minimize the sum of squares of e1, e2, e3, and e4, i.e.
min over α, β of (e1² + e2² + e3² + e4²).   (11.4)
We want the line to fit the data well. This can generally be achieved by
choosing α and β such that the squared errors of all the n observations are
minimized:
min over α, β of Σ ei² = Σ (yi − α − βxi)².   (11.5)
If we solve this optimization problem by the principle of maxima and
minima, we obtain estimates of α and β as
β̂ = Sxy/Sxx = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²,   α̂ = ȳ − β̂ x̄,   (11.6)
see Appendix C.6 for a detailed derivation. Here, α̂ and β̂ represent the
estimates of the parameters α and β, respectively, and are called the least
squares estimators of α and β, respectively. This gives us the model
ŷ = α̂ + β̂x, which is called the fitted model or the fitted regression line. The literal meaning
of “regression” is to move back. Since we are acquiring the data and then
moving back to find the parameters of the model using the data, it is called a
regression model. The fitted regression line describes the postulated
relationship between Y and X. The sign of $\hat\beta$ determines whether the relationship between X and Y is positive or negative. If the sign of $\hat\beta$ is positive, it indicates that if X increases, then Y increases too. On the other hand, if the sign of $\hat\beta$ is negative, it indicates that if X increases, then Y decreases. For any given value of X, say $x_i$, the predicted value $\hat y_i$ is calculated by
$\hat y_i = \hat\alpha + \hat\beta x_i$
and is called the ith fitted value.


If we compare the observed data point $(x_i, y_i)$ with the point $(x_i, \hat y_i)$ suggested (predicted, fitted) by the regression line, the difference between $y_i$ and $\hat y_i$ is called the residual and is given as
$\hat e_i = y_i - \hat y_i = y_i - (\hat\alpha + \hat\beta x_i). \qquad (11.7)$
This can be viewed as an estimator of the error $e_i$. Note that it is not really an estimator in the true statistical sense because $e_i$ is random and so it cannot be estimated. However, since the $e_i$ are unknown and $\hat e_i$ measures the same difference between the estimated and true values of the y’s, see for example Fig. 11.2, it can be treated as estimating the error $e_i$.

Example 11.2.1
A physiotherapist advises 12 of his patients, all of whom had the same knee
surgery done, to regularly perform a set of exercises. He asks them to record
how long they practise. He then summarizes the average time they practised (X,
time in minutes) and how long it takes them to regain their full range of motion
again (Y, time in days). The results are as follows:

i 1 2 3 4 5 6 7 8 9 10 11 12
24 35 64 20 33 27 42 41 22 50 36 31

90 65 30 60 60 80 45 45 80 35 50 45

To estimate the linear regression line $\hat y = \hat\alpha + \hat\beta x$, we first calculate $\bar x = 35.41$ and $\bar y = 57.08$. To obtain $\hat\alpha$ and $\hat\beta$ we need the following table, which lists for each observation i the values $x_i$ and $y_i$ together with the quantities needed to compute $S_{xy} = \sum_i (x_i - \bar x)(y_i - \bar y)$ and $S_{xx} = \sum_i (x_i - \bar x)^2$:

1 24 90 11.41 32.92 375.61 130.19


2 35 65 0.41 7.92 3.25 0.17
3 64 30 28.59 27.08 774.22 817.39
4 20 60 15.41 2.92 45.00 237.47
5 33 60 2.41 2.92 7.27 5.81
6 27 80 8.41 22.92 192.75 70.73
7 42 45 6.59 12.08 79.61 43.43
8 41 45 5.59 12.08 67.53 31.25
9 22 80 13.41 22.92 307.36 179.83
10 50 35 14.59 22.08 322.14 212.87
11 36 50 0.59 7.08 4.18 0.35
12 31 45 4.41 12.08 53.27 19.45
Total 2125.65 1748.94

Using (11.6), we can now easily find the least squares estimates $\hat\beta$ and $\hat\alpha$ as
$\hat\beta = \frac{S_{xy}}{S_{xx}} = \frac{-2125.65}{1748.94} \approx -1.22, \qquad \hat\alpha = \bar y - \hat\beta \bar x \approx 100.12.$
The fitted regression line is therefore
$\hat y = 100.12 - 1.22\, x.$
We can interpret the results as follows:


For an increase of 1 min in exercising, the recovery time decreases by 1.22 days because $\hat\beta = -1.22$. The negative sign of $\hat\beta$ indicates that the relationship between exercising time and recovery time is negative; i.e. as exercise time increases, the recovery time decreases.
When comparing two patients with a difference in exercising time of 10 min, the linear model estimates a difference in recovery time of 12.2 days because $10 \cdot 1.22 = 12.2$.

The model predicts an average recovery time of
$\hat y = 100.12 - 1.22 \cdot 38 \approx 53.8$
days for a patient who practises the set of exercises for 38 min.
If a patient did not exercise at all, the recovery time would be predicted as $\hat\alpha \approx 100.12$ days by the model. However, see item (i) in Sect. 11.2.1 below about interpreting a regression line outside of the observed data range.
We can also obtain these results by using R. The command lm() fits a linear model and provides the estimates of $\hat\alpha$ and $\hat\beta$.

We can draw the regression line onto a scatter plot using the command
abline, see also Fig. 11.3.
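A minimal sketch of these two steps, assuming the data of Example 11.2.1 are entered as vectors X and Y (the object names are chosen here for illustration):

# data from Example 11.2.1
X <- c(24, 35, 64, 20, 33, 27, 42, 41, 22, 50, 36, 31)   # exercising time (min)
Y <- c(90, 65, 30, 60, 60, 80, 45, 45, 80, 35, 50, 45)   # recovery time (days)
fit <- lm(Y ~ X)   # fits the simple linear regression model
coef(fit)          # estimates of alpha (intercept) and beta (slope)
plot(X, Y)         # scatter plot of the data
abline(fit)        # adds the fitted regression line, cf. Fig. 11.3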
Fig. 11.3 Scatter plot and regression line for Example 11.2.1

11.2.1 Properties of the Linear Regression Line


There are a couple of interesting results related to the regression line and the
least square estimates.

(i) As a rule of thumb, one should interpret the regression line


only in the interval . For example, if X denotes “Year”, with
observed values from 1999 to 2015, and Y denotes the “annual volume of
sales of a particular company”, then a prediction for the year 2030 may not
be meaningful or valid because a linear relationship discovered in the past
may not continue to hold in the future.

(ii) For the points $(x_i, \hat y_i)$, forming the regression line, we can write
$\hat y_i = \hat\alpha + \hat\beta x_i. \qquad (11.8)$

(iii) It follows for $x_i = \bar x$ that $\hat y_i = \bar y$, i.e. the point $(\bar x, \bar y)$ always lies on the regression line. The linear regression line therefore always passes through $(\bar x, \bar y)$.
(iv) The sum of the residuals is zero. The ith residual is defined as
$\hat e_i = y_i - \hat y_i = y_i - (\hat\alpha + \hat\beta x_i).$
The sum is therefore
$\sum_{i=1}^{n} \hat e_i = \sum_{i=1}^{n} \big(y_i - \hat\alpha - \hat\beta x_i\big) = 0. \qquad (11.9)$

(v) The arithmetic mean of $\hat y$ is the same as the arithmetic mean of y:
$\frac{1}{n}\sum_{i=1}^{n} \hat y_i = \bar y. \qquad (11.10)$

(vi) The least squares estimate $\hat\beta$ has a direct relationship with the correlation coefficient of Bravais–Pearson:
$\hat\beta = r \sqrt{\frac{S_{yy}}{S_{xx}}}. \qquad (11.11)$
The slope of the regression line is therefore proportional to the correlation coefficient r: a positive correlation coefficient implies a positive estimate of $\beta$ and vice versa. However, a stronger correlation does not necessarily imply a steeper slope in the regression analysis because the slope depends upon $\sqrt{S_{yy}/S_{xx}}$ as well.

11.3 Goodness of Fit


While one can easily fit a linear regression model, this does not mean that the
model is necessarily good. Consider again Fig. 11.1: In Fig. 11.1a, the regression
line is clearly capturing the linear trend seen in the scatter plot. The fit of the
model to the data seems good. Figure 11.1b shows however that the data points
vary considerably around the line. The quality of the model does not seem to be
very good. If we used the regression line to predict the data, we would
likely obtain poor results. It is obvious from Fig. 11.1 that the model provides a
good fit to the data when the observations are close to the line and capture a
linear trend.
A look at the scatter plot provides a visual and qualitative approach to
judging the quality of the fitted model. Consider Fig. 11.4a, b which both show a
linear trend in the data, but the observations in Fig. 11.4a are closer to the
regression line than those in Fig. 11.4b. Therefore, the goodness of fit is worse in
the latter figure and any quantitative measure should capture this.

Fig. 11.4 Different goodness of fit for different data

A quantitative measure for the goodness of fit can be derived by means of


variance decomposition of the data. The total variation of y is partitioned into
two components—sum of squares due to the fitted model and sum of squares
due to random errors in the data:

$\sum_{i=1}^{n}(y_i - \bar y)^2 = \sum_{i=1}^{n}(\hat y_i - \bar y)^2 + \sum_{i=1}^{n}(y_i - \hat y_i)^2. \qquad (11.12)$

The proof of the above equation is given in Appendix C.6.


The left-hand side of (11.12) represents the total variability in y with respect to $\bar y$. It is proportional to the sample variance (3.21) and is called $SQ_{\text{Total}}$ (total sum of squares). The first term on the right-hand side of the equation measures the variability which is explained by the fitted linear regression model ($SQ_{\text{Regression}}$, sum of squares due to regression); the second term relates to the error sum of squares ($SQ_{\text{Error}}$) and reflects the variation due to random error involved in the fitted model. Larger values of $SQ_{\text{Error}}$ indicate that the deviations of the observations from the regression line are large. This implies that the model provides a bad fit to the data and that the goodness of fit is low.
Obviously, one would like to have a fit in which the error sum of squares is
as small as possible. To judge the goodness of fit, one can therefore look at the
error sum of squares in relation to the total sum of squares: in an ideal situation,
if the error sum of squares equals zero, the total sum of squares is equal to the
sum of squares due to regression and the goodness of fit is optimal. On the other
hand, if the sum of squares due to error is large, it will make the sum of squares
due to regression smaller and the goodness of fit should be bad. If the sum of
squares due to regression is zero, it is evident that the model fit is the worst
possible. These thoughts are reflected in the criterion for the goodness of fit, also known as $R^2$:
$R^2 = \frac{SQ_{\text{Regression}}}{SQ_{\text{Total}}} = 1 - \frac{SQ_{\text{Error}}}{SQ_{\text{Total}}}. \qquad (11.13)$
It follows from the above definition that $0 \le R^2 \le 1$. The closer $R^2$ is to 1, the better the fit because $SQ_{\text{Error}}$ will then be small. The closer $R^2$ is to 0, the worse the fit, because $SQ_{\text{Error}}$ will then be large. If $R^2$ takes any other value, say $R^2 = 0.7$, it means that only 70 % of the variation in data is explained by the fitted model, and hence, in simple language, the model is 70 % good. An important point to remember is that $R^2$ is defined only when there is an intercept term in the model (an assumption we make throughout this chapter). So it is not used to measure the goodness of fit in models without an intercept term.

Example 11.3.1
Consider again Fig. 11.1: In Fig. 11.1a, the line and data points fit well together.
As a consequence is high, . Figure 11.1b shows data points with
large deviations from the regression line; therefore, is small, here .
Similarly, in Fig. 11.4a, an of 0.97 relates to an almost perfect model fit,
whereas in Fig. 11.4b, the model describes the data only moderately well (
).

Example 11.3.2
Consider again Example 11.2.1 where we analysed the relationship between
exercise intensity and recovery time for patients convalescing from knee surgery.
To calculate $R^2$, we need the following table, which lists for each observation i the observed value $y_i$, the fitted value $\hat y_i$, and the quantities needed to compute $SQ_{\text{Total}}$ and $SQ_{\text{Regression}}$:

1 90 70.84 32.92 1083.73 13.76 189.34


2 65 57.42 7.92 62.73 0.34 0.12
3 30 22.04 27.08 733.33 35.04 1227.80
4 60 75.72 2.92 8.53 18.64 347.45
5 60 59.86 2.92 8.53 2.78 7.73
6 80 67.18 22.92 525.33 10.10 102.01
7 45 48.88 12.08 145.93 8.2 67.24
8 45 50.10 12.08 145.93 6.83 48.72
9 80 73.28 22.92 525.33 16.20 262.44
10 35 39.12 22.08 487.53 17.96 322.56
11 50 56.20 7.08 50.13 0.88 0.72
12 45 62.30 12.08 145.93 5.22 27.25
Total 3922.96 2603.43

We calculate $R^2$ with these results as
$R^2 = \frac{\sum_{i=1}^{12}(\hat y_i - \bar y)^2}{\sum_{i=1}^{12}(y_i - \bar y)^2} = \frac{2603.43}{3922.96} \approx 0.66.$
We conclude that the regression model provides a reasonable but not perfect
fit to the data because 0.66 is not close to 0, but also not very close to 1. About
66 % of the variability in the data can be explained by the fitted linear regression
model. The rest is random error: for example, individual variation in the
recovery time of patients due to genetic and environmental factors, surgeon
performance, and others.
We can also obtain this result in R by looking at the summary of the linear
model:
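A sketch of how this can be done, reusing the vectors X and Y from the sketch in Example 11.2.1:

fit <- lm(Y ~ X)
summary(fit)            # the value is reported as "Multiple R-squared"
summary(fit)$r.squared  # extracts R^2 directly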

Please note that we give a detailed explanation of the model summary in


Sect. 11.7.

There is a direct relationship between $R^2$ and the correlation coefficient of Bravais–Pearson r:
$R^2 = r^2. \qquad (11.14)$
The proof is given in Appendix C.6.

Example 11.3.3
Consider again Examples 11.2.1 and 11.3.2 where we analysed the association of exercising time and time to regain full range of motion after knee surgery. We calculated $R^2 \approx 0.66$. We therefore know that the correlation coefficient is $r = -\sqrt{R^2} = -\sqrt{0.66} \approx -0.81$; the sign is negative because $\hat\beta$ is negative.

11.4 Linear Regression with a Binary Covariate


Until now, we have assumed that the covariate X is continuous. It is however
also straightforward to fit a linear regression model when X is binary, i.e. if X has
two categories. In this case, the values of X in the first category are usually
coded as 0 and the values of X in the second category are coded as 1. For
example, if the binary variable is “gender”, we replace the word “male” with the
number 0 and the word “female” with 1. We can then fit a linear regression
model using these numbers, but the interpretation differs from the interpretations
in case of a continuous variable. Recall the definition of the linear model, $y = \alpha + \beta x + e$ with $\mathrm{E}(e) = 0$; if $x = 0$ (male) then $\mathrm{E}(y \mid x = 0) = \alpha$ and if $x = 1$ (female), then $\mathrm{E}(y \mid x = 1) = \alpha + \beta$. Thus, $\alpha$ is the average value of Y for males, i.e. $\mathrm{E}(y \mid x = 0)$, and $\beta = \mathrm{E}(y \mid x = 1) - \mathrm{E}(y \mid x = 0)$. It follows that those subjects with $x = 1$ (e.g. females) have on average Y-values which are $\beta$ units higher than subjects with $x = 0$ (e.g. males).

Example 11.4.1
Recall Examples 11.2.1, 11.3.2, and 11.3.3 where we analysed the association of
exercising time and recovery time after knee surgery. We keep the values of Y
(recovery time, in days) and replace values of X (exercising time, in minutes)
with 0 for patients exercising for less than 40 min and with 1 for patients
exercising for 40 min or more. We have therefore a new variable X indicating
whether a patient is exercising a lot ($x = 1$) or not ($x = 0$). To estimate the linear regression line $\hat y = \hat\alpha + \hat\beta x$, we first calculate $\bar x = 4/12 = 0.33$ and $\bar y = 57.08$. To obtain $\hat\alpha$ and $\hat\beta$, we need the following table:

1 0 90 32.92 10.97 0.11

2 0 65 7.92 2.64 0.11


3 1 30 27.08 18.05 0.44

4 0 60 2.92 0.97 0.11

5 0 60 2.92 0.97 0.11

6 0 80 22.92 7.64 0.11

7 1 45 12.08 8.05 0.44

8 1 45 12.08 8.05 0.44

9 0 80 22.92 7.64 0.11

10 1 35 22.08 14.72 0.44

11 0 50 7.08 2.36 0.11

12 0 45 12.08 4.03 0.11

Total Total 72.34 2.64

We can now calculate the least squares estimates of $\alpha$ and $\beta$ using (11.6) as
$\hat\beta = \frac{S_{xy}}{S_{xx}} = \frac{-72.34}{2.64} \approx -27.4, \qquad \hat\alpha = \bar y - \hat\beta \bar x \approx 66.2.$
The fitted regression line is:
$\hat y = 66.2 - 27.4\, x.$
The interpretation is:

The average recovery time for patients doing little exercise ($x = 0$) is $\hat\alpha \approx 66.2$ days.
Patients exercising heavily ($x = 1$) have, on average, a 27.4 days shorter recovery period ($\hat\beta = -27.4$) than patients with a short exercise time ($x = 0$).
The average recovery time for patients exercising heavily is 38.8 days ($\hat\alpha + \hat\beta = 66.2 - 27.4 = 38.8$ for $x = 1$).
These interpretations are visualized in Fig. 11.5. Because “exercise time” (X) is considered to be a nominal variable (two categories representing the intervals [0; 40) and [40; ∞)), we can plot the regression line on a scatter plot as two parallel lines.
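One possible way to reproduce this analysis in R is sketched below, reusing the vectors X and Y from Example 11.2.1; the binary covariate is created with ifelse():

Xbin <- ifelse(X >= 40, 1, 0)   # 1 = exercising 40 min or more, 0 = less
fit_bin <- lm(Y ~ Xbin)
coef(fit_bin)   # intercept and slope; small differences to the values above
                # may occur due to rounding in the manual calculation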

Fig. 11.5 Scatter plot and regression lines for Examples 11.4.1 and 11.5.1

11.5 Linear Regression with a Transformed Covariate


Recall that a model is said to be linear when it is linear in its parameters. The definition of a linear model is not at all related to the linearity in the covariate. For example,
$y = \alpha + \beta^2 x + e$
is not a linear model because the right-hand side of the equation is not a linear function in $\beta$. However,
$y = \alpha + \beta x^2 + e$
is a linear model. This model can be fitted as for any other linear model: we obtain $\hat\alpha$ and $\hat\beta$ as usual and simply use the squared values of X instead of X. This can be justified by considering $X^{*} = X^2$, and then, the model is written as $y = \alpha + \beta X^{*} + e$ which is again a linear model. Only the interpretation changes: for each unit increase in the squared value of X, i.e. $X^{*}$, Y increases by $\beta$ units.
Such an interpretation is often not even needed. One could simply plot the regression line of Y on $X^2$ to visualize the functional form of the effect. This is also highlighted in the following example.

Example 11.5.1
Recall Examples 11.2.1, 11.3.2, 11.3.3, and 11.4.1 where we analysed the
association of exercising time and recovery time after knee surgery. We
estimated $\hat\beta$ as $-1.22$ by using X as it is, as a continuous variable, see also Fig. 11.3. When using a binary X, based on a cut-off of 40 min, we obtained $\hat\beta = -27.4$, see also Fig. 11.5. If we now use $\sqrt{X}$ rather than X, we obtain $\hat\beta = -15.1$. This means for an increase of 1 unit of the square root of exercising time, the recovery time decreases by 15.1 days. Such an interpretation will be difficult to understand for many people. It is better to plot the linear regression line $\hat y = \hat\alpha + \hat\beta\sqrt{x}$, see Fig. 11.5. We can see that the new nonlinear line (obtained from a linear model) fits the data nicely and it may even be preferable to a straight regression line. Moreover, the value of $R^2$ substantially increased with this modelling approach, highlighting that no exercising at all may severely delay recovery from the surgery (which is biologically more meaningful). In R, we obtain these results by either creating a new variable $\sqrt{X}$ or by using the I() command which allows specifying transformations in regression models.
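A sketch of both options (vectors X and Y as in Example 11.2.1):

# option 1: create the transformed variable explicitly
sqrtX <- sqrt(X)
summary(lm(Y ~ sqrtX))
# option 2: transform directly in the model formula using I()
summary(lm(Y ~ I(sqrt(X))))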

Generally, the covariate X can be replaced by any transformation of it, such as $\log(X)$, $\sqrt{X}$, or $X^2$. The details are discussed in Sect. 11.6.3. It is also
possible to compare the quality and goodness of fit of models with different
transformations (see Sect. 11.8).
11.6 Linear Regression with Multiple Covariates
Up to now, we have only considered two variables—Y and X. We therefore
assume that Y is affected by only one covariate X. However, in many situations,
there might be more than one covariate which affects Y and consequently all of
them are relevant to the analysis (see also Sect. 11.10 on the difference between
association and causation). For example, the yield of a crop depends on several
factors such as quantity of fertilizer, irrigation, and temperature, among others.
In another example, in the pizza data set (described in Appendix A.4), many
variables could potentially be associated with delivery time—driver, amount of
food ordered, operator handling the order, whether it is a weekend, and so on. A
reasonable question to ask is: Do different drivers have (on average) a different
delivery time—given they deal with the same operator, the amount of food
ordered is the same, the day is the same, etc. These kind of questions can be
answered with multiple linear regression . The model contains more than one,
say p, covariates $X_1, X_2, \ldots, X_p$. The linear model (11.2) can be extended as
$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + e. \qquad (11.15)$
Note that the intercept term is denoted here by $\beta_0$. In comparison with (11.2), $\beta_0 = \alpha$ and $\beta_1 = \beta$.

Example 11.6.1
For the pizza delivery data, a particular linear model with multiple covariates
could be specified as follows:

11.6.1 Matrix Notation


If we have a data set of n observations, then every set of observations satisfies model (11.15) and we can write
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + e_i, \quad i = 1, 2, \ldots, n. \qquad (11.16)$
It is possible to write the n equations in (11.16) in matrix notation as
$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e} \qquad (11.17)$
where $\mathbf{y} = (y_1, y_2, \ldots, y_n)'$, $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)'$, $\mathbf{e} = (e_1, e_2, \ldots, e_n)'$, and $\mathbf{X}$ is the $n \times (p+1)$ matrix whose ith row is $(1, x_{i1}, x_{i2}, \ldots, x_{ip})$.

The letters $\mathbf{y}$, $\mathbf{X}$, $\boldsymbol{\beta}$, and $\mathbf{e}$ are written in bold because they refer to vectors and matrices rather than scalars. The capital letter makes it clear that $\mathbf{X}$ is a matrix of order $n \times (p+1)$ representing the n observations on each of the covariates $X_1, X_2, \ldots, X_p$. Similarly, $\mathbf{y}$ is a $n \times 1$ vector of n observations on Y, $\boldsymbol{\beta}$ is a $(p+1) \times 1$ vector of regression coefficients associated with $X_1, X_2, \ldots, X_p$, and $\mathbf{e}$ is a $n \times 1$ vector of n errors. The lower case letter $\mathbf{x}$ relates to a vector representing a variable, which means we can denote the multiple linear model from now on also as
$\mathbf{y} = \beta_0 \mathbf{1} + \beta_1 \mathbf{x}_1 + \cdots + \beta_p \mathbf{x}_p + \mathbf{e} \qquad (11.18)$
where $\mathbf{1}$ is the $n \times 1$ vector of 1’s. We assume that $\mathrm{E}(\mathbf{e}) = \mathbf{0}$ and $\mathrm{Cov}(\mathbf{e}) = \sigma^2\mathbf{I}$ (see Sect. 11.9 for more details).
We would like to highlight that is not the data matrix. The matrix is
called the design matrix and contains both a column of 1’s denoting the
presence of the intercept term and all explanatory variables which are relevant to
the multiple linear model (including possible transformations and interactions,
see Sects. 11.6.3 and 11.7.3). The errors reflect the deviations of the
observations from the regression line and therefore the difference between the
observed and fitted relationships. Such deviations may occur for various reasons.
For example, the measured values can be affected by variables which are not
included in the model, the variables may not be accurately measured, there is
unmeasured genetic variability in the study population, and all of which are
covered in the error $\mathbf{e}$. The estimate of $\boldsymbol{\beta}$ is obtained by using the least squares principle by minimizing $\mathbf{e}'\mathbf{e} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$. The least squares estimate of $\boldsymbol{\beta}$ is given by
$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}. \qquad (11.19)$
The vector $\hat{\boldsymbol{\beta}}$ contains the estimates of $(\beta_0, \beta_1, \ldots, \beta_p)'$. We can interpret it as earlier: $\hat\beta_0$ is the estimate of the intercept which we obtain because we have added the column of 1’s (and is identical to $\alpha$ in (11.1)). The estimates $\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_p$ refer to the regression coefficients associated with the variables $x_1, x_2, \ldots, x_p$, respectively. The interpretation of $\hat\beta_j$ is that it represents the partial change in $y$ when the value of $x_j$ changes by one unit keeping all other covariates fixed.
A possible interpretation of the intercept term is that when all covariates equal zero then
$\mathrm{E}(y) = \beta_0. \qquad (11.20)$
There are many situations in real life for which there is no meaningful
interpretation of the intercept term because one or many covariates cannot take
the value zero. For instance, in the pizza data set, the bill can never be €0, and
thus, there is no need to describe the average delivery time for a given bill of €0.
The intercept term here serves the purpose of improvement of the model fit, but
it will not be interpreted.
In some situations, it may happen that the average value of y is zero when all
covariates are zero. Then, the intercept term is zero as well and does not improve
the model. For example, suppose the salary of a person depends on two factors
—education level and type of qualification. If any person is completely illiterate,
even then we observe in practice that his salary is never zero. So in this case,
there is a benefit of the intercept term. In another example, consider that the
velocity of a car depends on two variables—acceleration and quantity of petrol.
If these two variables take values of zero, the velocity is zero on a plane surface.
The intercept term will therefore be zero as well and yields no model
improvement.

Example 11.6.2
Consider the pizza data described in Appendix A.4. The data matrix is as
follows:

Suppose the manager has the hypothesis that the operator and the overall bill (as
a proxy for the amount ordered from the customer) influence the delivery time.
We can postulate a linear model to describe this relationship as
The model in matrix notation is as follows:

To understand the associations in the data, we need to estimate $\boldsymbol{\beta}$ which we obtain by the least squares estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$. Rather than doing this tiresome task manually, we use R:
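A sketch of this call, assuming the pizza data are available as a data frame pizza with variables time, bill, and operator (these variable names are assumptions and may differ from the actual coding of the data set):

fit <- lm(time ~ bill + operator, data = pizza)
summary(fit)   # estimated intercept and coefficients of bill and operator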

If we have more than one covariate in a linear regression model, we simply add all of them separated by the + sign when specifying the model formula in the lm() function. We obtain the estimates $\hat\beta_0$ (intercept), $\hat\beta_1 \approx 0.26$ (bill), and $\hat\beta_2 \approx 0.16$ (operator). The interpretation of these parameters is as follows:
For each extra € that is spent, the delivery time increases by 0.26 min. Or,
for each extra €10 spent, the delivery time increases on average by 2.6 min.
The delivery time is on average 0.16 min longer for operator 1 compared
with operator 0.
For operator 0 and a bill of €0, the expected delivery time is min.
However, there is no bill of €0, and therefore, the intercept cannot be
interpreted meaningfully here and is included only for improvement of the
model fit.

11.6.2 Categorical Covariates


We now consider the case when covariates are categorical rather than
continuous.

Examples
Region: East, West, South, North,
Marital status: single, married, divorced, widowed,
Day: Monday, Tuesday, , Sunday.

We have already described how to treat a categorical variable which consists of


two categories, i.e. a binary variable: one category is replaced by 1’s and the other category is replaced by 0’s. Subjects who belong to the category $x = 1$ have on average y-values which are $\beta$ units higher than those subjects with $x = 0$.
Consider a variable x which has more than two categories, say $k > 2$ categories. To include such a variable in a regression model, we create $k - 1$ new variables $x_1, x_2, \ldots, x_{k-1}$. Similar to how we treat binary variables, each of these variables is a dummy variable and consists of 1’s for units which belong to the category of interest and 0’s for the other categories:
$x_j = \begin{cases} 1 & \text{if the observation belongs to category } j, \\ 0 & \text{otherwise,} \end{cases} \qquad j = 1, 2, \ldots, k-1. \qquad (11.21)$
The category for which we do not create a dummy variable is called the reference category, and the interpretation of the parameters of the dummy variables is with respect to this category. For example, for category j, the y-values are on average $\beta_j$ higher than for the reference category. The concept of creating dummy variables and interpreting them is explained in the following example.

Example 11.6.3
Consider again the pizza data set described in Appendix A.4. The manager may
hypothesize that the delivery times vary with respect to the branch. There are $k = 3$ branches: East, West, Centre. Instead of using the variable branch directly, we create $k - 1 = 2$ new dummy variables, $x_1$ (East) and $x_2$ (West). We set $x_1 = 1$ for those deliveries which come from the branch in the East and set $x_1 = 0$ for other deliveries. Similarly, we set $x_2 = 1$ for those deliveries which come from the West and $x_2 = 0$ for other deliveries. Deliveries which come from the East therefore have $x_1 = 1$ and $x_2 = 0$, deliveries which come from the West have $x_1 = 0$ and $x_2 = 1$, and deliveries from the Centre have $x_1 = 0$ and $x_2 = 0$. The regression model of interest is thus
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + e.$

We can calculate the least squares estimate via R:


either (i) we create the dummy variables ourselves or (ii) we ask R to create it for
us. This requires that “branch” is a factor variable (which it is in our data, but if
it was not, then we would have to define it in the model formula via
as.factor()).
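A sketch of option (ii), again assuming a data frame pizza with the variables time and branch:

# branch is already a factor here; as.factor() would only be needed otherwise
fit <- lm(time ~ as.factor(branch), data = pizza)
summary(fit)   # R creates the two dummy variables internally; the reference
               # category is the first level of the factor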

We obtain the following results:

The interpretations are as follows:


The average delivery time for the branch in the Centre is 36.3 min, the delivery time for the branch in the East is $36.3 - 5.2 = 31.1$ min, and the predicted delivery time for the West is $36.3 - 1.1 = 35.2$ min.
Therefore, deliveries arrive on average 5.2 min earlier in the East compared with the Centre ($\hat\beta_1 = -5.2$), and deliveries in the West arrive on average 1.1 min earlier than in the Centre ($\hat\beta_2 = -1.1$). In other words, it takes 5.2 min less to deliver a pizza in the East than in the Centre. The deliveries in the West take on average 1.1 min less than in the Centre.

Consider now a covariate with $k = 3$ categories for which we create two new dummy variables, $x_1$ and $x_2$. The linear model is $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + e$.

For $x_1 = 1$, we obtain $\mathrm{E}(y \mid x_1 = 1, x_2 = 0) = \beta_0 + \beta_1$;

For $x_2 = 1$, we obtain $\mathrm{E}(y \mid x_1 = 0, x_2 = 1) = \beta_0 + \beta_2$; and

For the reference category ($x_1 = x_2 = 0$), we obtain $\mathrm{E}(y \mid x_1 = 0, x_2 = 0) = \beta_0$.

We can thus conclude that the intercept $\beta_0 = \mathrm{E}(y \mid x_1 = 0, x_2 = 0)$ describes the average y for the reference category, that $\beta_1 = \mathrm{E}(y \mid x_1 = 1, x_2 = 0) - \mathrm{E}(y \mid x_1 = 0, x_2 = 0)$ describes the difference between the first and the reference category, and that $\beta_2 = \mathrm{E}(y \mid x_1 = 0, x_2 = 1) - \mathrm{E}(y \mid x_1 = 0, x_2 = 0)$ describes the difference between the second and the reference category.

Remark 11.6.1
There are other ways of recoding categorical variables, such as effect coding .
However, we do not describe them in this book.

11.6.3 Transformations
As we have already outlined in Sect. 11.5, if $X$ is transformed by a function $T(X)$ then
$y = \beta_0 + \beta_1 T(x) + e \qquad (11.22)$
is still a linear model because the model is linear in its parameters $\beta_0$ and $\beta_1$. Popular transformations $T(X)$ are $\log(X)$, $\sqrt{X}$, $\exp(X)$, and $X^2$, among others. The choice of such a function is typically motivated by the application and data at hand. Alternatively, a relationship between $y$ and $x$ can be modelled via a polynomial as follows:
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + e. \qquad (11.23)$

Example 11.6.4
Consider again Examples 11.2.1, 11.3.2, 11.3.3, 11.4.1, and 11.5.1 where we
analysed the association of intensity of exercising and recovery time after knee
surgery. The linear regression line was estimated as
$\hat y = 100.12 - 1.22\, x.$
One could question whether the association is indeed linear and fit the second- and third-order polynomials:
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + e, \qquad y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + e.$
To obtain the estimates we can use R:
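A sketch of the two model fits (vectors X and Y as in Example 11.2.1); I() protects the powers inside the model formula:

fit2 <- lm(Y ~ X + I(X^2))            # second-order polynomial
fit3 <- lm(Y ~ X + I(X^2) + I(X^3))   # third-order polynomial
coef(fit2)
coef(fit3)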

The estimated coefficients for the two models are returned by R. While it is difficult to interpret these coefficients directly, we can simply plot the
regression lines on a scatter plot (see Fig. 11.6).

Fig. 11.6 Scatter plot and regression lines for Example 11.6.4

We see that the regression based on the quadratic polynomial visually fits the
data slightly better than the linear polynomial. It seems as if the relation between
recovery time and exercise time is not exactly linear and the association is better
modelled through a second-order polynomial. The regression line based on the
cubic polynomial seems to be even closer to the measured points; however, the
functional form of the association looks questionable. Driven by a single data
point, we obtain a regression line which suggests a heavily increased recovery
time for exercising times greater than 65 min. While it is possible that too much
exercising causes damage to the knee and delays recovery, the data does not
seem to be clear enough to support the shape suggested by the model. This
example illustrates the trade-off between fit (i.e. how good the regression line
fits the data) and parsimony (i.e. how well a model can be understood if it
becomes more complex). Section 11.8 explores this non-trivial issue in more
detail.

Transformations of the Outcome Variable. It is also possible to apply


transformations to the outcome variable. While in many situations, this makes
interpretations quite difficult, a log transformation is quite common and easy to interpret. Consider the log-linear model
$\log y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + e$
where y is required to be greater than 0. Exponentiating on both sides leads to
$y = e^{\beta_0} \cdot e^{\beta_1 x_1} \cdots e^{\beta_p x_p} \cdot \exp(e).$
It can be easily seen that a one unit increase in $x_j$ multiplies $y$ by $e^{\beta_j}$. Therefore, if $y$ is log-transformed, one simply interprets $\exp(\beta_j)$ instead of $\beta_j$. For example, if $y$ is the yearly income, $x_j$ denotes gender (1 = male, 0 = female), and $\hat\beta_j = 0.2$ then one can say that a man’s income is $\exp(0.2) \approx 1.22$ times higher (i.e. 22 %) than a woman’s.

11.7 The Inductive View of Linear Regression


So far we have introduced linear regression as a method which describes the
relationship between dependent and independent variables through the best fit to
the data. However, as we have highlighted in earlier chapters, often we are
interested not only in describing the data but also in drawing conclusions from a
sample about the population of interest. For instance, in an earlier example, we
have estimated the association of branch and delivery time for the pizza delivery
data. The linear model was estimated as ;
this indicates that the delivery time is on average 5.2 min shorter for the branch
in the East of the town compared with the central branch and 1.1 min shorter for
the branch in the West compared with the central branch. When considering all
pizza deliveries (the population), not only those collected by us over the period
of a month (the sample), we might ask ourselves whether 1.1 min is a real
difference valid for the entire population or just caused by random error involved
in the sample we chose. In particular, we would also like to know what the 95 %
confidence interval for our parameter estimate is. Does the interval cover
“zero”? If yes, we might conclude that we cannot be very sure that there is an
association and do not reject the null hypothesis that . In reference to
Chaps. 9 and 10, we now apply the concepts of statistical inference and testing in
the context of regression analysis.
Point and interval estimation in the linear model. We now rewrite the model for the purpose of introducing statistical inference to linear regression as
$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}, \qquad \mathbf{e} \sim N(\mathbf{0}, \sigma^2\mathbf{I}), \qquad (11.24)$
where $\mathbf{y}$ is the $n \times 1$ vector of outcomes and $\mathbf{X}$ is the $n \times (p+1)$ design matrix (including a column of 1’s for the intercept). The identity matrix $\mathbf{I}$ consists of 1’s on the diagonal and 0’s elsewhere, and the parameter vector is $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)'$. We would like to estimate $\boldsymbol{\beta}$ to make conclusions about a relationship in the population. Similarly, $\mathbf{e}$ reflects the random errors in the population. Most importantly, the linear model now contains assumptions about the errors. They are assumed to be normally distributed, $\mathbf{e} \sim N(\mathbf{0}, \sigma^2\mathbf{I})$, which means that the expectation of the errors is $\mathrm{E}(e_i) = 0$, the variance is $\mathrm{Var}(e_i) = \sigma^2$ (and therefore the same for all $i$), and it follows from $\mathrm{Cov}(\mathbf{e}) = \sigma^2\mathbf{I}$ that $\mathrm{Cov}(e_i, e_j) = 0$ for all $i \ne j$. The assumption of a normal distribution is required to construct confidence intervals and test statistics. More details about these assumptions and the procedures to test their validity on the basis of a given sample of data are explained in Sect. 11.9.
The least squares estimate of $\boldsymbol{\beta}$ is obtained by
$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}. \qquad (11.25)$
It can be shown that $\hat{\boldsymbol{\beta}}$ follows a normal distribution with mean $\mathrm{E}(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}$ and covariance matrix
$\mathrm{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}. \qquad (11.26)$
Note that $\hat{\boldsymbol{\beta}}$ is unbiased (since $\mathrm{E}(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}$); more details about (11.26) can be found in Appendix C.6. An unbiased estimator of $\sigma^2$ is
$\hat\sigma^2 = \frac{(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})}{n - (p+1)} = \frac{\sum_{i=1}^{n}\hat e_i^2}{n - (p+1)}. \qquad (11.27)$
The errors are estimated from the data as $\hat{\mathbf{e}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ and are called residuals.
Before giving a detailed data example, we would like to outline how to draw
conclusions about the population of interest from the linear model. As we have
seen, both and are unknown in the model and are estimated from the data
using (11.25) and (11.27). These are our point estimates. We note that if $\beta_j = 0$, then $\beta_j x_j = 0$, and then, the model will not contain the term $\beta_j x_j$. This means that the covariate $x_j$ does not contribute to explaining the variation in $y$. Testing the hypothesis $\beta_j = 0$ is therefore equivalent to testing whether $x_j$ is associated with $y$ or not in the sense that it helps to explain the variations in $y$ or not. To test whether the point estimate is different from zero, or whether the deviations of the estimates from zero can be explained by random variation, we have the following options:

1. The $(1-\alpha) \cdot 100\,\%$ confidence interval for each $\beta_j$ is
$\hat\beta_j \pm t_{n-p-1;\,1-\alpha/2} \cdot \hat\sigma_{\hat\beta_j} \qquad (11.28)$
with $\hat\sigma_{\hat\beta_j} = \sqrt{\hat\sigma^2 s_{jj}}$, where $s_{jj}$ is the jth diagonal element of the matrix $(\mathbf{X}'\mathbf{X})^{-1}$.
If the confidence interval does not overlap 0, then we can conclude that $\beta_j$ is different from 0 and therefore $x_j$ is associated with $y$ and it is a relevant variable. If the confidence interval includes 0, we cannot conclude that there is an association between $x_j$ and $y$.

2. We can formulate a formal test which tests if $\beta_j$ is different from 0. The hypotheses are:
$H_0: \beta_j = 0 \quad \text{versus} \quad H_1: \beta_j \ne 0.$
The test statistic
$T = \frac{\hat\beta_j}{\hat\sigma_{\hat\beta_j}}$
follows a $t_{n-p-1}$ distribution under $H_0$. If $|T| > t_{n-p-1;\,1-\alpha/2}$, we can reject the null hypothesis (at $\alpha$ level of significance); otherwise, we accept it.

3. The decisions we get from the confidence interval and the T-statistic from
points 1 and 2 are identical to those obtained by checking whether the p-
value (see also Sect. 10.2.6) is smaller than , in which case we also reject
the null hypothesis.

Example 11.7.1
Recall Examples 11.2.1, 11.3.2, 11.3.3, 11.4.1, 11.5.1, and 11.6.4 where we
analysed the association of exercising time and time of recovery from knee
surgery. We can use R to obtain a full inductive summary of the linear model:
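A sketch of the call and of how the reported quantities can be accessed (X and Y as before):

fit <- lm(Y ~ X)
summary(fit)                # coefficients, standard errors, t-values, p-values,
                            # residual standard error, R-squared, overall F-test
confint(fit, level = 0.95)  # 95% confidence intervals for alpha and beta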

Now, we explain the meaning and interpretation of different terms involved


in the output.
Under “Estimate”, the parameter estimates are listed and we read that the
linear model is fitted as .
The subtle differences to Example 11.2.1 are due to rounding of numerical
values.
The variance is estimated as (“Residual standard error”). This is
easily calculated manually using the residual sum of squares:
.

The standard errors are listed under “Std. Error”. Given that
, and therefore , we can construct a
confidence interval for the slope parameter of exercising time:

The interval does not include 0, and we therefore conclude that there is an
association between exercising time and recovery time. The random error
involved in the data is not sufficient to explain the deviation of
from 0 (given ).
We therefore reject the null hypothesis that . This can also be seen by
comparing the test statistic (listed under “t value” and obtained by
) with . Moreover,
. We can say that there is a significant association
between exercising and recovery time.

Sometimes, one is interested in whether a regression model is useful in the sense that at least one of the $\beta_j$’s is different from zero, so that we can conclude that there is an association between $y$ and at least one of the covariates $x_1, x_2, \ldots, x_p$. The null hypothesis is
$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$
and can be tested by the overall F-test
$F = \frac{SQ_{\text{Regression}}/p}{SQ_{\text{Error}}/(n-p-1)}.$
The null hypothesis is rejected if $F > F_{p,\,n-p-1;\,1-\alpha}$. Note that the null hypothesis in


this case tests only the equality of slope parameters and does not include the
intercept term.

Example 11.7.2
In this chapter, we have already explored the associations between branch,
operator, and bill with delivery time for the pizza data (Appendix A.4). If we fit
a multiple linear model including all of the three variables, we obtain the
following results:
By looking at the p-values in the last column, one can easily see (without
calculating the confidence intervals or evaluating the t-statistic) that there is a
significant association between the bill and delivery time; also, it is evident that
the average delivery time in the branch in the East is significantly different (
3 min less) from the central branch (which is the reference category here).
However, the estimated difference in delivery times for both the branches in the
West and the operator was not found to be significantly different from zero. We
conclude that some variables in the model are associated with delivery time,
while for others, we could not show the existence of such an association. The
last line of the output confirms this: the overall F-test has a test statistic of 97.87
which is larger than the respective critical value; therefore, the p-value is smaller than 0.05 and the null hypothesis is rejected. The test suggests
that there is at least one variable which is associated with delivery time.

11.7.1 Properties of Least Squares and Maximum


Likelihood Estimators
The least squares estimator of $\boldsymbol{\beta}$ has several properties:

1. The least squares estimator of $\boldsymbol{\beta}$ is unbiased, $\mathrm{E}(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}$ (see Appendix C.6 for the proof).
2. The estimator $\hat\sigma^2$ as introduced in equation (11.27) is also unbiased, i.e. $\mathrm{E}(\hat\sigma^2) = \sigma^2$.

3. The least squares estimator of $\boldsymbol{\beta}$ is consistent, i.e. $\hat{\boldsymbol{\beta}}$ converges to $\boldsymbol{\beta}$ as n approaches infinity.

4. The least squares estimator of $\boldsymbol{\beta}$ is asymptotically normally distributed.

5. The least squares estimator of $\boldsymbol{\beta}$ has the smallest variance among all linear and unbiased estimators (best linear unbiased estimator) of $\boldsymbol{\beta}$.
We do not discuss the technical details of these properties in detail. It is more
important to know that the least squares estimator has good features and that is
why we choose it for fitting the model. Since we use a “good” estimator, we
expect that the model will also be “good”. One may ask whether it is possible to use a different estimator. We have already made distributional assumptions in the model: we require the errors to be normally distributed. Given this assumption, it is indeed possible to apply the maximum likelihood principle to obtain estimates for $\boldsymbol{\beta}$ and $\sigma^2$ in our set-up.

Theorem 11.7.1
For the linear model (11.24), the least squares estimator and the maximum likelihood estimator for $\boldsymbol{\beta}$ are identical. However, the maximum likelihood estimator of $\sigma^2$ is $\hat\sigma^2_{\mathrm{ML}} = \frac{1}{n}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})$, which is a biased estimator of $\sigma^2$, but it is asymptotically unbiased.

How to obtain the maximum likelihood estimator for the linear model is
presented in Appendix C.6. The important message of Theorem 11.7.1 is that no
matter whether we apply the least squares principle or the maximum likelihood
principle, we always obtain ; this does not apply when estimating
the variance, but it is an obvious choice to go for the unbiased estimator (11.27)
in any given analysis.
11.7.2 The ANOVA Table
A table that is frequently shown by software packages when performing
regression analysis is the analysis of variance (ANOVA) table. This table can
have several meanings and interpretations and may look slightly different
depending on the context in which it is used. We focus here on its interpretation
i) as a way to summarize the effect of categorical variables and ii) as a
hypothesis test to compare k means. This is illustrated in the following example.

Example 11.7.3
Recall Example 11.6.3 where we established the relationship between branch
and delivery time as

The R output for the respective linear model is as follows:

We see that the average delivery time of the branches in the East and the Centre (reference category) is different and that the average delivery time of the branches in the West and the Centre is different (because the respective p-values are small). This is useful information, but it does not answer the question if branch as a whole influences the delivery time. It seems that this is the case, but the hypothesis we may have in mind may be
$H_0: \mu_{\text{East}} = \mu_{\text{West}} = \mu_{\text{Centre}}$
which corresponds to
$H_0: \beta_1 = \beta_2 = 0$
in the context of the regression model. These are two identical hypotheses because in the regression set-up, we are essentially comparing three conditional means $\mathrm{E}(Y \mid \text{Centre})$, $\mathrm{E}(Y \mid \text{East})$, and $\mathrm{E}(Y \mid \text{West})$. The ANOVA table summarizes the corresponding F-test which tests this hypothesis:
We see that the null hypothesis of 3 equal means is rejected because p is
close to zero.

What does this table mean more generally? If we deal with linear regression with
one (possibly categorical) covariate, the table will be as follows:

      df        Sum of squares       Mean squares                          F-statistic

Var   p         $SQ_{\text{Reg}}$         MSR $= SQ_{\text{Reg}}/p$                 MSR/MSE

Res   $n-p-1$   $SQ_{\text{Error}}$       MSE $= SQ_{\text{Error}}/(n-p-1)$

The table summarizes the sum of squares regression and residuals (see
Sect. 11.3 for the definition), standardizes them by using the appropriate degrees
of freedom (df), and uses the corresponding ratio as the F-statistic. Note that in
the above example, this corresponds to the overall F-test introduced earlier. The
overall F-test tests the hypothesis that any is different from zero which is
identical to the hypothesis above. Thus, if we fit a linear regression model with
one variable, the ANOVA table will yield the same conclusions as the overall F-
test which we obtain through the main summary. However, if we consider a
multiple linear regression model, the ANOVA table may give us more
information.

Example 11.7.4
Suppose we are not only interested in the branch, but also in how the pizza
delivery times are affected by operator and driver. We may for example
hypothesize that the delivery time is identical for all drivers given they deliver
for the same branch and speak to the same operator. In a standard regression
output, we would get 4 coefficients for 5 drivers which would help us to
compare the average delivery time of each driver to the reference driver; it
would however not tell us if on an average, they all have the same delivery time.
The overall F-test would not help us either because it would test if any is
different from zero which includes coefficients from branch and operator, not
only driver. Using the anova command yields the results of the F-test of
interest:
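A sketch of such a call, assuming the pizza data frame contains the variables time, branch, operator, and driver:

fit <- lm(time ~ branch + operator + driver, data = pizza)
anova(fit)   # sequential F-tests; the row for driver tests whether the mean
             # delivery times of the drivers are equal, given branch and operator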
We see that the null hypothesis of equal delivery times of the drivers is
rejected. We can also test other hypotheses with this output: for instance, the null
hypothesis of equal delivery times for each operator is not rejected because
.

11.7.3 Interactions
It may be possible that the joint effect of some covariates affects the response.
For example, drug concentrations may have a different effect on men, woman,
and children; a fertilizer could work differently in different geographical areas;
or a new education programme may show benefit only with certain teachers. It is
fairly simple to target such questions in linear regression analysis by using
interactions . Interactions are measured as the product of two (or more)
variables. If either one or both variables are categorical, then one simply builds
products for each dummy variable, thus creating $(k-1) \cdot (l-1)$ new variables
when dealing with two categorical variables (with k and l categories
respectively). These product terms are called interactions and estimate how an
association of one variable differs with respect to the values of the other
variable. Now, we give examples for continuous–categorical, categorical–
categorical, and continuous–continuous interactions for two variables and .
Categorical–Continuous Interaction. Suppose one variable $x_1$ is categorical with k categories, and the other variable $x_2$ is continuous. Then, $k-1$ new variables have to be created, each consisting of the product of the continuous variable and a dummy variable, $x_2 \cdot x_{1j}$ for $j = 1, 2, \ldots, k-1$. These variables are added to the regression model in addition to the main effects relating to $x_1$ and $x_2$. It follows that for the reference category of $x_1$, the effect of $x_2$ is just the coefficient of $x_2$ (because each product term $x_2 \cdot x_{1j}$ is zero since each $x_{1j}$ is zero). However, the effect for all other categories is the sum of the coefficient of $x_2$ and the coefficient of the respective interaction term $x_2 \cdot x_{1j}$. Therefore, the association between $x_2$ and the outcome differs by this interaction coefficient between category j and the reference category. Testing whether the interaction coefficient is zero thus helps to identify whether there is an interaction effect with respect to category j or not.

Example 11.7.5
Consider again the pizza data described in Appendix A.4. We may be interested
in whether the association of delivery time and temperature varies with respect
to branch. In addition to time and branch, we therefore need additional
interaction variables. Since there are 3 branches, we need $3 - 1 = 2$ interaction variables which are essentially the product of (i) time and branch “East” and (ii) time and branch “West”. This can be achieved in R by using either the “*” operator (which will create both the main and interaction effects) or the “:” operator (which only creates the interaction term).
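A sketch of both specifications (outcome temperature; variable names of the pizza data assumed as before):

# '*' expands to the main effects plus the interaction
fit1 <- lm(temperature ~ time * branch, data = pizza)
# equivalent model with explicit main effects and ':' for the interaction only
fit2 <- lm(temperature ~ time + branch + time:branch, data = pizza)
summary(fit1)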
Fig. 11.7 Interaction of delivery time and branch in Example 11.7.5

The main effects of the model tell us that the temperature is almost 11
degrees higher for the eastern branch compared to the central branch (reference)
and about 1 degree higher for the western branch. However, only the former
difference is significantly different from 0 (since the p-value is smaller than
). Moreover, the longer the delivery time, the lower the temperature
(0.29 degrees for each minute). The parameter estimates related to the interaction
imply that this association differs by branch: the estimate is indeed
for the reference branch in the Centre, but the estimate for the branch in the East
is and the estimate for the branch in the West is
. However, the latter difference in the association of time
and temperature is not significantly different from zero. We therefore conclude
that the delivery time and pizza temperature are negatively associated and this is
more strongly pronounced in the eastern branch compared to the other two
branches. It is also possible to visualize this by means of a separate regression
line for each branch, see Fig. 11.7.
Centre: ,
East: ,
West: .

One can see that the pizzas delivered by the branch in the East are overall hotter
but longer delivery times level off that benefit. One might speculate that the
delivery boxes from the eastern branch are not properly closed and therefore—
despite the overall good performance—the temperature falls more rapidly over
time for this branch.
Categorical–Categorical Interaction. For two categorical variables and
, with k and l categories, respectively, new dummy variables
need to be created as follows:

The interpretation of the regression parameters of interest is less complicated


than it looks at first. If the interest is in , then the estimate referring to the
category of interest (i.e. ) is interpreted as usual—with the difference that it
relates only to the reference category of . The effect of may vary with
respect to , and the sum of the respective main and interaction effects for each
category of yields the respective estimates. These considerations are
explained in more detail in the following example.

Example 11.7.6
Consider again the pizza data. If we have the hypothesis that the delivery time
depends on the operator (who receives the phone calls), but the effect is different
for different branches, then a regression model with branch (3 categories, 2
dummy variables), operator (2 categories, one dummy variable), and their
interaction (2 new variables) can be used.
The interaction terms can be interpreted as follows:
If we are interested in the operator, we see that the delivery time is on
average 0.21 min shorter for operator “Melissa”. When this operator deals
with a branch other than the reference (Centre), the estimate changes to
min longer delivery in the case of branch “East” and
min for branch “West”.
If we are interested in the branches, we observe that the delivery time is
shortest for the eastern branch which has on average a 5.66 min shorter
delivery time than the central branch. However, this is the estimate for the
reference operator only; if operator “Melissa” is in charge, then the
difference in delivery times for the two branches is only
min. The same applies when comparing the western branch with the central
branch: depending on the operator, the difference between these two
branches is estimated as either or min,
respectively.
The interaction terms are not significantly different from zero. We therefore
conclude that the hypothesis of different delivery times for the two
operators, possibly varying by branch, could not be confirmed.

Continuous–Continuous Interaction. It is also possible to add an interaction of


two continuous variables. This is done by adding the product of these two
variables to the model. If $x_1$ and $x_2$ are two continuous variables, then $x_1 x_2$ is the new variable added in the model as an interaction effect as
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + e.$
Example 11.7.7
If we again consider the pizza data, with pizza temperature as an outcome, we
may wish to test for an interaction of bill and time.
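A sketch of this model (variable names of the pizza data assumed as before):

fit <- lm(temperature ~ bill * time, data = pizza)
summary(fit)   # reports the coefficients of bill, time, and the product bill:time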

The R output above reveals that there is a significant interaction effect. The
interpretation is more difficult here. It is clear that a longer delivery time and a
larger bill decrease the pizza’s temperature. However, for a large product of bill
and time (i.e., when both are large), these associations are less pronounced
because the negative coefficients become more and more outweighed by the
positive interaction term. On the contrary, for a small bill and short delivery
time, the temperature can be expected to be quite high.

Combining Estimates. As we have seen in the examples in this section, it can


make sense to combine regression estimates when interpreting interaction
effects. While it is simple to obtain the point estimates, it is certainly also
important to report their 95 % confidence intervals. As we know from Theorem
7.7.1, the variance of the combination of two random variables can be obtained
by $\mathrm{Var}(X \pm Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) \pm 2\,\mathrm{Cov}(X, Y)$. We can therefore estimate a confidence interval for the combined estimate $\hat\beta_i + \hat\beta_j$ as
$(\hat\beta_i + \hat\beta_j) \pm t_{n-p-1;\,1-\alpha/2} \cdot \hat\sigma_{(\hat\beta_i + \hat\beta_j)} \qquad (11.29)$
where $\hat\sigma_{(\hat\beta_i + \hat\beta_j)}$ is obtained from the estimated covariance matrix $\widehat{\mathrm{Cov}}(\hat{\boldsymbol{\beta}})$ via
$\hat\sigma_{(\hat\beta_i + \hat\beta_j)} = \sqrt{\widehat{\mathrm{Var}}(\hat\beta_i) + \widehat{\mathrm{Var}}(\hat\beta_j) + 2\,\widehat{\mathrm{Cov}}(\hat\beta_i, \hat\beta_j)}.$
Example 11.7.8
Recall Example 11.7.5 where we estimated the association between pizza
temperature, delivery time, and branch. There was a significant interaction effect
for time and branch. Using R, we obtain the covariance matrix as follows:
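One way to obtain it is via vcov(); the sketch below assumes the interaction model from Example 11.7.5:

fit <- lm(temperature ~ time * branch, data = pizza)
vcov(fit)   # estimated covariance matrix of the coefficient estimates; its
            # diagonal contains the squared standard errors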

The point estimate for the association between time and temperature in the
eastern branch is (see Example 11.7.5). The standard error is
calculated as . The confidence interval is
therefore .

Remark 11.7.1
If there is more than one interaction term, then it is generally possible to test
whether there is an overall interaction effect, such as in the
case of four interaction variables. These tests belong to the general class of
“linear hypotheses”, and they are not explained in this book. It is also possible to
create interactions between more than two variables. However, the interpretation
then becomes difficult.

11.8 Comparing Different Models


There are many situations where different multiple linear models can be fitted to
a given data set, but it is unclear which is the best one. This relates to the
question of which variables should be included in a model and which ones can
be removed.
For example, when we modelled the association of recovery time and
exercising time by using different transformations, it remained unclear which of
the proposed transformations was the best. In the pizza delivery example, we
have about ten covariates: Does it make sense to include all of them in the model
or do we understand the associations in the data better by restricting ourselves to
the few most important variables? Is it necessary to include interactions or not?
There are various model selection criteria to compare the quality of different
fitted models. They can be useful in many situations, but there are also a couple
of limitations which make model selection a difficult and challenging task.
Coefficient of Determination ( ). A possible criterion to check the
goodness of fit of a model is the coefficient of determination ( ). We discussed
its development, interpretation, and philosophy in Sect. 11.3 (see p. 256 for more
details).
Adjusted $R^2$. Although $R^2$ is a reasonable measure for the goodness of fit of a model, it also has some limitations. One limitation is that if the number of covariates increases, $R^2$ increases too (we omit the theorem and the proof for this statement). The added variables may not be relevant, but $R^2$ will still increase, wrongly indicating an improvement in the fitted model. This criterion is therefore not a good measure for model selection. An adjusted version of $R^2$ is defined as
$R^2_{\text{adj}} = 1 - \frac{n-1}{n-p-1}\,(1 - R^2). \qquad (11.30)$
It can be used to compare the goodness of fit of models with a different number of covariates. Please note that both $R^2$ and $R^2_{\text{adj}}$ are only defined when there is an intercept term in the model (which we assume throughout the chapter).

Example 11.8.1
In Fig. 11.6, the association of exercising time and recovery time after knee
surgery is modelled linearly, quadratically, and cubically; in Fig. 11.5, this
association is modelled by means of a square-root transformation. The model
summary in R returns both $R^2$ (under “Multiple R-squared”) and the adjusted $R^2$ (under “Adjusted R-squared”). The results are as follows:

Model   $R^2$   $R^2_{\text{adj}}$
Linear 0.6584 0.6243


Quadratic 0.6787 0.6074
Cubic 0.7151 0.6083
Square root 0.6694 0.6363

It can be seen that $R^2$ is larger for the models with more variables; i.e. the cubic model (which includes three variables) has the largest $R^2$. The adjusted $R^2$, which takes the different model sizes into account, favours the model with the square-root transformation. This model therefore provides the best fit to the data among the four models considered, judged by $R^2_{\text{adj}}$.

Akaike’s Information Criterion (AIC) is another criterion for model selection.


The AIC is based on likelihood methodology and has its origins in information
theory. We do not give a detailed motivation of this criterion here. It can be
written for the linear model as

(11.31)
The smaller the AIC, the better the model. The AIC takes not only the fit to the
data via into account but also the parsimony of the model via the term
. It is in this sense a more mature criterion than which considers only
the fit to the data via . Akaike’s Information Criterion can be calculated in
R via the extractAIC() command. There are also other commands, but the
results differ slightly because the different formulae use different constant terms.
However, no matter what formula is used, the results regarding the model choice
are always the same.

Example 11.8.2
Consider again Example 11.8.1. The adjusted $R^2$ preferred the model which includes
exercising time via a square-root transformation over the other options. The AIC
value for the model where exercise time is modelled linearly is 60.59, when
modelled quadratically 61.84, when modelled cubically 62.4, and 60.19 for a
square-root transformation. Thus, in line with , the AIC also prefers the
square-root transformation.
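A sketch of how these values can be obtained (models as in Example 11.8.1; the object names are chosen for illustration):

fit_lin  <- lm(Y ~ X)
fit_quad <- lm(Y ~ X + I(X^2))
fit_cub  <- lm(Y ~ X + I(X^2) + I(X^3))
fit_sqrt <- lm(Y ~ I(sqrt(X)))
# extractAIC() returns the equivalent degrees of freedom and the AIC value
extractAIC(fit_lin)
extractAIC(fit_quad)
extractAIC(fit_cub)
extractAIC(fit_sqrt)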

Backward selection. Two models, which differ only by one variable $x_j$, can be compared by simply looking at the test result for $H_0: \beta_j = 0$: if the null hypothesis is rejected, the variable is kept in the model; otherwise, the other model is chosen.
If there are more than two models, then it is better to consider a systematic
approach to comparing them. For example, suppose we have 10 potentially
relevant variables and we are not sure which of them to include in the final
model. There are $2^{10} = 1024$ possible different combinations of variables and in
turn so many choices of models! The inclusion or deletion of variables can be
done systematically with various procedures, for example with backward
selection (also known as backward elimination) as follows:

1. Start with the full model which contains all variables of interest,
.

2. Remove the variable which optimizes a criterion, i.e. leads to the


smallest AIC (the highest $R^2_{\text{adj}}$, the highest test statistic, or the least significant p-value) among all possible choices.

3. Repeat step 2. until a stop criterion is fulfilled, i.e. until no improvement


regarding AIC, , or the p-value can be achieved.
There are several other approaches. Instead of moving in the backward
direction, we can also move in the forward direction. Forward selection starts
with a model with no covariates and adds variables as long as the model
continues to improve with respect to a particular criterion. The stepwise
selection approach combines forward selection and backward selection: either
one does backward selection and checks in between whether adding variables
yields improvement, or one does forward selection and continuously checks
whether removing the already added variables improves the model.

Example 11.8.3
Consider the pizza data: if delivery time is the outcome, and branch, bill,
operator, driver, temperature, and number of ordered pizzas are potential
covariates, we may decide to include only the relevant variables in a final model.
Using the stepAIC function of the library MASS allows us to implement backward selection with R.
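A sketch of such a call; the covariate names of the pizza data (e.g. pizzas for the number of ordered pizzas) are assumptions and may differ from the actual data set:

library(MASS)
full <- lm(time ~ branch + bill + operator + driver + temperature + pizzas,
           data = pizza)
stepAIC(full, direction = "backward")   # removes variables step by step
                                        # as long as the AIC decreases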

At the first step, the R output states that the AIC of the full model is 4277.56.
Then, the AIC values are displayed when variables are removed: 4275.9 if
operator is removed, 4279.2 if driver is removed, and so on. One can see that the
AIC is minimized if operator is excluded.

Now R fits the model without operator. The AIC is 4275.88. Excluding
further variables does not improve the model with respect to the AIC.
We therefore conclude that the final “best” model includes all variables
considered, except operator. Using stepwise selection (with option both instead
of back) yields the same results. We could now interpret the summary of the
chosen model.

Limitations and further practical considerations. The above discussions of


variable selection approaches makes it clear that there are a variety of options to
compare fitted models, but there is no unique best way to do so. Different
criteria are based on different philosophies and may lead to different outcomes.
There are a few considerations that should however always be taken into
account:
If categorical variables are included in the model, then all respective
dummy variables need to be considered as a whole: all dummy variables of
this variable should either be removed or kept in the model. For example, a
variable with 3 categories is represented by two dummy variables in a
regression model. In the process of model selection, both of these variables
should either be kept in the model or removed, but not just one of them.
A similar consideration refers to interactions. If an interaction is added to
the model, then the main effects should be kept in the model as well. This
enables straightforward interpretations as outlined in Sect. 11.7.3.
The same applies to polynomials. If adding a cubic transformation of , the
squared transformation as well as the linear effect should be kept in the
model.
When applying backward, forward, or stepwise regression techniques, the
results may vary considerably depending on the method chosen! This may
give different models for the same data set. Depending on the strategy used
to add and remove variables, the choice of the criterion (e.g. AIC versus p-
values), and the detailed set-up (e.g. if the strategy is to add/remove a
variable if the p-value is smaller than , then the choice of is important),
one may obtain different results.
All the above considerations show that model selection is a non-trivial task
which should always be complemented by substantial knowledge of the
subject matter. If there are not too many variables to choose from, simply
reporting the full model can be meaningful too.
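As a hedged illustration of the hierarchy considerations above, the following sketch uses a hypothetical outcome y, a continuous covariate x1, and a factor g; the formula syntax ensures that main effects accompany the interaction and that lower-order polynomial terms accompany the cubic term.

    # x1 * g expands to x1 + g + x1:g, i.e. both main effects are kept with the interaction
    fit1 <- lm(y ~ x1 * g, data = dat)
    # a cubic effect of x1 together with its squared and linear terms
    fit2 <- lm(y ~ x1 + I(x1^2) + I(x1^3), data = dat)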

11.9 Checking Model Assumptions


The assumptions made for the linear model mostly relate to the error terms: $e_i \sim N(0, \sigma^2)$. This implies a couple of things: i) the difference between $y_i$ and its expectation $\mathbf{x}_i^\top \boldsymbol{\beta}$ (which is $e_i$) needs to follow a normal distribution, ii) the variance of each error is constant, i.e. $\mathrm{Var}(e_i) = \sigma^2$, and therefore does not depend on $i$, and iii) there is no correlation between the error terms, i.e. $\mathrm{Cov}(e_i, e_j) = 0$ for all $i \neq j$. We also assume that the specification $\mathbf{X}\boldsymbol{\beta}$ is adequate in the sense that we included all relevant variables and interactions in the model, and that the estimator $\hat{\boldsymbol{\beta}}$ is not unduly affected by influential observations or highly correlated variables. Now, we demonstrate how to check assumptions i) and ii).
Normality assumption. The errors $e_i$ are assumed to follow a normal distribution. We never know the errors themselves. The only possibility is to estimate the errors by means of the residuals $\hat{e}_i = y_i - \hat{y}_i$. To ensure that the estimated residuals are comparable to the scale of a standard normal distribution, they can be standardized via
$$\hat{e}_i^{*} = \frac{\hat{e}_i}{\hat{\sigma}\sqrt{1 - h_{ii}}},$$
where $h_{ii}$ is the $i$th diagonal element of the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$.
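As a sketch, the standardized residuals can be computed by hand from a fitted linear model and compared with R's built-in rstandard() function; the object name fit is an assumption.

    e    <- residuals(fit)          # estimated errors (residuals)
    h    <- hatvalues(fit)          # diagonal elements of the hat matrix
    sig  <- summary(fit)$sigma      # estimate of the error standard deviation
    estd <- e / (sig * sqrt(1 - h)) # standardized residuals
    all.equal(estd, rstandard(fit)) # agrees with R's built-in function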


Generally, the estimated standardized residuals are used to check the normality
assumption. To obtain a first impression about the error distribution, one can
simply plot a histogram of the standardized residuals. The normality assumption
is checked with a QQ-plot where the theoretical quantiles of a standard normal
distribution are plotted against the quantiles from the standardized residuals. If
the data approximately matches the bisecting line, then we have evidence for
fulfilment of the normality assumption (Fig. 11.8a), otherwise we do not
(Fig. 11.8b).

Fig. 11.8 Checking the normality assumption with a QQ-plot

Small deviations from the normality assumption are not a major problem, as the least squares estimator of $\boldsymbol{\beta}$ remains unbiased. However, confidence intervals and hypothesis tests rely on this assumption, and particularly for small sample sizes and/or stronger violations of the normality assumption, these intervals and test conclusions are no longer valid. In some cases, these problems can be fixed by transforming $y$.
Heteroscedasticity. If errors have a constant variance, we say that the errors
are homoscedastic and we call this phenomenon homoscedasticity . When the
variance of errors is not constant, we say that the errors are heteroscedastic. This
phenomenon is also known as heteroscedasticity . If the variance $\mathrm{Var}(e_i) = \sigma_i^2$ depends on $i$, then the variability of the $e_i$ will be different for different groups of observations. For example, the daily expenditure on food ($y$) may vary more among persons with a high income ($x$), so fitting a linear model yields stronger variability of the $e_i$ among higher income groups. Plotting the fitted values $\hat{y}_i$ (or alternatively $x_i$) against the standardized residuals (or a transformation thereof) can help to detect whether or not problems exist. If there is no pattern in the plot (random plot), then there is likely no violation of the assumption (see Fig. 11.10a). However, if there is a systematic trend, i.e. higher/lower variability for higher/lower $\hat{y}_i$, then this indicates heteroscedasticity (Fig. 11.10b, trumpet plot). The consequences are again that confidence intervals and tests may no longer be correct; again, in some situations, a transformation of $y$ can help.

Example 11.9.1
Recall Example 11.8.3 where we identified a good model to describe the
delivery time for the pizza data. Branch, bill, temperature, number of pizzas
ordered, and driver were found to be associated with delivery time. To explore
whether the normality assumption is fulfilled for this model, we can create a
histogram for the standardized residuals (using the R function rstandard()).
A QQ-plot is contained in the various model diagnostic plots of the plot()
function.
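A possible way to produce these diagnostics, assuming the model chosen in Example 11.8.3 is stored in an object called fit:

    estd <- rstandard(fit)          # standardized residuals
    hist(estd)                      # histogram of the standardized residuals
    qqnorm(estd); qqline(estd)      # QQ-plot against the standard normal distribution
    plot(fit, which = 2)            # QQ-plot from R's built-in model diagnostics
    plot(fit, which = 3)            # fitted values vs. sqrt(|standardized residuals|)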

Fig. 11.9 Checking the normality assumption in Example 11.9.1

Figure 11.9 shows a reasonably symmetric distribution of the residuals, maybe with a slightly longer tail to the right. As expected for a standard normal distribution, not many observations are larger than 2 or smaller than $-2$ (which are close to the 2.5 and 97.5 % quantiles). The QQ-plot also points towards a
normal error distribution because the observed quantiles lie approximately on the
bisecting line. It seems that the lowest residuals deviate a bit from the expected
normal distribution, but not extremely. The normality assumption does not seem
to be violated.
Fig. 11.10 Checking the heteroscedasticity assumption

Plotting the fitted values $\hat{y}_i$ against the square root of the absolute values of the standardized residuals ($\sqrt{|\hat{e}_i^*|}$, used by R for stability reasons) yields a plot
with no visible structure (see Fig. 11.10a). There is no indication of
heteroscedasticity.

Figure 11.10b shows an artificial example where heteroscedasticity occurs: the higher the fitted values, the higher the variation of the residuals. This is also
called a trumpet plot .

11.10 Association Versus Causation


There is a difference between association and causation: association says that
higher values of X relate to higher values of Y (or vice versa), whereas causation
says that because values of X are higher, values of Y are higher too. Regression
analysis, as introduced in this chapter, establishes associations between the $x$'s and $Y$, not causation, unless strong assumptions are made.
For example, recall the pizza delivery example from Appendix A.4. We have
seen in several examples and exercises that the delivery time varies by driver. In
fact, in a multiple linear regression model where Y is the delivery time, and the
covariates are branch, bill, operator, driver, and temperature, we get significant
differences of the mean delivery times of some drivers (e.g. Domenico and
Bruno). Does this mean that because Domenico is the driver, the pizzas are
delivered faster? Not necessarily. If Domenico drives the fastest scooter, then
this may be the cause of his speedy deliveries. However, we have not measured
the variable “type of scooter” and thus cannot take it into account. The result we
obtain from the regression model is still useful because it provides the manager
with useful hypotheses and predictions about his delivery times, but the
interpretation that the driver’s abilities cause shorter delivery times may not
necessarily be appropriate.
This example highlights that one of the most important assumptions to
interpret regression parameters in a causal way is to have measured (and used)
all variables which affect both the outcome Y and the variable of interest A
(e.g. the variable “driver” above). Another assumption is that the relationship
between all $x$'s and $Y$ is modelled correctly, for example as non-linear if appropriate. Moreover, we need some a priori assumptions about how the
variables relate to each other, and some technical assumptions need to be met as
well. The interested reader is referred to Hernan and Robins (2017) for more
details.

11.11 Key Points and Further Issues


11.12 Exercises
Exercise 11.1
The body mass index (BMI) and the systolic blood pressure of 6 people were
measured to study a cardiovascular disease. The data are as follows:

Body mass index          26  23  27  28  24  25
Systolic blood pressure  170 150 160 175 155 150

(a) The research hypothesis is that a high BMI relates to a high blood pressure.
Estimate the linear model where blood pressure is the outcome and BMI is
the covariate. Interpret the coefficients.

(b) Calculate $R^2$ to judge the goodness of fit of the model.

Exercise 11.2
A psychologist speculates that spending a lot of time on the internet has a
negative effect on children's sleep. Consider the following data on hours of deep sleep (Y) and hours spent on the internet (X), where $x_i$ and $y_i$ are the observations on internet time and deep sleep time of child $i$, respectively:

Child i 1 2 3 4 5 6 7 8 9
Internet time (in h) 0.3 2.2 0.5 0.7 1.0 1.8 3.0 0.2 2.3

Sleep time (in h) 5.8 4.4 6.5 5.8 5.6 5.0 4.8 6.0 6.1

(a) Estimate the linear regression model for the given data and interpret the
coefficients.

(b) Calculate $R^2$ to judge the goodness of fit of the model.

(c) Reproduce the results of a) and b) in R and plot the regression line.

(d) Now assume that we only distinguish between spending more than 1 hour on the internet ($x_i = 1$) and spending less than (or equal to) one hour on the internet ($x_i = 0$). Estimate the linear model again and compare the results. How can $\hat{\beta}$ now be interpreted? Describe how $\hat{\beta}$ changes if those who spend more than one hour on the internet are coded as 0 and the others as 1.
Exercise 11.3
Consider the following data on weight and height of 17 female students:

Student i  1   2   3   4   5   6   7   8   9
Weight y   68  58  53  60  59  60  55  62  58
Height x   174 164 164 165 170 168 167 166 160

Student i  10  11  12  13  14  15  16  17
Weight y   53  53  50  64  77  60  63  69
Height x   160 163 157 168 179 170 168 170

(a) Calculate the correlation coefficient of Bravais–Pearson (use


, ,
). What does this imply with respect to a linear regression of height on
weight?

(b) Now estimate and interpret the linear regression model where “weight” is
the outcome.

(c) Predict the weight of a student with a height of 175 cm.

(d) Now produce a scatter plot of the data (manually or by using R) and
interpret it.

(e) Add the following two points to the scatter plot and
. Speculate how the linear regression estimate will
change after adding these points.

(f) Re-estimate the model using all 19 observations and and


.

(g) Given the results of the two regression models: What are the general
implications with respect to the least squares estimator of ?
Exercise 11.4
To study the association of the monthly average temperature (in °C, X) and hotel occupation (in %, Y), we consider data from three cities: Polenca (Mallorca,
Spain) as a summer holiday destination, Davos (Switzerland) as a winter skiing
destination, and Basel (Switzerland) as a business destination.

Month Davos Polenca Basel


X Y X Y X Y
Jan 6 91 10 13 1 23
Feb 5 89 10 21 0 82
Mar 2 76 14 42 5 40
Apr 4 52 17 64 9 45
May 7 42 22 79 14 39
Jun 15 36 24 81 20 43
Jul 17 37 26 86 23 50
Aug 19 39 27 92 24 95
Sep 13 26 22 36 21 64
Oct 9 27 19 23 14 78
Nov 4 68 14 13 9 9
Dec 0 92 12 41 4 12

(a) Interpret the following regression model output where the outcome is
“hotel occupation” and “temperature” is the covariate.

(b) Interpret the following output where “city” is treated as a covariate and
“hotel occupation” is the outcome.

(c) Interpret the following output and compare it with the output from b):
(d) In the following multiple linear regression model, both “city” and
“temperature” are treated as covariates. How can the coefficients be
interpreted?

(e) Now consider the regression model for hotel occupation and temperature
fitted separately for each city: How can the results be interpreted and what
are the implications with respect to the models estimated in (a)–(d)? How
can the models be improved?

(f) Describe what the design matrix will look like if city, temperature, and the
interaction between them are included in a regression model.
(g) If the model described in (f) is fitted the output is as follows:

Interpret the results.

(h) Summarize the association of temperature and hotel occupation by city—


including 95 % confidence intervals—using the interaction model. The
covariance matrix is as follows:

Exercise 11.5
The theatre data (see Appendix A.4) describes the monthly expenditure on
theatre visits of 699 residents of a Swiss city. It captures not only the expenditure
on theatre visits (in SFR) but also age, gender, yearly income (in 1000 SFR), and
expenditure on cultural activities in general as well as expenditure on theatre
visits in the preceding year.

(a) The summary of the multiple linear model where expenditure on theatre
visits is the outcome is as follows:
How can the missing values [1] and [2] be calculated?

(b) Interpret the model diagnostics in Fig. 11.11.

(c) Given the diagnostics in (b), how can the model be improved? Plot a
histogram of theatre expenditure in R if you need further insight.

(d) Consider the model where theatre expenditure is log-transformed:

How can the coefficients be interpreted?

(e) Judge the quality of the model from d) by means of Figs. 11.12a and
11.12b. What do they look like compared with those from b)?
Fig. 11.11 Checking the model assumptions

Fig. 11.12 Checking the model assumptions

Exercise 11.6
Consider the pizza delivery data described in Appendix A.4.

(a) Read the data into R. Fit a multiple linear regression model with delivery
time as the outcome and temperature, branch, day, operator, driver, bill,
number of ordered pizzas, and discount customer as covariates. Give a
summary of the coefficients.

(b) Use R to calculate the 95 % confidence intervals of all coefficients. Hint:


the standard errors of the coefficients can be accessed either via the
covariance matrix or the model summary.

(c) Reproduce the least squares estimate of $\hat{\sigma}^2$. Hint: use residuals() to obtain the residuals.

(d) Now use R to estimate both $R^2$ and $R^2_{adj}$. Compare the results with the model output from a).

(e) Use backward selection by means of the stepAIC function from the
library MASS to find the best model according to AIC.

(f) Obtain $R^2_{adj}$ from the model identified in e) and compare it to the full model from a).

(g) Identify whether the model assumptions are satisfied or not.

(h) Are all variables from the model in (e) causing the delivery time to be
either delayed or improved?

(i) Test whether it is useful to add a quadratic polynomial of temperature to the model.

(j) Use the model identified in (e) to predict the delivery time of the last
captured delivery (i.e. number 1266). Use the predict() command to
ease the calculation of the prediction.

Source: Toutenburg, H., Heumann, C., Deskriptive Statistik, 7th edition, 2009, Springer, Heidelberg.
Appendix A: Introduction to R
Background
The open-source software R was developed as a free implementation of the
language S which was designed as a language for statistical computation,
statistical programming, and graphics. The main intention was to allow users to
explore data in an easy and interactive way, supported by meaningful graphical
representations. The statistical software R was originally created by Ross Ihaka
and Robert Gentleman (University of Auckland, New Zealand).

Fig. A.1 Screenshot of R Studio with the command window ( top left ), the output window (“console”, i.e.
R itself, bottom left ), and a plot window ( right ). Other windows (e.g. about the environment or package
updates) are closed

Installation and Basic Functionalities


The “base” R version, i.e. the software with its most relevant commands,
can be downloaded from https://www.r-project.org/ . After installing R , it is
recommended to install an editor too. An editor allows the user to
conveniently save and display R -code, submit this code to the R console
(i.e. the R software itself), and control the settings and the output. A popular
choice of editor is R Studio (free of charge) which can be downloaded from
https://www.rstudio.com/ (see Fig. A.1 for a screenshot of the software).
There are alternative good editors, for example “Tinn-R” ( http://
sourceforge.net/projects/tinn-r/ ).
A lot of additional user-written packages are available online and can be
installed within the R console or using the R menu. Within the console, the
install.packages("package to install") function can be
used. Please note that an internet connection is required.
Statistics has a close relationship to algebra: data sets can be seen as
matrices, and variables as vectors. R makes use of these structures and this is
why we first introduce data structure functionalities before explaining some of
the most relevant basic statistical commands.

R as a Calculator, Basic Data Structures and Arithmetic


Operations
The character # marks the beginning of a comment. All characters until the
end of the line are ignored by R . We use # to comment on our R -code.
If we know the name of a command we would like to use, and we want to
learn about its functionality, typing ?command in the R command line
prompt displays a help page, e.g.

displays a help page for the arithmetic mean function.


Using

shows application examples of the respective function.
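For instance, the help page and the examples of the arithmetic mean function can be requested as follows:

    ?mean           # opens the help page of the arithmetic mean function
    example(mean)   # runs the examples from that help page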


The command c(1,2,3,4,5) combines the numbers 1, 2, 3, 4 and 5 into
a vector (e.g. a variable).
Vectors can be assigned to an “object”. For example,

assigns a numeric vector of length 4 to the object X . In general, the arrow
sign (<-) is a very important concept to store data, summaries, and outputs
in objects (i.e. the name in front of the sign). Note that R is case
sensitive; i.e. X and x are two different variable names. Instead of "<-",
one can also use "=".
Sequences of numbers can be created using the seq and rep commands.
For example,

and

yield

and

respectively.
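A small sketch with illustrative values (the numbers themselves are arbitrary):

    X <- c(4, 7, 1, 9)               # a numeric vector of length 4 stored in object X
    x = c(2, 5)                      # "=" can be used instead of "<-"; X and x are different objects
    seq(from = 1, to = 9, by = 2)    # 1 3 5 7 9
    rep(c(1, 2), times = 3)          # 1 2 1 2 1 2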
Basic data structures are vectors, matrices, arrays, lists , and data frames
. They can contain numeric values, logical values or even characters
(strings). In the latter case, arithmetic operations are not allowed.
– A numeric vector of length 5 can be constructed by the command

The elements can be accessed by squared brackets: x[position]. For example, x[3] <- 4 assigns the value 4 to the third element of the object x . Logical
vectors containing the values TRUE and FALSE are also possible. In
arithmetic operations, TRUE corresponds to 1 and FALSE
corresponds to 0 (which is the default). Consider the following
example:
mean(c(TRUE, FALSE, TRUE, TRUE)) returns as output 0.75 because the mean of (1, 0, 1, 1) = (TRUE, FALSE, TRUE, TRUE) is 0.75.
– A matrix can be constructed by the matrix() command:

creates a matrix, where the data values are the natural numbers
which are stored row-wise in the matrix,

because of the parameter byrow=T (which is equivalent to byrow=TRUE ). The default is byrow=F , which would store the data column-wise.
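A sketch with illustrative dimensions:

    M <- matrix(data = 1:12, nrow = 3, ncol = 4, byrow = TRUE)
    M                                  # the numbers 1 to 12, stored row-wise
    matrix(1:12, nrow = 3, ncol = 4)   # byrow = FALSE (the default) stores them column-wise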
– Arrays are more general data structures than vectors and matrices in
the sense that they can have more than two dimensions. For
instance,

creates a three-dimensional array with elements.


– A list can contain objects of different types. For example, a list
element can be a vector or matrix. Lists can be initialized by the
command list and can grow dynamically. It is important to
understand that list elements should be accessed by the name of the
entry via the dollar sign or using double brackets:
– A data frame is the standard data structure for storing a data set with
rows as observations and columns as variables. Many statistical
procedures in R (such as the lm function for linear models) expect a
data frame. A data frame is conceptually not much different from a
matrix and can either be initialized by reading a data set from an
external file or by binding several column vectors. As an example,
we consider three variables (age, favourite hobby, and favourite
animal) each with five observations:

The resulting output is

where NA denotes a missing value. With write.table or a specialized version thereof such as write.csv (for writing the data in a file using comma-separated fields), a data frame can be
saved in a file. The command sequence
saves the data frame as an external (comma-separated) file and then
loads the data again using read.csv .
Individual elements of the data frame can be accessed using squared
brackets, as for matrices. For example, dat[1,2] returns the first
observation of the second variable column for the data set dat . Individual
columns (variables) can also be selected using the $ sign:

returns the age column:
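A sketch of how such a data frame could be built and accessed; the values are illustrative only:

    Age    <- c(25, 33, 30, 40, NA)      # NA marks a missing value
    Hobby  <- factor(c("Reading", "Sports", "Music", NA, "Gardening"))
    Animal <- factor(c("Cat", "Dog", NA, "Horse", "Bird"))
    dat <- data.frame(Age, Hobby, Animal)
    dat[1, 2]                            # first observation of the second variable
    dat$Age                              # the age column
    write.csv(dat, file = "dat.csv", row.names = FALSE)  # save as comma-separated file
    dat2 <- read.csv("dat.csv")          # read the data back in (character columns may
                                         # need to be converted to factors again)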

The factor command is very useful to store nominal variables, and the
command ordered is ideal for ordinal variables. Both commands are
extremely important since factor variables with more than two categories
are automatically expanded into several columns of dummy variables if
necessary, e.g. if they are included as covariates in a linear model. In the
previous paragraph, two factor variables have already been created. This
can be confirmed by typing

which return the value TRUE . Have a look at the following two factor
variables:

Please note that by default alphabetical order is used to order the categories
(e.g. female is coded as 1 and male as 2). However, the mapping of integers
to strings can be controlled by the user, as seen for the variable "grade" in the sketch below.
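A sketch with two illustrative variables:

    Gender <- factor(c("male", "female", "female", "male"))
    levels(Gender)             # alphabetical by default: "female" "male"
    as.numeric(Gender)         # female is coded as 1, male as 2
    Grade <- ordered(c("low", "high", "medium", "low"),
                     levels = c("low", "medium", "high"))
    Grade                      # Levels: low < medium < high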
Basic arithmetic operations can be applied directly to a numeric vector.
Basic operations are addition (+), subtraction (-), multiplication (*),
division (/), integer division (%/%), modulo operation (%%), and
exponentiation with two possible notations: ^ or **. Examples are given as:

The last four commands show the "recycling property" of R . It tries to match the vectors with respect to the length if possible. In fact,

is expanded to

The last example shows that R gives a warning if the length of the shorter
vector cannot be expanded to the length of the longer vector by a simple
multiplication with a natural number ( ). Here
is expanded to

such that not all elements of

are “recycled”.

More on indexing
The standard ways of accessing/indexing elements in a vector, matrix, list, or
data frame have already been introduced above, but R allows a lot more flexible
accessing of elements.

1. Selecting elements using vectors of positive numbers ( letters and LETTERS show the 26 letters of the alphabet)

2. Selecting elements using logical vectors

returns the output


3. Selecting (deleting) elements using negative numbers

returns the output

because the first five elements have been removed.

4. Selecting elements using characters

returns the output
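A compact sketch of these four indexing styles (the vector used here is illustrative):

    x <- 1:26
    names(x) <- LETTERS    # label the 26 elements with the letters of the alphabet
    x[c(1, 3, 5)]          # 1. selection with a vector of positive numbers
    x[x > 20]              # 2. selection with a logical vector
    x[-(1:5)]              # 3. negative numbers delete the first five elements
    x[c("A", "Z")]         # 4. selection using character names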

Standard Functions
Some standard functions and their roles in R are
All functions can again be applied directly to numeric vectors.

Statistical Functions
Some statistical functions and their roles in R are

Note : the arguments of the functions vary depending on the chosen method. For
example, the mean() function can be applied to general R objects where
averaging makes sense (numeric or logical vectors, but also, e.g. matrices). The
functions var(), cov(), cor() expect one or two numeric vectors,
matrices, or data frames. Minimum and maximum functions work also with a
comma-separated list of values, i.e.

provides the same result ( 1.2 ) as

Examples:

Note that var(), cov() use the factor 1/(n-1) for the unbiased estimate of
the variance instead of 1/n for the empirical variance, i.e. 1/3 in the example
above. Both functions can also be applied to several vectors or a matrix. Then
the covariance matrix (and correlation matrix in case of cor() ) is computed.
For example, consider two variables

Then both commands return the symmetric covariance matrix (with variances as
the diagonal entries and covariances as the non-diagonal entries).
The (Pearson) correlation between the two variables is calculated as 0.9954293.

The Spearman rank correlation is perfectly 1, since both vectors are in increasing
order:
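A sketch with two illustrative vectors (these values do not reproduce the numbers quoted above):

    x <- c(1, 2, 3, 4)
    y <- c(1, 2, 4, 9)
    var(x)                          # uses the factor 1/(n-1), i.e. 1/3 for n = 4
    cov(x, y)                       # covariance of two vectors
    cov(cbind(x, y))                # covariance matrix (variances on the diagonal)
    cor(x, y)                       # Pearson correlation
    cor(x, y, method = "spearman")  # Spearman rank correlation: 1, both vectors increasing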

More Useful Functions


Some more commonly used standard functions and their roles in R are as
follows:
Cumulative sum and product:

give the output

Factorial:
returns 5! as

Binomial coefficient :

returns as
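Illustrative calls:

    cumsum(c(1, 2, 3, 4))    # 1 3 6 10
    cumprod(c(1, 2, 3, 4))   # 1 2 6 24
    factorial(5)             # 5! = 120
    choose(5, 2)             # binomial coefficient "5 over 2" = 10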

Mathematical Constants
The number π is a mathematical constant, the ratio of a circle's circumference to its diameter, and is commonly approximated as 3.14159. It is directly available in R as pi .

Other “constants” are

Assignment Operator for Creating Functions


The assignment operator <- ("less than" sign followed by a hyphen) has already
been introduced above in the context of variables. Alternatively, = (the equality
sign) can be used. One can create one's own functions: the function is an object
with a name which takes values specified in the round brackets and returns what
has been specified in the curly braces. For example, the following function
myfunction returns a polynomial of degree 3 for a given argument x . Note
that by default all four coefficients are equal to 1.
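One possible definition consistent with this description (the coefficient names are an assumption):

    myfunction <- function(x, a = 1, b = 1, c = 1, d = 1){
      a + b * x + c * x^2 + d * x^3
    }
    myfunction(2)             # 1 + 2 + 4 + 8 = 15
    myfunction(2, d = 0)      # coefficients can be changed when calling the function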
Loops and Conditions
The concept of loops is convenient when some operation has to be repeated.
Loops can be utilized in various ways, for example, via for or while .
Conditions are specified with the if statement. For example,

returns

In this example, x is a vector with 10 elements: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. For each element x[i] , we replace x[i] with x[i]+i if the condition x[i]>5 is true; otherwise we do not.
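A sketch of the loop described above:

    x <- 1:10
    for(i in 1:10){
      if(x[i] > 5){ x[i] <- x[i] + i }
    }
    x    # 1 2 3 4 5 12 14 16 18 20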

Statistical Functions
Now we consider some basic statistical functions in R . For illustration, we use
the painters data in the following example. This data is available after
loading the library MASS (only a subset is shown below). The data lists the
subjective assessment, on a 0 to 20 integer scale, of 54 classical painters. The
painters were assessed on four characteristics: composition, drawing, colour, and
expression. The data is due to the eighteenth-century art critic, de Piles. Use ?
painters for more information on the data set.
shows

The Summary Function


The summary function allows a quick overview of a data frame. For numeric
variables, the five-point summary (which is also used in a simple box plot, see
Sect. 3.3 ) is calculated together with the arithmetic mean. For factor variables,
the absolute frequencies of the most frequent categories are printed. If the factor
has more than six categories, the other categories are summarized in a separate
category— Other .

yields

The summary function can also be applied to a single variable:

returns

Accessing Subgroups in Data Frames


Subgroups, i.e. groups of observations which share the same feature(s) of one or
several variables, can be accessed using the subset command.
accesses all painters for which School==“F” holds.

This is a more elegant method than selecting these observations by specifying a condition in squared brackets via the [rows,columns] argument.

Note that the explained structure is an important one: we access the rows and
columns of a matrix or data set by using the [rows,columns] argument.
Here we access all rows for which the variable “school” is “F”. If, in addition,
we also want to restrict the data set to the first two variables, we can write:

Similarly,

gives the output

Uninteresting columns can be eliminated using negative indexing. For instance, in the following example, the third and the fifth columns (Colour and School) are not shown.
The operator %in% allows for more complex searches. For instance,

returns the following output:

i.e. those painters with a drawing score between 6 and 9 (= any number
which matches 6, or 7, or 8, or 9).
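A sketch of these selections (after loading the painters data from the library MASS):

    subset(painters, School == "F")                 # all painters from school F
    subset(painters, School == "F", select = 1:2)   # ...restricted to the first two variables
    painters[painters$Drawing %in% 6:9, -c(3, 5)]   # drawing score between 6 and 9,
                                                    # columns 3 (Colour) and 5 (School) dropped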

Stratifying a Data Frame and Applying Commands to a List
Sometimes it is of interest to apply statistical commands (such as summary ) to
several subgroups. If this is the case, the data is partitioned into different groups
using split and then lapply applies a function to each of these groups. The
command split partitions the data set by values of a specific variable. For
example, we first stratify the painters data with respect to the painter’s
school:
Note that splitted is now a list,

returns

while the objects splitted$A to splitted$H are data frames.

returns

Secondly, as indicated above, the command lapply allows us to apply a function to a list. For instance,
applies the summary function to all data frames in the list splitted (output
not shown). See also ?apply, ?sapply, ?tapply , and ?mapply for
similar operations.
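A sketch of this two-step procedure:

    library(MASS)                    # provides the painters data
    splitted <- split(painters, painters$School)
    class(splitted)                  # "list"
    class(splitted$A)                # "data.frame"
    lapply(splitted, summary)        # apply summary() to each school's data frame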

Sorting, Ranking, Finding Duplicates, and Unique Values


Sorting a vector:

returns the ordered values in decreasing order as

See also the command order for showing the order of vector elements.
Calculating ranks:

returns the following output:

Finding duplicate values:

indicates which values occur more than once:

Removing duplicates:

shows the output as


This means that unique returns the distinct values of a vector and thus shows how many different values the vector has.

Categorizing Numeric Variables


Continuous variables (vectors) can be categorized using the cut command.

returns

Random Variables
R has built-in functions for several probability density/mass functions
(PMF/PDF), (probability) distribution function (i.e. the CDF), quantile
functions and for generating random numbers.
The function names use the following scheme:

Examples:

returns the value of the density function (i.e. ) of a N (0, 1)-


distribution at , which is .
returns the value of the CDF of a N (0, 1)-distribution at q , i.e.
.

returns the value

which is the quantile of a N (0, 1)-distribution.

returns a vector of four normal random numbers of a N (0, 1)-distribution:

Note that a repeated application of this function leads to different random


numbers. To get a reproducible sequence of random numbers, a seed value
needs to be set:

If all three commands are executed, then the sequence is (using the standard
random generator)
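A sketch with illustrative arguments:

    dnorm(x = 0)        # density of the N(0, 1) distribution at 0: 0.3989423
    pnorm(q = 1.96)     # CDF of the N(0, 1) distribution at 1.96: 0.9750021
    qnorm(p = 0.975)    # 97.5 % quantile of the N(0, 1) distribution: 1.959964
    set.seed(123)       # fix the seed so that random numbers become reproducible
    rnorm(n = 4)        # four N(0, 1) random numbers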

The following functions for distributions can be used:


Model distributions

Function Distribution
beta Beta
binom Binomial
cauchy Cauchy
exp Exponential
gamma Gamma
geom Geometric
hyper Hypergeometric
lnorm Log–normal
norm Normal
pois Poisson
unif Uniform
mvnorm Multivariate normal (in package mvtnorm )

Test distributions

Function   Distribution
chisq      Chi-squared
f          F
signrank   Wilcoxon signed rank (1 sample)
t          t
wilcox     Wilcoxon rank sum (2 samples)

For convenience, we list a few important PDF and CDF values in Sect. C .

Key Points and Further Issues


Data Sets
From the data used in this book, we publish some relevant data sets, along with
solutions of the R -exercises, on https://chris.userweb.mwn.de/book/ . The
important data sets are explained in the following paragraphs.

Pizza Delivery Data


The pizza delivery data ( pizza_delivery.csv , see also Table A.1 ) is a
simulated data set. The data refers to an Italian restaurant which offers home
delivery of pizza. It contains the orders received during a period of one month:
May 2014. There are three branches of the restaurant. The pizza delivery is
centrally managed: an operator receives a phone call and forwards the order to
the branch which is nearest to the customer’s address. One of the five drivers
(two of whom only work part time at the weekend) delivers the order. The data
set captures the number of pizzas ordered as well as the final bill (in €) which
may also include drinks, salads, and pasta dishes. The owner of the business
observed an increased number of complaints, mostly because pizzas arrive too
late and too cold. To improve the service quality of his business, the owner
wants to measure (i) the time from call to delivery and (ii) the pizza temperature
at arrival (which can be done with a special device). Ideally, a pizza arrives
within 30 min of the call; if it takes longer than 40 min, then the customers are
promised a free bottle of wine (which is not always handed out though). The
temperature of the pizza should be above 65 °C at the time of delivery. The
analysis of the data aims to determine the factors which influence delivery time
and temperature of the pizzas.
Table A.1 First few rows of the pizza delivery data

Decathlon Data
This data ( decathlon.csv , see also Table A.2 ) describes the results of the
decathlon competition during the 2004 Olympic Games in Athens. The
performance of all 30 athletes in the 100 m race (in seconds), long jump (in
metres), shot-put (in metres), high jump (in metres), 400 m race (in seconds),
110 m hurdles race (in seconds), discus competition (in metres), pole vault (in
metres), javelin competition (in metres), and 1500 m race (in seconds) are
recorded in the data set.
Table A.2 First few rows of the decathlon data from the 2004 Olympic Games in Athens data
Theatre Data
This data ( theatre.csv , see also Table A.3 ) summarizes a survey
conducted on 699 participants in a big Swiss city. The survey participants are all
frequent visitors to a local theatre and were asked about their age, sex (gender,
female = 1), annual income (in 1000 SFR), general expenditure on cultural
activities (“Culture”, in SFR per month), expenditure on theatre visits (in SFR
per month), and their estimated expenditure on theatre visits in the year before
the survey was done (in SFR per month).
Table A.3 First few rows of the theatre data

Appendix B: Solutions to Exercises
Solutions to Chapter 1
Solution to Exercise 1.1

(a) The population consists of all employees of the airline. This may include
administration staff, pilots, stewards, cleaning personnel, and others. Each
single employee relates to an observation in the survey.

(b) The population comprises all students who take part in the examination.
Each student represents an observation.

(c) All people suffering from high blood pressure in the study area (city, province, country, ...) are the population of interest. Each of these persons is an observation.

Solution to Exercise 1.2


The population in this study refers to all leopards in the park. Only a few of the
leopards are equipped with the GPS devices. This is the sample on which the study is conducted. Each leopard refers to an observation . The measurements
are taken for each leopard in the sample. The GPS coordinates allow to
determine the position during the entire day. Important variables to capture
would therefore be “latitude”, “longitude”, and “time”. Each
variable would take on certain values for each observation; for example, the first
leopard may have been observed at latitude at a certain time point, and thus
.

Solution to Exercise 1.3

Solution to Exercise 1.4


(a) The choice of a political party is measured on a nominal scale. The names
of the parties do not have a natural order.

(b) Typically the level of a computer game is measured on an ordinal scale: for
example, level 10 may be more difficult than level 5, but this does not
imply that level 10 is twice as difficult as level 5, or that the difference in
difficulty between levels 2 and 3 is the same as the difference between
levels 10 and 11.

(c) The production time of a car is measured on a continuous scale (ratio


scale). In practice, it may be measured in days from the start of the
production.

(d) This variable is measured on a continuous scale (ratio scale). Typically, the
age is captured in years starting from the day of birth.

(e) Calendar year is a continuous variable which is measured on an interval scale. Note that the year which we define as "zero" is arbitrary, and it varies from culture to culture. Because the year zero is arbitrary, and we also have dates before this year, the calendar year is measured on an interval scale.

(f) The scale is continuous (ratio scale).

(g) The scale of ID numbers is nominal. The ID number may indeed consist of
numbers; however, “112233” does not refer to something half as
much/good as “224466”. The number is descriptive.

(h) The final rank is measured on an ordinal scale. The ranks can be clearly
ordered, and the participants can be ranked by using their final results.
However, the first winner may not have "double" the beauty of the second winner; it is merely a ranking.

(i) The intelligence quotient is a variable on a continuous scale. It is


constructed in such a way that differences are interpretative—i.e. being 10
points above or 10 points below the average score of 100 points means the
same deviation from the average. However, ratios cannot be interpreted, so
the intelligence quotient is measured on an interval scale.

Solution to Exercise 1.5

(a) The data is provided in .csv format. We thus read it in with the
read.csv() command (after we have set a working directory with
setwd() ):

(b) The data can be viewed by means of the fix() or View() command or
simply being printed:

(c) We can access the data, as for any matrix, by using squared brackets [, ],
see also Appendix A.1. The first entry in the brackets refers to the row and
the second entry to the columns. Each entry either is empty (referring to
every row/column) or consists of a vector or sequence describing the
columns/rows we want to select. This means that we obtain the first 5 rows
and variables via pizza[1:5,1:5] . If we give the new data the name
“pizza2” we simply need to type:

We can save this new data either as a .dat file (with write.table() ),
or as a .csv file (with write.csv() ), or directly as an R data file (with
save() ) which gives us access to our entire R session.
(d) We can access any variable by means of the $ sign. If we type pizza$new
we create a new variable in the pizza data set called “new”. Therefore, a
simple way to add a variable to the data set is as follows:

(e)

(f) We can apply all these commands onto the object “pizza”. The command
str(pizza) gives us an overview of the data set, see also Fig. B.1 . The
output shows that the data set contains 1266 observations (deliveries) and
13 variables. We can also see which of these variables are factors
(categorical with defined categories) and which are numerical. We also see
the first actual numbers for the each variable and the coding scheme used
for the categories of the factor variables. The command dim summarizes
the dimension of the data, i.e. the number of rows and columns of the data
matrix. Colnames gives us the names of the variables from the data set,
and so does names . The commands nrow and ncol give us the number
of rows and columns, respectively. Applying head and tail to the data
prints the first and last rows of the data, respectively.
Fig. B.1 Applying str() to the pizza data

Solution to Exercise 1.6

(a) The appropriate study design is a survey. The information would be


obtained via a questionnaire given to a sample of parents. It is not a
controlled experiment because we do not manipulate one particular
variable, while controlling others; we rather collect data on all variables of
interest.

(b) There are different options to ask for parents’ attitudes: of course one could
simply ask “what do you think of immunization?”; however, capturing long
answers in a variable “attitude” may make it difficult to summarize and
distil the information obtained. A common way to deal with such variables
is to translate a concept into a score: for example, one could ask 5
“yes/no”-type questions (instead of one general question) which relate to
attitudes towards immunization, such as “do you think immunization may
be harmful for your child?” or “do you agree that it is a priority to
immunize infants in their first year of life?” The number of answers that
show a positive attitude towards immunization can be summed up. If there
are 5 questions, there are up to 5 points “to earn”. Thus, each parent may
be asked 5 questions and his/her attitude can be summarized on a scale
ranging from 0 to 5, depending on the answers given.

(c) The following variables are needed:


Attitude: possibly several variables are needed to capture parents’
information in a score, see (b) for details. The scale is ordinal because
a higher score relates to a more positive attitude towards
immunization, but the differences between different score levels
cannot be interpreted meaningfully.
Immunized: a binary (“yes–no” type) variable capturing whether the
parent agrees to immunization against chickenpox for their youngest
child or not. This is a nominal variable.
Gender: to compare “Immunized” for male and female parents. This is
a nominal variable.
Age: to compare the age distribution in the group of parents who
would immunize their child with the age distribution in the group who
would not immunize their child. Age is measured on a continuous
scale.

(d) A data set might look as follows:

where refer to variables capturing attitudes towards


immunization and “Attitude” is the score variable summarizing these
questions. The questions may be written down as follows:
What is the average attitude score towards immunization among
parents and how much does it vary?
What percentage of parents answer “yes” when asked whether they
would immunize their youngest child against chickenpox?
What is the difference in the proportion calculated in (b) when
stratified by gender?
What is the average age of those parents who would immunize their
child compared with the average age of those parents who would not
immunize their child?
Solutions to Chapter 2
Solution to Exercise 2.1

(a) The table shows the relative frequencies of each party and not the absolute
frequencies. We can thus draw a bar chart where the relative frequencies of
votes are plotted on the y -axis and different parties are plotted on the x -
axis. In R , we can first type in the data by defining two vectors and then
use the “barplot” command to draw the bar chart (Fig. B.2 a). Typing “?
barplot” and “?par” shows the graphical options to improve the
presentation and look of the graph:

(b) There are several options to compare the results. Of course, one can simply
plot the two bar charts with each bar chart representing one election. It
would be important for this solution to ensure that the y -axes are identical
in both the plots. However, if we want to compare the election results in
one graph then we can plot the difference between the two elections, i.e.
the win/loss per party. The bar chart for the new variable “difference in
proportion of votes between the two elections” is shown in Fig. B.2 and is
obtained as follows:
Fig. B.2 Bar charts for national elections in South Africa

Remark
Another solution would be to create subcategories on the x -axis: for example,
there would be categories such as “ANC 2009 results” and “ANC 2014 results”,
followed by “DA 2009 results” and “DA 2014 results”, and so on.

Solution to Exercise 2.2

(a) The scale of X is continuous. However, please note that the number of
values X can practically take is limited (90 min plus extra time, recorded in
1 min intervals).

(b) It is straightforward to obtain the frequency table, as well as the other


information needed to obtain the histogram and the ECDF, see Table B.1 .

Table B.1 Frequency table and other information for the variable “time until first goal”

j F(x)

1 [0, 15) 19 15

2 [15, 30) 17 15

3 [30, 45) 6 15

4 [45, 60) 5 15
5 [60, 75) 4 15

6 [75, 90) 2 15

7 [90, 96) 2 6 1

Total 56 1

Fig. B.3 Distribution of time to goal

Fig. B.4 Empirical cumulative distribution function for the variable “time until first goal”
(c) We need to obtain the heights for each category to obtain the histogram using
, see Table B.1 .

(d) We obtain the histogram and kernel density plot in R (Fig. B.3 a) using the
commands

(e) The ECDF values for F ( x ) are calculated using the relative frequencies f ( x
), see Table B.1 .

(f) (i) We can easily plot the ECDF (Fig. B.4 a) for the original data using the
R command

(ii) Generating the ECDF for the grouped data requires more effort and is
not necessarily intuitive: first we categorize the continuous variable
using the function cut . We use the label option to indicate that the
name of each category corresponds to the upper limit of the respective
interval. This new variable is a “factor” variable and the plot.ecdf
function does not work with this type of variable. We need to first
change the “factor” variable into a “character” variable with strings
corresponding to the labels and coerce this into numeric values. Then
we use plot.ecdf , see also Fig. B.4 b. Alternatively, we can
directly replace the raw values with numbers corresponding to the upper
interval limits.
(g) To solve the exercises, we simply use Formula ( 2.11 ) and Rules ( 2.3 ff.)

(i) .

(ii) .

(iii)
.

(h) We know from ( 2.11 ) that

with which relates to

We are interested in $\tilde{x}_{0.8}$ because 80 % of the matches have seen a goal at this time point:

We conclude that 80 % of the “first goals” happened up to 50.4 min.

Solution to Exercise 2.3

(a) We obtain the relative frequencies for the first and fourth intervals as 0.125
( ). Accordingly, for the other two intervals, we obtain
frequencies of .

(b) We obtain the absolute frequencies for the first and fourth intervals as 250 (
). For the other intervals, we obtain 750 ( ).
Solution to Exercise 2.4

(a) The absolute frequencies are evident from the following table:

1 8 14 0.25 0.25 6 11

2 14 22 0.40 0.15 75 8 18
3 22 34 0.75 0.35 175 12 28
4 34 50 0.97 0.22 110 16 42
5 50 82 1.00 0.03 15 32 66

(b) We obtain .

Solution to Exercise 2.5

(a) The data needed to calculate and draw the ECDF is summarized in Table
B.2 ; the ECDF is plotted in Fig. B.5 .

Table B.2 Information needed to calculate the ECDF

Score 1 2 3 4 5 6 7 8 9 10
Results 1 3 8 8 27 30 11 6 4 2

1
Fig. B.5 Empirical cumulative distribution function for the variable “score”

(b) It follows from Table B.2 that and .

(c) The grey solid line in Fig. B.5 summarizes the ECDF for the grouped data.
It goes from (0, 0) to (1, 1) with a breakpoint at (5, 0.47) since $F(5) = 0.47$
summarizes the information for the group "disagree". Using ( 2.11 ) we
calculate:

(d) The results of (b) and (c) are very different. The formula applied in (c)
assumes that the values in each category are uniformly distributed, i.e. that
within each category, each value occurs as often as each other value.
However, we know from (a) that this is not true: there are more values
towards the central score numbers. The assumption used in (c) is therefore
inappropriate as also highlighted in Fig. B.5 .

Solution to Exercise 2.6


We read in and attach the data as follows:
(a) We need the options ylim, xlim, ylab, xlab, col to adjust the
histogram produced with hist() . We then add a dashed ( lty=2 ) line (
type=’l’ ), which is thick ( lwd=3 ), from (65, 0) to (65, 400) using the
lines() command. See also Fig. B.6 a.

(b) We can create the histogram as follows, see also Fig. B.6 b:

(c) A possible solution is as follows, see also Fig. B.6 c:

(d) We can produce the graph (Fig. B.6 d) as follows:


Fig. B.6 Creating figures with R
Solutions to Chapter 3
Solution to Exercise 3.1

(a) The arithmetic means can be calculated as follows:

The ordered values of the two variables are:

(b) We have which is not an integer. We can therefore


calculate the 25 % quantiles, i.e. the first quartiles, as

Similarly, is not an integer and thus

One can see that the distributions for both variables are not symmetric. For
example, when looking at the distance hiked, the difference between the
median and the first quartile ( ) is much larger than the
difference between the median and the third quartile ( ); this
indicates a distribution that is skewed towards the left.

(c) We can calculate the interquartile ranges as follows:


The mean absolute deviations are:

The variances of both the variables are

The standard deviations are therefore and .

(d) To answer this question, we can use the rules of linear transformations.

(e) To draw the box plots, we can use the results from (a), (b), and (c) in this
solution. The median, first quartile, third quartile, and the interquartile
range have already been calculated. It is also easy to determine the
minimum and maximum values from the table of ordered values in (a).
What is missing is the knowledge of whether to treat some of the values as
extreme values or not. For the distance hiked, an extreme value would be
any value or . It follows that there
is only one extreme value: 29.9 km. For the maximum altitude, there is no
extreme value because there is no value or
. The box plots are shown in Fig. B.7 a, b.
Fig. B.7 Box plots for Exercise 3.1

(f) The data can be summarized as follows:

Class intervals (5; 15] (15; 20] (20; 30]


4 4 2

4/10 4/10 2/10

4/10 8/10 1

We can calculate the weighted arithmetic mean by using the relative


frequencies and the middle of the intervals :

To estimate the weighted median, we need to determine the class for which

holds. This is clearly the case for the second class . Thus

(g) If the raw data is known, then the variance for the grouped data will be
identical to the variance calculated in (c). For educational purposes, we
show the identity here. The variance for grouped data can be calculated as:

Using the arithmetic mean as well as the means in each class,


, and , we can calculate the variance
between the classes:

The variances within each class are:

We thus get

The total variance is therefore . The


results will typically differ if we do not know the raw data: we have to
replace the arithmetic means within each class, , with the middle of each
class , i.e. :

We further get

and
The variance is . The approximation is
therefore good. However, please note that the between-class variance was
estimated too low, but the within-class variance was estimated too high;
only the combination of the two variance components led to reasonable
results in this example. It is evident that the approximation in the third
class was not ideal. The middle of the interval, 25, was not a good proxy
for the true mean in this class, 28.65.

(h) It is easy to calculate the mean and the median:

We can use the quantile function, together with the probs option, to
get the quantiles:

However, the reader will see that the results differ slightly from our results
obtained in (b). As noted in Example 3.1.5 , R offers nine different ways to
obtain quantiles, each of which can be chosen by the type argument. The
difference between these options cannot be understood easily without a
background in probability theory. It may, however, be worth highlighting
that we get the same results as in (b) if we choose the type=2 option in
this example. The interquartile ranges can be calculated by means of the
difference of the quantiles obtained above. To determine the mean absolute
deviation, we have to program our own function:
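One possible version of such a function; the variable name distance is an assumption about how the data was stored, and the reference point can be chosen as the mean or the median:

    absdev <- function(x, center = mean(x)){ mean(abs(x - center)) }
    absdev(distance)                     # deviations around the arithmetic mean
    absdev(distance, median(distance))   # or around the median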
We can calculate the variance using the var command. However, as noted
in Example 3.2.4 , on p. 52, R uses 1/(n-1) rather than 1/n when
calculating the variance. This important alternative formula for variance
estimation is explained in Chap. 9, Theorem 9.2.1 . To obtain the results
from (c), we hence need to multiply the output from R by (n-1)/n:

The box plots can be drawn by using the boxplot command:

The weighted mean is obtained as follows:

Solution to Exercise 3.2

(a) We need to solve the equation that defines the arithmetic mean:

(b) The mode is . Using and , respectively, we can


determine the quartiles as follows:
The interquartile range is .

(c) It is not possible to use the coefficient of variation because some of the
values are negative.

Solution to Exercise 3.3


We calculate

Using the formula for the arithmetic mean for grouped data,

we further get

Similarly, we can use the Theorem of Variance Decomposition, Eq. ( 3.27 ), to


calculate the variance for the grouped data:

This yields

Solution to Exercise 3.4

(a) We evaluate a period of 6 years which means that . To


determine the average growth rate, we need to first calculate the geometric
mean. There are two options to facilitate this:
(i) Solution:

(ii) Easier solution:

Since , the average growth rate is %.

(b) To predict the number of members in 2018, we could apply the average
growth rate to the number of members in 2016 for two consecutive years:

(c) We could use the approach outlined in (b) to predict the number of
members in 2025. However, this assumes that the average growth rate
between 2011 and 2016 remains valid until 2025. This is rather unrealistic.
The number of members of the club increases in some years, but decreases
in other years. There is no guarantee that the pattern observed until 2016
can be generalized to the future. This highlights that statistical
methodology should in general be used with care when making long-term
future predictions.

(d) The invested money is € 250 million. We deal with partially


grouped data because the club’s members invest in groups, but the invested
sum is clearly defined and not part of a group or interval. We can thus
summarize the data as follows:

(i) 1 2 3 4
Number of members 10 8 8 4
Rel. number of members 10/30 8/30 8/30 4/30

10/30 18/30 26/30 1

Money invested 40 60 70 80
Rel. amount per group 40/250 60/250 70/250 80/250

40/250 100/250 170/250 1

The Lorenz curve is plotted in Fig. B.8 .

Fig. B.8 Lorenz curve

(e) The Gini coefficient can be calculated using formula ( 3.37 ) as follows:

The concentration of investment is thus rather weak.

Solution to Exercise 3.5


Let us look at Fig. 3.8 b first. The quantiles coincide approximately with the
bisection line. This means that the distribution of “length of service” is similar
for men and women. They have worked for about the same amount of time for
the company. For example, the median service time should be approximately the
same for men and women, the first quartile should be very similar too, the third
quartile should be similar too, and so on. However, Fig. 3.8 a shows a somewhat
different pattern: the quantiles for men are consistently higher than those for
women. For example, the median salary will be higher for men. In conclusion,
we see that men and women have a similar length of service in the company, but women earn less.

Solution to Exercise 3.6


There are many ways in which a “mode” function can be programmed. We
present a simple and understandable solution, not the most efficient one. Recall
that table constructs a frequency table. This table shows immediately which
value(s) occur(s) most often in the data. How can we extract it? Applying the
names function on the table returns the names of the values represented in the
table. The only thing we need to do is to choose the name which corresponds to
the value which has been counted most. This yields the following function:
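A possible implementation along these lines (the function name is arbitrary):

    mymode <- function(x){
      tab <- table(x)                  # frequency table of the values
      names(tab)[tab == max(tab)]      # name(s) of the most frequent value(s)
    }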

This function will work in general, though it returns a character vector. Using
as.numeric is one option to make the character string numeric, if necessary.

Solution to Exercise 3.7

(a) In this exercise, we do not have individual data; i.e. we do not know how
much each inhabitant earns. The summarized data simply tells us about the
wealth of two groups. For simplicity and for illustrative purposes, we
assume that the wealth is equally distributed in each group. We determine
as (0.8, 0.1) and (1, 1) because 80 % of the population earn 10 % of
the wealth and 100 % of the population earn everything. The respective
Lorenz curve is illustrated in Fig. B.9 a.
Fig. B.9 Lorenz curves

(b) The upper class lost its wealth. This means that 20 % of the population do
not own anything at all. However, the remaining 80 % owns the rest. This
yields of (0.2, 0) and (1, 1), see also Fig. B.9 b.

(c) In this scenario, 20 % of the population leave the country. However, the
remaining 80 %—which are now 100 % of the population—earn the rest.
The money is equally distributed between the population. Figure B.9 c
shows this situation.

Solution to Exercise 3.8

(a) It is necessary to use the harmonic mean to calculate the average speed.
Using , and
we get

(b) Travelling at 45.2 km/h means travelling about 361 km in 8 h. The bus will
not be in time.

Solution to Exercise 3.9


(a) The sum of investment is €18,020. To calculate and draw the
Lorenz curve, we need the following table:

The curve is presented in Fig. B.10 .

(b) We can calculate the Gini coefficient as follows:

(c) The Gini coefficient remains the same as the relative investment stays the
same.

(d) Using the library ineq we can easily reproduce the results in R :

However, please note that the ineq command calculates the unstandardized Gini coefficient.

Solution to Exercise 3.10

(a) The easiest way to get all these measures is to use the summary function
and apply it to the data columns which relate to quantitative variables:
We then get the following output:

(b) We can use the quantile function:

The results are 48.62 min for delivery time and 79.87 °C for temperature.
This means 99 % of the delivery times are less than or equal to 48.62 min
and 1 % of deliveries take 48.62 min or longer. Similarly, only
1 % of pizzas were delivered with a temperature greater than 79.87 °C.

(c) The following simple function calculates the absolute mean deviation:

(d) We can use the scale , mean , and var commands, respectively.

As one would expect, the mean is zero and the variance is 1 for the scaled
variable.
(e) The boxplot command draws a box plot; the range option specifies the
range for which extreme values are defined. As specified in the help files,
range=0 produces a box plot with no extreme values.

The box plots are displayed in Fig. B.11 .

(f) We use the cut command to create a variable which has the categories
, respectively. Using the interval mid-
points, as well as the relative frequencies in each class (obtained via the
table command), we get:

The weighted mean is very similar to the mean from the raw data, see
output above.

(g) The plots can be reproduced by using the qqplot command:


Fig. B.10 Lorenz curve for investment sum

Fig. B.11 Box plots


Solutions to Chapter 4
Solution to Exercise 4.1

(a) We need the following table to calculate the correlation coefficient R :

Café (i)   x_i   R(x_i)   y_i   R(y_i)   d_i   d_i^2

1          3     1        6     2        -1    1
2          8     4        7     3         1    1
3          7     3        10    5        -2    4
4          9     5        8     4         1    1
5          5     2        4     1         1    1

There is a moderate-to-strong positive correlation between the ratings


of the two coffee enthusiasts. In general, a high rating from one staff
member implies a rather high rating from the other staff member.

(b) Above we have assigned ranks in an increasing order; i.e. the lowest
gets the lowest rank (1) and the highest gets the highest rank (5). If we
use decreasing order and assign the lowest rank to the highest values, we
get the following results:

Café (i)   x_i   R(x_i)   y_i   R(y_i)   d_i    d_i^2
1          3     5        6     4        1      1
2          8     2        7     3        −1     1
3          7     3        10    1        2      4
4          9     1        8     2        −1     1
5          5     4        4     5        −1     1

As in (a), the sum of the squared rank differences is the same and therefore the results are identical.


Depending on whether ranks are assigned in increasing or decreasing order, the sign of the rank differences d_i changes, but the calculation of R is not affected since only the squared differences enter the calculation; the sign of d_i is thus not important.
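The hand calculation can be cross-checked in R with the five ratings from the table (a sketch):

x <- c(3, 8, 7, 9, 5)            # ratings of person X
y <- c(6, 7, 10, 8, 4)           # ratings of person Y
cor(x, y, method = "spearman")   # Spearman's rank correlation, here 0.6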

(c) We can summarize the ratings in a table:

                  X   Y
Coffee quality  Bad    2   1
                Good   3   4

The odds ratio is . The chance of rating a coffee as good is twice as likely for person X compared to person Y.

Solution to Exercise 4.2

(a) The expected frequencies (under independence) are:

Satisfied Unsatisfied
Car

Car (diesel engine)

Motorbike

We therefore have

The maximum value the statistic can take is , which indicates that there is almost no association between type of vehicle and satisfaction with the insurance. The other measures can be calculated as follows:

The small values of V and confirm that the association is rather weak.

(b) The summarized table looks as follows:

Satisfied Unsatisfied
Car 62 56
Motorbike 12 20

Using ( 4.7 ), we obtain

The maximum value the statistic can take is . The association is therefore weak. The odds ratio is

The chances of being satisfied with the insurance are 1.845 times higher
among those who drive a car.
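As a sketch, the odds ratio for the summarized table can be reproduced in R as follows (object names chosen freely):

tab <- matrix(c(62, 12, 56, 20), nrow = 2,
              dimnames = list(c("Car", "Motorbike"), c("Satisfied", "Unsatisfied")))
OR <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])   # cross-product ratio
OR                                                        # approximately 1.845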

(c) All -based statistics suggest that there is only a small association
between the two variables, for both the and the tables. However,
the odds ratio gives us a more nuanced interpretation, showing that
customers driving a car are somewhat more satisfied with their insurance.
The question is whether the additional information from the odds ratio is
stable and trustworthy. Confidence intervals for the odds ratio can provide
guidance under such circumstances, see Sect. 9.4.4 for more details.
Fig. B.12 Scatter diagram for speed limit and number of deaths

Solution to Exercise 4.3

(a) The scatter plot is given in Fig. B.12 . The black circles show the five
observations. A positive relationship can be discovered: the higher the
speed limit, the higher the number of deaths. However, “Italy” (the
observation on the top right) is the observation which gives the graph a
visible pattern and drives our impression about the potential relationship.

(b) Using and we obtain

and therefore

The correlation coefficient of Spearman is calculated using the following


table:

Country (i)    x_i   R(x_i)   y_i   R(y_i)   d_i    d_i^2
Denmark        55    4.5      4.1   5        −0.5   0.25
Japan          55    4.5      4.7   3        1.5    2.25
Canada         60    2.5      4.3   4        −1.5   2.25
Netherlands    60    2.5      5.1   2        0.5    0.25
Italy          75    1        6.1   1        0      0

This yields

Please note that above we averaged the ranks for ties. The R function cor
uses a more complicated approach; this is why the results differ slightly
when using R .

(c) The results stay the same. Pearson’s correlation coefficient is invariant with
respect to linear transformations which means that it does not matter
whether we use miles/h or km/h.
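Both correlation coefficients, and the invariance of Pearson's coefficient under the change of units, can be checked with a short sketch using the data from the table:

speed  <- c(55, 55, 60, 60, 75)           # speed limits (miles/h)
deaths <- c(4.1, 4.7, 4.3, 5.1, 6.1)      # number of traffic deaths
cor(speed, deaths)                        # Bravais-Pearson correlation
cor(speed, deaths, method = "spearman")   # Spearman's correlation (handling of ties may differ slightly)
cor(speed * 1.609, deaths)                # identical result when using km/h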

(d) (i) The grey square in Fig. B.12 represents the additional observation.
The overall pattern of the scatter plot changes with this new
observation pair: the positive relationship between speed limit and
number of traffic deaths is not so clear anymore. This emphasizes
how individual observations may affect our impression of a scatter
plot.

(ii) Based on the same approach as in (b), we can calculate


, and . A single observation changes our conclusions from a strong positive
relationship to a moderate-to-weak relationship. It is evident that
Pearson’s correlation coefficient is volatile and may be affected
heavily by outliers or extreme observations.

Solution to Exercise 4.4

(a) The contingency table for the absolute frequencies is as follows:

1. Class 2. Class 3. Class Staff Total


Rescued 202 125 180 211 718
Not rescued 135 160 541 674 1510
Total 337 285 721 885 2228

(b) To obtain the conditional relative frequency distributions, we calculate the


proportion of passengers rescued ( X ) for each travel class ( Y ). In the
notation of Definition 4.1.1 , we determine for all i
and j . For example, . This yields

1. Class 2. Class 3. Class Staff


Rescued 0.5994 0.4386 0.2497 0.2384
Not rescued 0.4006 0.5614 0.7503 0.7616

It is evident that the proportion of passengers being rescued differs by


travel class. It can be speculated that there is an association between the
two variables pointing towards better chances of being rescued among
passengers from higher travel classes.

(c) Using ( 4.3 ), we get

1. Class 2. Class 3. Class Staff Total


Rescued 108.6 91.8 232.4 285.2 718
Not rescued 228.4 193.2 488.6 599.8 1510
Total 337 285 721 885 2228

which can be used to calculate and V as follows:

The value of V indicates a moderate or weak relationship between the two


variables. This is in contradiction to the hypothesis derived from the
conditional distributions in (b).

(d) The table is as follows:

1. Class/2. Class 3. Class/Staff Total


Rescued 327 391 718
Not rescued 295 1215 1510
Total 622 1606 2228

Using ( 4.7 ) we get

and therefore .
There are several relative risks that can be calculated, for example:

The proportion of passengers who were rescued was 2.16 times higher
in the 1./2. class compared to the 3. class and staff. Similarly, the
proportion of passengers who were not rescued was only 0.63 times as high in the 1./2. class compared to the 3. class and staff. The odds ratio is . This is nothing but the ratio of the two relative risks, i.e. 2.16/0.63. The chance of being rescued (i.e. the ratio rescued/not
rescued) was almost 3.5 times higher for the 1./2. class compared to the 3.
class and staff.
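As a sketch, the test statistic and Cramer's V for the full table from (a) could be computed in R as follows (object names chosen freely):

titanic <- matrix(c(202, 135, 125, 160, 180, 541, 211, 674), nrow = 2,
                  dimnames = list(c("Rescued", "Not rescued"),
                                  c("1. Class", "2. Class", "3. Class", "Staff")))
chi2 <- chisq.test(titanic)$statistic                        # Pearson's chi-square statistic, approx. 182.06
V <- sqrt(chi2 / (sum(titanic) * (min(dim(titanic)) - 1)))   # Cramer's V
chi2; V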

(e) While Cramer’s V led to rather conservative conclusions regarding a


possible relationship of travel class and rescue status, the conditional
relative frequency distribution, the relative risks, and the odds ratio support
the hypothesis that, at least to some degree, the chances of being rescued
were higher in better travel classes. This makes sense because better travel
classes were located towards the top of the ship with the best access to the lifeboats, while both third-class passengers and large numbers of the staff were located and working at the bottom of the ship, where the water came in first.

Solution to Exercise 4.5

(a) Using ( 4.17 ) and ( 4.18 ), we get

This indicates that there is no linear relationship between temperature and


hotel occupancy.

(b) The scatter plot shows no clear pattern. This explains why the correlation
coefficient is close to 0. However, if we look only at the points for each
city separately, we see different structures for different cities: a possible
negative relationship for Davos (D), a rather positive relationship for
Polenca (P) and no visible relationship for Basel (B). This makes sense
because for winter holiday destinations hotel occupancy should be higher
when the temperature is low and for summer holiday destinations
occupancy should be high in the summer months.

(c) We type in the data by specifying three variables: temperature ( X ),


occupancy ( Y ) and city ( Z ). We then simply use the cor command to
calculate the correlation—and condition on the values of Z which we are
interested in:
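A sketch of such a conditional calculation, assuming the three vectors X (temperature), Y (occupancy) and Z (city) have been typed in and that the cities are coded by their names (the exact coding of Z is an assumption):

cor(X[Z == "Davos"],   Y[Z == "Davos"])     # winter destination
cor(X[Z == "Basel"],   Y[Z == "Basel"])
cor(X[Z == "Polenca"], Y[Z == "Polenca"])   # summer destination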

This yields correlation coefficients of for Davos, 0.42 for Basel and
0.82 for Polenca. It is obvious that looking at X and Y only indicates no
correlation, but the information from Z shows strong linear relationships in
subgroups. This example shows the limitations of using correlation
coefficients and the necessity to think in a multivariate way. Regression
models offer solutions. We refer the reader to Chap. 11 , in particular Sect.
11.7.3 for more details.

Solution to Exercise 4.6

(a) We use the visual rule of thumb and work from the top left to the bottom
right for the concordant pairs and from the top right to the bottom left for
the discordant pairs:

(b)

(c) The table is as follows:

                                  Use a leash
                                  Agree or no.   Disagree   Total
Use for concerts   Agree or no.   137            9          146
                   Disagree       2              10         12
                   Total          139            19         158
(d) The relative risk can either be summarized as:

The proportion of those who disagree with using the park for summer concerts is only 0.03 times as high in the group who agree or have no opinion about using leashes for dogs compared to those who disagree. Similarly,
the proportion of those who disagree with using the park for summer
concerts is 36.6 times higher in the group who also disagree with using
leashes for dogs compared to those who do not disagree.

(e) The odds ratio is .


The chance of not disagreeing with the concert proposal is 36.1 times
higher for those who also do not disagree with the leash proposal.
The chance of not disagreeing with the leash proposal is 36.1 times
higher for those who also do not disagree with the concert proposal.
In simpler words: The chance of agreeing or having no opinion for
one of the questions is 36.1 times higher if the person also has no
opinion or agrees with the other question.

(f)

(g) In general, it makes sense to use all the information available, i.e. to use the
ordinal structure of the data and all three categories. While it is clear that
is superior to V in our example, one may argue that the relative risks or the
odds ratio could be more useful because they provide an intuitive
quantification of how the two variables relate to each other rather than just giving a summary of the strength and direction of the association. However, as we
have seen earlier, the interpretations of the relative risks and the odds ratio
are quite clumsy in this example. It can be difficult to follow the somewhat
complicated interpretation. A simple summary would be to say that
agreement with both questions was strongly associated ( ).

Solution to Exercise 4.7


We use the definition of the correlation coefficient, replace with and
replace with to obtain

This equates to:

Solution to Exercise 4.8

(a) We read in the data, make sure the first column is recognized as containing
the row names, attach the data, and obtain the correlation using the cor
command:

The correlation is 0.52984. This means there is a moderate-to-strong


positive correlation; i.e. the longer the distance the discus is thrown, the
higher the height in the high jump competition.

(b) There are 10 variables. For the first variable, we can calculate the
correlation with 9 other variables. For the second variable, we can also
calculate the correlation with 9 other variables. However, we have already
calculated one out of the 9 correlations, i.e. when analysing variable
number one. So it is sufficient to calculate 8 correlations for the second
variable. Similarly, we need another 7 correlations for the third variable, 6
correlations for the fourth variable, and so on. In total, we therefore need to
have 9 + 8 + ... + 1 = 45 correlations. Since the correlation coefficient
describes the relationship between two variables, it makes sense to
summarize the results in a contingency table, similar to a matrix, see Fig.
B.13 .

(c) Using cor(decathlon) yields the correlation coefficients between all


variable pairs. This is much simpler than calculating the correlation
coefficient for each of the 45 comparisons. Note that the correlation matrix
provides the 45 comparisons both in the upper triangle and in the lower
triangle of the table. We know that , but R still provides us
with both, although they are identical. Note that the diagonal elements are
1 because .

(d) One way to omit rows with missing data automatically is to use the
na.omit command:
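A sketch of the corresponding call, assuming the data frame is called decathlon as above:

decathlon_complete <- na.omit(decathlon)   # drop athletes with missing values
round(cor(decathlon_complete), 2)          # 10 x 10 correlation matrix, cf. Fig. B.13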

The results are displayed in Fig. B.13 . We see moderate-to-strong


correlations between the 100 m race, 400 m race, 110 m hurdle race and
long jump. This may reflect the speed-and-athletic component of the
decathlon contest. We also see moderate-to-strong correlations between the
shot-put, high jump, and discus events. This may reflect the strength-and-
technique component of the contest. The other disciplines also show
correlations which make sense, but they are rather weak and may reflect
the uniqueness of these disciplines.

Fig. B.13 Correlation matrix for the decathlon data


Solution to Exercise 4.9

(a) A possible code is listed below:

We can see that there is a higher proportion of high temperature ((65, 100])
in the category of short delivery times ((0, 30]) compared to long delivery
times ((30, 100]).

(b) Using the data from (a), we can calculate the odds ratio:

Thus, the chances of receiving a cold pizza are only 0.18 times as high if the delivery time is short.

(c) We use the vcd and ryouready packages to determine the desired
measures of association:

Cramer’s V is 0.361, Stuart’s is , and is . The first two


measures indicate a moderate-to-weak relationship between temperature
and delivery time. It is clear from the last two measures that this
relationship is negative, i.e. that shorter delivery times imply higher
temperatures. However, interestingly, and provide us with a different
strengths of association. In any case, it is clear that for shorter delivery
times the customers receive warmer pizzas, as evident from the stacked bar
chart (Fig. B.14 a).

(d) The scatter plot (Fig. B.14 b) shows a decreasing temperature for an
increasing delivery time. This is also highlighted in the correlation
coefficients which are and for Bravais–Pearson and Spearman,
respectively.
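A sketch of the two calls, assuming the delivery time and temperature columns of the attached pizza data are called time and temperature (these names are assumptions):

cor(time, temperature)                        # Bravais-Pearson correlation
cor(time, temperature, method = "spearman")   # Spearman's rank correlation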

(e) It makes sense to compare continuous variables (temperature, number of


pizzas, bill) using the correlation coefficient from Bravais–Pearson. The
relationships between temperature and driver and operator could be
visualized using stratified box plots.

The correlation coefficients are and for the number of pizzas and the bill, respectively. More pizzas and a higher bill are associated with a lower
temperature. The box plots (Fig. B.14 c, d) show variation in the delivery
times of the different drivers, but almost identical delivery times for the
two operators. These results give us a first idea about the relationship in the
data. However, they do not tell us the full story. For example: is the pizza
temperature for higher bills low because a higher bill means more pizzas,
and therefore, a less efficient preparation of the food? Is one driver faster
because he mainly delivers for a particular branch? Could it be that the
operators have a different performance but because they deal with different
branches and drivers, these differences are not visible? To address these
questions, a multivariate perspective is needed. Regression models, as
introduced in Chap. 11 , provide some answers.

Fig. B.14 Plots for Exercise 4.9


Chapter 5
Solution to Exercise 5.1
There are guests and guests shake hands with each other. The order
is not important: two guests shake hands no matter who is “drawn first”. It is
also not possible to shake hands with oneself which means that in terms of the
urn model, we have the “no replacement” case. Therefore, we obtain the solution
as

handshakes in total.

Solution to Exercise 5.2


We assume that it is not relevant to know which student gets tested at which time
point. We are thus dealing with combinations without considering the order. We
have and and therefore obtain:

(a) a total number of (25 choose 5) = 53,130 possibilities.

(b) a total number of (25 + 5 − 1 choose 5) = (29 choose 5) = 118,755 possibilities.

In R , the commands choose(25,5) and choose(29,5) , respectively


provide the required results.

Solution to Exercise 5.3


The board consists of 381 knots. Each knot is either occupied or not. We may thus assume a “drawing” without replacement in the sense that each knot can be drawn (occupied) only once. If we place 64 counters on the board, then we can simultaneously think of “drawing” 64 occupied knots out of a total of 381 knots. The order in which we draw is not relevant: either a knot is occupied or not. The total number of combinations is (381 choose 64). We obtain

the final number in R using the command choose(381,64) .


Solution to Exercise 5.4
We obtain the results (again using the command choose(n,m) in R ) as
follows:

(a) The customer takes the beers “with replacement” because the customer can
choose among any type of beer for each position in the tray. One can also
think of an urn model with 6 balls relating to the six different beers, where
beers are drawn with replacement and placed on the tray. The order in
which the beers are placed on the tray is not relevant. We thus have (20 + 6 − 1 choose 20) = (25 choose 20) = 53,130 combinations.

(b) If the customer insists on having at least one beer per brewery on his tray,
then 6 out of the 20 positions of the tray are already occupied. Based on the
same thoughts as in (a), we calculate the total number of combinations as (14 + 6 − 1 choose 14) = (19 choose 14) = 11,628.
Solution to Exercise 5.5


Since each team has exactly one final place, we have a “without replacement”
situation. Using and (and choose(n,m) in R ) yields

(a) and

(b)

Solution to Exercise 5.6


There are 12 different letters for the 4 positions of the membership code. Each letter can be used more than once if desired, and we thus obtain 12^4 = 20,736 possible combinations. We therefore conclude that sufficient membership
codes are left. However, this may not last long and the book store may still wish
to create another solution for its membership codes.
Solution to Exercise 5.7
For each member of the jury, there are 61 scoring options:

Different jury members are allowed to assign the same scores. We thus deal with
combinations “with replacement”. To verify this, just think of an urn with 61
balls where each ball refers to one possible score. Now one ball is drawn,
assigned to a specific jury member and then put back into the urn. Since each
score is “attached” to a particular jury member, we have combinations with
consideration of the order and therefore obtain a total of 61^9 possibilities. If you have difficulties in understanding the role of “replacement” and “order” in this example, recall that each member has 61 scoring options: thus, 61 · 61 · … · 61 (9 times) = 61^9 combinations are possible.

Solution to Exercise 5.8

(a) We obtain:

(b) Based on the observations from (a) we conclude that each entry on the
diagonal line can be represented by . The sum of two consecutive

entries is thus . Using the fourth relation in ( 5.5 ), it follows

that:
Chapter 6
Solution to Exercise 6.1

(a) We obtain
.

.
.

(b) The solutions are:


.

.
because the events are pairwise disjoint.
.
.

Solution to Exercise 6.2


We know that the probability of failing the practical examination is
, of failing the theoretical examination is , and of failing both is
.

(a) If we ask for the probability of failing in at least one examination, we


imply that either the theoretical examination, or the practical examination,
or both are not passed. We can therefore calculate the union of events
.
(b)

(c)

(d) We are interested in the probability of the person failing exactly in one
exam. This corresponds to
.

Solution to Exercise 6.3


The total number of possible simple events is 12. The number of
favourable simple events is

(a) (i.e. the numbers 2, 4, 6, 8, 10, 12). Hence, .

(b) (i.e. the numbers 10, 11, 12). Hence, .

(c) (i.e. the numbers 10, 12). Hence, .

(d) (i.e. the numbers 2, 4, 6, 8, 10, 11, 12). Hence, .

Solution to Exercise 6.4


The total number of simple events is .

(a) The number of favourable simple events is one and therefore

(b) The number of favourable simple events is because the person draws
two presents out of the ten “wrong” presents:
. In Sect. 8.1.8 , we

explain the hypergeometric distribution which will shed further light on


this exercise.

Solution to Exercise 6.5

(a) Let V denote the event that there is too much salt in the soup and let L
denote the event that the chef is in love. We know that

Similarly, we have

We therefore get:

This yields the following contingency table:

          V      not V    Total
L         0.18   0.12     0.3
not L     0.02   0.68     0.7
Total     0.2    0.8      1

(b) The variables are not stochastically independent since, for example,
.

Solution to Exercise 6.6

(a) We define the following events: G: Basil is treated well, Ḡ: Basil is not treated well; E: Basil survives, Ē: Basil dies. We know that
Using the Law of Total Probability, we get

(b) We can use Bayes’ Theorem to answer the question:

Solution to Exercise 6.7


We define the following events and probabilities
A: Bill never paid, .

M: Bill paid too late,

because someone who never pays will always pay too late.

(a) We are interested in P ( M ), the probability that someone does not pay his
bill in a particular month, either because he is not able to or he pays too
late. We can use the Law of Total Probability to obtain the results:

(b) We can use Bayes’ Theorem to obtain the results:

(c) If the bill was not paid in a particular month, the probability is 20.8 % that
it will never be paid, and 79.2 % that it will be paid. One could argue that a preventive measure that affects almost 79 % of trustworthy customers is not ideal and the bank should therefore not block a credit card if a bill is
not paid on time.

Solution to Exercise 6.8

(a) The “and” operator refers to the joint distribution of two variables. In our
example, we want to know the probability of being infected and having
been transported by the truck. This probability can be directly obtained
from the respective entry in the contingency table: 40 out of 200 cows fulfil
both criteria and thus

(b) We can use to obtain:

(c) Using these results and and , we obtain

by using the Law of Total Probability. We can thus calculate

This means that the probability of a cow being infected is 30 %.


Alternatively, we could have simply looked at the marginal distribution of
the contingency table to get .

Solution to Exercise 6.9

(a) The two shots are independent of each other and thus
(b) We need to calculate

(c)
Chapter 7
Solution to Exercise 7.1

(a) The first derivative of the CDF yields the PDF, :

(b) We know from Theorem 7.2.3 that for any continuous variable
and therefore . We calculate
.

(c) Using ( 7.15 ), we obtain the expectation as

Given that we have already calculated , we can use Theorem 7.3.1 to


calculate the variance as . The expectation of is

We thus obtain .

Solution to Exercise 7.2

(a) The probability mass function of X is


1 2 3 4 5 6

Using ( 7.16 ), we calculate the expectation as

To obtain the variance, we need

Therefore, using , we get

The manipulated die yields on average higher values than a fair die because
its expectation is . The variability of is, however, similar because
.

(b) The probability mass function of is:

The expectation can hence be calculated as

Comparing the results from (a) and (b) shows clearly that .
Recall that . It is interesting to see that for some
transformations T ( X ) it holds that , but for some it does
not. This reminds us to be careful when thinking of the implications of
transformations.

Fig. B.15 Cumulative distribution function for the proportion of wine sold

Solution to Exercise 7.3

(a) There are several ways to plot the CDF. One possibility is to define the
function and plot it with the curve command. Since the function has
different definitions for the intervals , we need to take
this into account. Remember that a logical statement in R corresponds to a
number, i.e. TRUE = 1 and FALSE = 0; we can thus simply add the
different pieces of the function and multiply them with a condition which
specifies if X is contained in the interval or not (Fig. B.15 ):
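Because the CDF of the exercise is not reproduced here, the following sketch only illustrates the described TRUE = 1/FALSE = 0 trick with a hypothetical piecewise CDF, F(x) = x^2 on [0, 1] (0 below and 1 above); the actual solution uses the CDF of the exercise instead:

cdf <- function(x) {
  # each logical condition evaluates to 0 or 1, so only the relevant piece contributes
  0 * (x < 0) + x^2 * (x >= 0 & x <= 1) + 1 * (x > 1)
}
curve(cdf, from = -0.5, to = 1.5, ylab = "F(x)")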

(b) The PDF is


(c)

(d) We have already defined the CDF in (a). We can now simply plug in the x -
values of interest:

(e) The variance can be calculated as follows:

Solution to Exercise 7.4

(a) Two conditions need to be satisfied for f ( x ) to be a proper PDF:

(i) :
(ii) :

(b) We calculate

and therefore

(c) The expectation is


Using , we calculate the variance as

(d)

Solution to Exercise 7.5

(a) The marginal distributions are obtained by the row and column sums of the
joint PDF, respectively. For example, .

x_i    P(X = x_i)        y_j    P(Y = y_j)
0      3/4               1      1/6
1      1/4               2      7/12
                         3      1/4

The marginal distribution of X tells us how many customers sought help via
the telephone hotline (75 %) and via email (25 %). The marginal
distribution of Y represents the distribution of the satisfaction level,
highlighting that more than half of the customers (7/12) were “satisfied”.

(b) To determine the 75 % quantile with respect to Y , we need to find the


value for which and for . Y takes the
values 1, 2, 3. The quantile cannot be because .
The 75 % quantile is because and for
all values which are smaller than 2 we get .

(c) We can calculate the conditional distribution using


. Therefore,

Among those who used the email customer service two-thirds were
unsatisfied, one-third were satisfied, and no one was very satisfied.

(d) As we know from ( 7.27 ), two discrete random variables are said to be
independent if . However, in our
example, . This means that X
and Y are not independent.

(e) The covariance of X and Y is defined as . We


calculate
Since , we conclude that there is a negative relationship
between X and Y : the higher the values of X , the lower the values of Y —
and vice versa. For example, those who use the email-based customer
service ( ) are less satisfied than those who use the telephone
customer service ( ). It is, however, evident that in this example the
values of X have no order and therefore, care must be exercised in
interpreting the covariance.

Solution to Exercise 7.6


Using Tschebyschev’s inequality ( 7.24 )

we can determine c as follows:

Thus, the interval is .

Solution to Exercise 7.7

(a) The joint PDF is:

              Y
         0      1      2
X   −1   0.3    0.2    0.2
     2   0.1    0.1    0.1

(b) The marginal distributions are obtained from the row and column sums of
the joint PDF, respectively:

X    −1    2          Y    0     1     2
P    0.7   0.3        P    0.4   0.3   0.3

(c) The random variables X and Y are independent if

However, in our example we have, for example,

Hence, the two variables are not independent.

(d) The joint distribution of X and Y can be used to obtain the desired
distribution of U . For example, if and , then .
The respective probability is because
and there is no other combination of X -
and Y -values which yields . The distribution of U is therefore as
follows:

k           −1     0     1     2     3     4
P(U = k)    0.3    0.2   0.2   0.1   0.1   0.1

(e) We calculate
It can be seen that . This makes sense
because we know from ( 7.31 ) that . However,
. This follows from ( 7.7.1 )
which says that and therefore,
only if the covariance is 0. We know from (c)
that X and Y are not independent and thus .

Solution to Exercise 7.8

(a) The random variables X and Y are independent if the balls are drawn with
replacement. This becomes clear by understanding that drawing with
replacement implies that for both the draws, the same balls are in the urn
and the conditions in each draw remain the same. The first draw has no
implications for the second draw.
If we were drawing the balls without replacement, then the first draw
could possibly have implications for the second draw: for instance, if the
first ball drawn was red, then the second one could not be red because there
is only one red ball in the urn. This means that drawing without
replacement implies dependency of X and Y . This can also be seen by
evaluating the independence assumption ( 7.27 ):
(b) The marginal probabilities can be obtained from the given
information. For example, 3 out of 8 balls are black and thus
. The conditional distributions can be calculated easily by
realizing that under the assumed dependency of X and Y , the second draw
is always based on 7 balls (8 balls minus the one drawn under the condition
)—e.g. if the first ball drawn is black, then 7 balls, 2 of which are
black, remain in the urn and . We thus calculate

and obtain

Y
1 2 3
1

X 2 0

(c) The expectations are

To estimate , we need as well as and :


Using ( 7.38 ) and , we obtain

Solution to Exercise 7.9

(a) The constant c must satisfy

and therefore

which is

(b) The marginal distribution is

for .
(c) To determine , we need the cumulative marginal distribution of X
:

Now we can calculate

(d) The conditional distribution is

Solution to Exercise 7.10


If we evaluate the expectation with respect to Y , then both and can be
considered to be constants. We can therefore write

Since , it follows that . The variance is

Applying to this equation yields and therefore


Chapter 8
Solution to Exercise 8.1
The random variable X: “number of packages with a toy” is binomially
distributed. In each of “trials”, a toy can be found with probability .

(a) We thus get

(b) Similarly, we calculate

(c) This question relates to a hypergeometric distribution: there are


packages with packages with toys and packages without
a toy. The daughter gets packages and we are interested in .
Hence, we get

Solution to Exercise 8.2


Given , we get:

(a)

We obtain the same results in R as follows:


(b)

We would have obtained the same results in R using:

Solution to Exercise 8.3


The random variable X follows a discrete uniform distribution because
for each . The expectation and variance are therefore

Solution to Exercise 8.4


Each guess is a Bernoulli experiment where the right answer is given with a
probability of 50 %. The number of correct guesses therefore follows a binomial
distribution, i.e. . The probability of giving the right answer at least
8 times is identical to the probability of not being wrong more than 2 times. We
can thus calculate as :

This relates to

We would have obtained the same results in R using:
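A sketch of the call, reading n = 10 guesses and p = 0.5 from the description above:

pbinom(2, size = 10, prob = 0.5)   # P(at most 2 wrong answers) = P(at least 8 right), approx. 0.0547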


Solution to Exercise 8.5

(a) It seems appropriate to model the number of fused bulbs with a Poisson
distribution. We assume, however, that the probabilities of fused bulbs on
two consecutive days are independent of each other; i.e. they only depend
on but not on the time t .

(b) The arithmetic mean is

which means that, on an average, 1.73 bulbs are fused per day. The
variance is

We see that mean and variance are similar, which is an indication that the
choice of a Poisson distribution is appropriate since we assume
and .

(c) The following table lists the proportions (i.e. relative frequencies )
together with the probabilities from a Po (1.73)-distribution. As a
reference, we also list the probabilities from a Po (2)-distribution since it is
not practically possible that 1.73 bulbs stop working and it may hence be
an option to round the mean.

Fused bulbs    Relative frequency    Po(1.73)    Po(2)
0              0.2                   0.177       0.135
1              0.267                 0.307       0.27
2              0.267                 0.265       0.27
3              0.167                 0.153       0.18
4              0.067                 0.067       0.09
5              0.033                 0.023       0.036

One can see that observed proportions and expected probabilities are close
together which indicates again that the choice of a Poisson distribution was
appropriate. Chapter 9 gives more details on how to estimate parameters,
such as , from data if it is unknown.

(d) Using , we calculate

Thus, the bulbs are replaced on only 1 % of the days.

(e) If X follows a Poisson distribution then, given Theorem 8.2.1 , Y follows an


exponential distribution with .

(f) The expectation of an exponentially distributed variable is

This means that, on average, it takes more than half a day until one of the
bulbs gets fused.

Solution to Exercise 8.6

(a) Let X be a random variable describing “the number x of winning tickets


among n bought tickets”; then X follows the hypergeometric distribution
. We need to determine n for the conditions specified.
We are interested in

Using the PMF of the hypergeometric distribution


this equates to

We have the following requirement:

To solve this equation, we can program this function for in R


and evaluate it for different numbers of tickets sold, e.g. between 50 and
100 tickets:

The output shows that at least 64 tickets need to be bought to have a 99 % guarantee that at least three tickets win. This equates to spending €96.

(b) We can plot the function as follows:

Figure B.16 shows the relationship between the number of tickets bought
and the probability of having at least three winning tickets.
(c) The solution of (a) shows that it is well worth taking part in the raffle:
Marco pays €96 and, with a probability of 99 %, wins at least three prizes which are worth € . More generally, the money
generated by the raffle is € , but the prizes are worth €
. One may suspect that the company produces the
appliances much more cheaply than they are sold for and is thus so
generous.

Fig. B.16 Probability to have at least three winning tickets given the number of tickets bought

Solution to Exercise 8.7


The probability of getting a girl is 0.4878.

(a) We are dealing with a geometric distribution here. Since we are interested
in , we can calculate:

We would have obtained the same result in R using:
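A sketch of the corresponding call:

dgeom(2, prob = 0.4878)   # 2 boys before the first girl, approx. 0.128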


Note that we have to specify “2” rather than “3” for x because R takes the
number of unsuccessful trials rather than the number of trials until success.

(b) Here we deal with a binomial distribution with and . We can


calculate as follows:

R would have given us the same result using the dbinom(2,4,0.4878)


command.

Solution to Exercise 8.8

(a) The random variable Y follows a Poisson distribution, see Theorem 8.2.1
for more details.

(b) The fisherman catches, on average, 3 fish an hour. We can thus assume that
the rate is 3 and thus . Similarly, which means
that it takes, on average, 20 min to catch another fish.

(c) Using the PDF of the Poisson distribution, we get:

We would have obtained the same results in R using the dpois(5,3) and
dpois(0,3) commands.

Solution to Exercise 8.9


The random variable “choice of dessert” follows a multinomial distribution.
More precisely, describes whether chocolate brownies were chosen,
describes whether yoghurt was chosen, describes whether lemon tart was
chosen and .

(a) Using the PMF of the multinomial distribution, we get

We would have obtained the same results in R as follows:

(b) The probability of choosing lemon tart for the first two guests is 1. We thus
need to determine the probability that 3 out of the remaining 3 guests order
lemon tart:

Using dmultinom(c(0,0,3),prob=c(0.2,0.3,0.5)) in R , we
get the same result.

(c) The expectation vector is

This means we expect 4 guests to order brownies, 6 to order yoghurt, and


10 to order lemon tart. The covariance matrix can be determined as
follows:

Using and , we obtain the covariance


matrix as:
Solution to Exercise 8.10

Solution to Exercise 8.11

(a) Random numbers of a normal distribution can be generated using the


rnorm command. By default, mean = 0 and sd = 1 (see ?rnorm ), so we do not need to specify these parameters. We simply need to set n = 1000. The
mean of the 1000 realizations can thus be obtained using
mean(rnorm(1000)) . We can, for example, write a for loop to repeat
this process 1000 times. An empty (= NA ) vector of size 1000 can be used
to store and evaluate the results:
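A sketch of such a simulation (set.seed is only added for reproducibility and is not required):

set.seed(123)                     # for reproducibility
means <- rep(NA, 1000)            # empty vector to store the 1000 arithmetic means
for (i in 1:1000) {
  means[i] <- mean(rnorm(1000))   # mean of 1000 standard normal realizations
}
mean(means)                       # close to 0
var(means)                        # close to 1/1000
plot(density(means))              # kernel density plot, cf. Fig. B.17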

We see that the mean of the arithmetic means is close to zero, but not
exactly zero. The variance is approximately , as one
would expect from the Central Limit Theorem. The distribution is
symmetric, similar to a normal distribution, see Fig. B.17 . It follows that
is approximately distributed, as one could expect from the
Theorem of Large Numbers and the Central Limit Theorem. It is important
to understand that is not fixed but a random variable which follows a
distribution, i.e. the normal distribution.

(b) We can use the same code as above, except we use the exponential instead
of the normal distribution:

The realizations are i.i.d. observations. One can see that, as in (a), is approximately distributed. It is evident that the do
not necessarily need to follow a normal distribution for to follow a
normal distribution, see also Fig. B.18 a.

(c) Increasing the number of repetitions makes the distribution look closer to a
normal distribution, see Fig. B.18 b. This visualizes that as n tends to
infinity gets closer to a -distribution.

Fig. B.17 Kernel density plot of the distribution simulated in Exercise 8.11a
Fig. B.18 Kernel density plots for Exercises 8.11b and 8.11c
Chapter 9
Solution to Exercise 9.1

(a) The exercise tells us that . Let us look at the


realizations : under the assumption of independence, which we
know is fulfilled because the ’s are i.i.d., we can write the likelihood function as the product of the n PMFs of the Poisson distribution:

It is better to work on the log-scale because it is easier to differentiate. The results are identical no matter whether we use the likelihood function or the log-likelihood function because the log transformation is monotone in nature. The log-likelihood function is:

Differentiating with respect to yields

which gives us the ML estimate:

We need to confirm that the second derivative is < 0 at ; otherwise,


the solution would be a minimum rather than a maximum. We get

It follows that the arithmetic mean is the maximum likelihood


estimator for the parameter of a Poisson distribution.

(b) Using the results from (a) we can write the log-likelihood function for
as:
because . We can write down this function in R as follows:

The function can be plotted using the curve command:
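Since the observed data are not reproduced here, the following sketch uses a hypothetical data vector x; it programs the log-likelihood (up to the additive constant that does not depend on the parameter) and plots it with curve:

x <- c(2, 5, 3, 4, 1)                          # hypothetical Poisson observations
loglik <- function(lambda) {
  -length(x) * lambda + sum(x) * log(lambda)   # log-likelihood without the constant term
}
curve(loglik, from = 0.5, to = 8, xlab = "lambda", ylab = "log-likelihood")
abline(v = mean(x), lty = 2)                   # the maximum is attained at the arithmetic mean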

Figure B.19 shows the log-likelihood function. It can be seen that the
function reaches its maximum at .

(c) Using (a) we can write the likelihood function as

This means is sufficient for . The arithmetic mean, which is


the maximum likelihood estimate, is a one-to-one function of T and
therefore sufficient too.
Fig. B.19 Illustration of the log-likelihood function

Solution to Exercise 9.2

(a) The probability density function of a normal distribution equates to

with . The likelihood function is therefore

To find the maximum of , it is again easier to work with


the log -likelihood function, which is

Assuming to be 1, differentiating the log-likelihood function with


respect to , and equating it to zero gives us
The ML estimate is therefore .

(b) Looking at the differentiated log-likelihood function in (a) shows us that


the ML estimate of is always the arithmetic mean, no matter whether
is 1 or any other number.

(c) Differentiating the log-likelihood function from (a) with respect to


yields

Using we calculate as

Since the parameter is two-dimensional, it follows that one needs


to solve two ML equations, where is estimated first, and second
(as we did above). It follows further that one needs to look at the positive
definiteness of a matrix (the so-called information matrix) when checking
that the second-order derivatives of the estimates are positive and therefore
the estimates yield a maximum rather than a minimum. However, we omit
this lengthy and time-consuming task here.

Solution to Exercise 9.3


The probability density function of is

Note that this equates to the PDF from Definition 8.2.1 for and . The
likelihood function is therefore

One can see that increases as decreases. The maximum of the


likelihood function is therefore achieved for the smallest valid . In particular,
is minimized when . This follows from the definition
of the PDF which requires that and therefore . Thus, the
maximum likelihood estimate of is , the greatest observed value in the
sample.

Solution to Exercise 9.4

(a) is unbiased, and therefore also asymptotically unbiased, because

Similarly, is unbiased, and therefore also asymptotically unbiased,


because

(b) To calculate the MSE we need to determine the bias and the variance of the
estimators as we know from Eq. ( 9.5 ). It follows from (a) that both
estimators are unbiased and hence the bias is 0. For the variances we get:

Since the mean squared error consists of the sum of the variance and
squared bias, the MSE for and are and , respectively.
One can see that the larger n , the more superior over in terms
of the mean squared error. In other words, is more efficient than
because its variance is lower for any .

(c) Using the results from (b), we get

This means the MSE approaches 0 as n tends to infinity. Therefore, is


MSE consistent for . Since is MSE consistent, it is also weakly
consistent.
Solution to Exercise 9.5

(a) The point estimate of is which is

The variance of can be estimated unbiasedly using :

(b) The variance is unknown and needs to be estimated. We thus need the t -
distribution to construct the confidence interval. We can determine
using qt(0.975,23) or Table C.2 (though the latter is not
detailed enough), and . This yields

The confidence interval for is thus [611.17; 724.66].

(c) We can reproduce these results in R as follows:

Solution to Exercise 9.6


Let us start with the confidence interval for the “Brose Baskets Bamberg”.
Using ( qt(0.975,15) or Table C.2 ) and , we
can determine the confidence interval as follows:
Thus, we get [195.305; 202.815].
For the “Bayer Giants Leverkusen”, we use to get

This leads to a confidence interval of [190.352; 201.648].


For “Werder Bremen”, we need to use the quantile which
yields a confidence interval of

The interval is therefore [185.255; 189.786].


The mean heights of the basketball teams are obviously larger than the
mean height of the football team. The two confidence intervals of the
basketball teams overlap, whereas the intervals of the football team with the
two basketball teams do not overlap. This indicates that the height of football players is substantially less than the height of basketball
players. In Chap. 10 , we will learn that confidence intervals can be used to
test hypotheses about mean differences.

Solution to Exercise 9.7

(a) Using , we calculate an unbiased point estimate for p using .


Note that if the wife has to wash the dishes.

(b) Since and p is sufficiently large, we


can use the normal approximation to calculate a confidence interval for p .
Using , we obtain

This yields a confidence interval of . Note that the confidence


interval does not contain which is the probability that would be
expected if the coin was fair. This is a clear sign that the coin might be
unfair.

(c) If the coin is fair, we can use as our prior judgement. We would
then need

dinners to get the desired precision—which is not possible as this would


constitute a time period of more than 100 years. This shows that the
expectation of such a high precision is unrealistic and needs to be modified.
However, as the results of (b) show, the coin may not be fair. If the coin is
indeed unfair, we may have, for example, and which
gives a smaller sample size. We can thus interpret the sample size as a
conservative estimate of the highest number of dinners needed.

Solution to Exercise 9.8


If a student fails then . We know that and .

(a) Using as point estimate, we get %. Using


, we can calculate the confidence interval as

Using R we get:
This result is different because the above command does not use the normal
approximation. In addition, p is rather small which means that care must be
exercised when using the results of the confidence interval with normal
approximation in this context.

(b) The point estimate of 10.6 % is substantially higher than 3.2 %. The lower
bound of the confidence interval is still larger than the proportion of failures at
county level. This indicates that the school is doing worse than most other
schools in the county.

Solution to Exercise 9.9

(a) Whether the i th household has switched on the TV and watches


“Germany’s next top model” (GNTM) relates to a random variable with

It follows that is the random variable describing the number of


TVs, out of 2500 TVs, which are switched on and show GNTM. Since the
’s can be assumed to be i.i.d., we can say that X follows a binomial
distribution, i.e. with p unknown. The length of the
confidence interval for p ,

is

Unfortunately, is unknown but because the maximum


value for is 0.25 if . Hence
This means the precision is half the length, i.e. , in the
worst case.

(b) There is the danger of selection bias. A total of 500 households refused to
take part in the study. It may be that their preferences regarding TV shows
are different from the other 2500 households. For example, it may well be
possible that those watching almost no TV refuse to be included; or that
those watching TV shows which are considered embarrassing by society
are refusing as well. In general, missing data may cause point estimates to
be biased, depending on the underlying mechanism causing the absence.

Solution to Exercise 9.10

(a) Using and , we obtain an optimal sample size of at


least

To calculate a confidence interval with a width of not more than 0.2 s, the
results of at least 21 athletes are needed.

(b) The sample size is 30. Thus, the confidence interval width should be
smaller than 0.2 s. This is indeed true as the calculations show:

The width is only 0.18 s.

(c) If we calculate an 80 % confidence interval, then the lower confidence limit


corresponds to the running time which is achieved by only 10 % of the
athletes (of the population). Using we get

The athlete’s best time is thus below the lower confidence limit. He is
among the top 10 % of all athletes, using the results of the confidence
interval.

Solution to Exercise 9.11

(a) The odds ratio is

This means the chances that a pizza arrives in time are about 1.08 times
higher for Laura compared with Melissa. To calculate the 95 % confidence
interval, we need , and

The interval for the log odds ratio is

Exponentiating the interval gives us the 95 % confidence interval for the


odds ratio which is . This indicates that the odds of Laura’s
pizzas arriving earlier than Melissa’s are not much different from one.
While the point estimate tells us that Laura’s pizzas are delivered 1.08
times faster, the confidence interval tells us that there is uncertainty around
this estimate in the sense that it could also be smaller than 1 and Melissa
may not necessarily work more slowly than Laura.

(b) We can reproduce the results in R by attaching the pizza data, creating a
categorical delivery time variable (using cut ) and then applying the
oddsratio command from the library epitools onto the contingency
table:
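A sketch of such an analysis; the variable names time and operator, as well as the cut-off of 30 min used to define "delivered in time", are assumptions about the pizza data:

library(epitools)
timecat <- cut(time, breaks = c(-Inf, 30, Inf))   # delivered in time vs. too late (assumed cut-off)
tab <- table(timecat, operator)                   # 2 x 2 contingency table
oddsratio(tab, method = "wald")                   # odds ratio with confidence interval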
Chapter 10
Solution to Exercise 10.1
A type I error is defined as the probability of rejecting if is true. This error
occurs if A thinks that B does confess, but B does not. In this scenario, A
confesses, goes free, and B serves a three-year sentence. A type II error is
defined as the probability of accepting , despite the fact that is wrong. In
this case, B does confess, but A does not. In this scenario, B goes free and A
serves a three-year sentence. A type II error is therefore worse for A . With a
statistical test, we always control the type I error, but not the type II error.

Solution to Exercise 10.2

(a) The hypotheses are

(b) It is a one-sample problem for : thus, for known variance, the Gauss test
can be used; the t -test otherwise, see also Appendix D . Since, in this
exercise, is assumed to be known we can use the Gauss test; i.e. we can
compare the test statistic with the normal distribution (as opposed to the t -
distribution when using the t -test). The sample size is small: we must
therefore assume a normal distribution for the data.

(c) To calculate the realized test statistic, we need the arithmetic mean
. Then we obtain

We reject if . Since , the null


hypothesis can be rejected. The test is significant. There is enough
evidence to suggest that the observed weights do not come from a normal
distribution with mean 100 g.

(d) To show that , we need to conduct a one-sided test. Most


importantly, the research hypothesis of interest needs to be stated as the
alternative hypothesis:

(e) The test statistic is the same as in (c): . However, the critical
region changes. gets rejected if . Again, the null
hypothesis is rejected. The producer was right in hypothesizing that the
average weight of his chocolate bars was lower than 100 g.

Solution to Exercise 10.3

(a) We calculate

                         No auction    Auction
Mean                     16.949        10.995
Variance s^2             2.948         2.461
Standard deviation s     1.717         1.569
Coeff. of variation v    0.101         0.143

Note that we use the unbiased estimates for the variance and the standard
deviation as explained in Chap. 9 ; i.e. we use 1/(n − 1) rather than 1/n .
It is evident that the mean price of the auctions ( ) is lower than the
mean non-auction price ( ), and also lower than the price from the
online book store. There is, however, a higher variability in relation to the
mean for the auction prices. One may speculate that the best offers are
found in the auctions, but that there are no major differences between the
online store and the internet book store, i.e.
€16.95,
€16.95,
.

(b) We can use the t -test to test the hypotheses


The test statistic is

Using the decision rules from Table 10.2 , we conclude that gets
rejected if . We can calculate either by using
Table C.2 or by using R ( qt(0.975,13) ). Since , we
keep the null hypothesis. There is not enough evidence to suggest that the
prices of the online store differ from €16.95 (which is the price from the
internet book store).

(c) Using ( 9.6 ) we can calculate the upper and lower limits of the confidence
interval as,

respectively. The confidence interval does contain ; hence, the


null hypothesis cannot be rejected. This is the same
conclusion as obtained from the one-sample t -test from above.

(d) We test the following hypotheses:

The realized test statistic is

Table 10.2 tells us that gets rejected if . Since


we reject . The test confirms our hypothesis that the
mean auction prices are lower than €16.95, i.e. the price from the book
store.

(e) We need to conduct two tests: the two-sample t -test for the assumption of
equal variances and the Welch test for the assumption of different
variances.
(i) Two-sample t -test :

To calculate the test statistic ( 10.4 ), we need to determine the pooled


sample variance:

The test statistic is

We reject if . Table C.2 does not list


the quantile; thus, one uses R ( qt(0.95,26) ) to determine the
quantile. Since , we reject the null hypothesis. The test is
significant. There is enough evidence to conclude that the mean
auction prices are lower than the mean non-auction prices.

(ii) Welch test : For the Welch test, we calculate the test statistic, using (
10.6 ) as:

To calculate the critical value, we need the degrees of freedom:

We reject if . Using (in R obtained as


qt(0.95,25.79) ) and , we reject the null hypothesis.
The conclusions are identical to using the two-sample t -test (in this
example). Interestingly, the two test statistics are similar which
indicates that the assumption of equal variances in case (i) was not
unreasonable.

(f) The F -test relies on the assumption of the normal distribution and tests the
hypotheses:

We can calculate the test statistic as described in Appendix C :

The F -distribution is not symmetric around 0; thus, we need to calculate


two critical values: and . Using R we get
( qf(0.975,13,13) ) and (
qf(0.025,13,13) ). is rejected if or
. Since we do not reject . This
means there is no evidence that the two variances differ. This is also
reassuring with respect to using the two-sample t -test above (which
assumes equal variances). However, note that testing the equality of
variances with the F -test and then using the two-sample t -test or Welch
test based on the outcome of the F -test is not ideal and not necessarily
correct. In practice, it is best to simply use the Welch test (rather than the t -
test), which is also the default option in the R function t.test .

(g) We need the following table to calculate the rank sums:

Value 9.3 9.52 9.54 10.01 10.5 10.5 10.55 10.59 11.02 11.03
Sample a a a a a a a a a a
Rank 1 2 3 4 5 6 7 8 9 10
Value 11.89 11.99 12 13.79 15.49 15.9 15.9 15.9 15.9 15.9
Sample a a a na a na na na na na
Rank 11 12 13 14 15 16 17 18 19 20
Value 15.99 16.98 16.98 17.72 18.19 18.19 19.97 19.97
Sample na na na na na na na na
Rank 21 22 23 24 25 26 27 28
We can calculate the rank sums as and
, respectively. Thus

With we can calculate the test statistic, which is


approximately normally distributed, as:

Since , the null hypothesis can be rejected. We


conclude that the locations of the two distributions are shifted. Obviously
the auction prices are smaller in general, and so is their median.

(h) We can type in the data and evaluate the summary statistics using the
mean, var , and sd commands:

The t.test command can be used to answer questions (b)–(e). For (b)
and (c) we use
The test decision can be made by means of either the p -value (
) or the confidence interval ([15.95; 17.94], which covers
16.95). To answer (d) and (e) we need to make use of the option
alternative which specifies the alternative hypothesis:

Note that the two-sample test provides a confidence interval for the
difference of the means. Questions (f) and (g) can be easily solved by using
the var.test and wilcox.test commands:
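A sketch of these calls, assuming the prices are stored in the vectors pr_noa (no auction) and pr_a (auction); the names are placeholders:

t.test(pr_noa, mu = 16.95)                                        # (b), (c): one-sample test and confidence interval
t.test(pr_a, mu = 16.95, alternative = "less")                    # (d): one-sided test
t.test(pr_noa, pr_a, alternative = "greater", var.equal = TRUE)   # (e)(i): two-sample t-test
var.test(pr_noa, pr_a)                                            # (f): F-test for equal variances
wilcox.test(pr_noa, pr_a)                                         # (g): Mann-Whitney U-test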

The test decision is best made by using the p -value.

Solution to Exercise 10.4


Since the data before and after the diet is dependent (weight measured on the
same subjects), we need to use the paired t -test. To calculate the test statistic, we
need to first determine the weight differences:

Person i 1 2 3 4 5 6 7 8 9 10
Before diet 80 95 70 82 71 70 120 105 111 90
After diet 78 94 69 83 65 69 118 103 112 88
Differences d 2 1 1 −1 6 1 2 2 −1 2

Using and

we calculate the test statistic as

The null hypothesis is rejected because . This


means the mean weight before and after the diet is different. We would have
obtained the same results by calculating the confidence interval for the
differences:

Since the confidence interval does not overlap with zero, we reject the null
hypothesis; there is enough evidence that the mean difference is different (i.e.
greater) from zero. While the test is significant and suggests a weight difference,
it is worth noting that the mean benefit from the diet is only 1.5 kg. Whether this
is a relevant reduction in weight has to be decided by the ten people
experimenting with the diet.

Solution to Exercise 10.5

(a) The production is no longer profitable if the probability of finding a


deficient shirt is greater than 10 . This equates to the hypotheses:

The sample proportion of deficient shirts is . Since


we can use the test statistic
The null hypothesis gets rejected because . It seems
the production is no longer profitable.

(b) The test statistic is . The critical region can be determined


by calculating

Using R we get

The test statistic ( ) does not fall outside the critical region ([14;
32]); therefore, the null hypothesis is not rejected. The same result is
obtained by using the binomial test in R :
binom.test(c(30,200),p=0.1) . This yields a p -value of 0.11239
and a confidence interval covering 10 % ([0.09; 0.18]). Interestingly, the
approximate binomial test rejects the null hypothesis, whereas the exact
test keeps it. Since the latter test is more precise, it is recommended to
follow its outcome.

(c) The research hypothesis is that the new machine produces fewer deficient
shirts:

To calculate the test statistic, we need the following:

This yields:
The null hypothesis is rejected because
. This means we accept the
alternative hypothesis that the new machine is better than the old one.

(d) The data can be summarized in the following table:

Machine 1 Machine 2
Deficient 30 7
Fine 200 112

To test the hypotheses established in (c), we apply the fisher.test


command onto the contingency table:
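A sketch of the call, using the counts from the table above:

shirts <- matrix(c(30, 200, 7, 112), ncol = 2,
                 dimnames = list(c("Deficient", "Fine"), c("Machine 1", "Machine 2")))
fisher.test(shirts)   # the text reports a p-value of 0.0438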

This yields a p -value of 0.0438 suggesting, as the test in (c), that the null
hypothesis should be rejected. Note that the confidence interval for the
odds ratio, also reported by R , does not necessarily yield the same
conclusion as the test of Fisher.

Solution to Exercise 10.6

(a) To compare the two means, we should use the Welch test (since the
variances cannot be assumed to be equal). The test statistic is

The alternative hypothesis is ; i.e. we deal with a one-sided


hypothesis. The null hypothesis is rejected if . Calculating

yields (using qt(0.95,9.09) in R ; or looking at Table


C.2 for 9 degrees of freedom). Therefore, and the null
hypothesis is not rejected. This means player 2 could not prove that he
scores higher on average.

(b) We have to first rank the merged data:

Value 6 29 40 47 47 64 87 88 91 98
Sample 2 2 2 2 2 2 2 1 1 2
Rank 1 2 3 4 5 6 7 8 9 10
Value 99 99 101 104 105 108 111 112 261 351
Sample 1 1 1 1 1 1 1 1 2 2
Rank 11 12 13 14 15 16 17 18 19 20

This gives us

, and therefore . The test statistic can thus be calculated as

Since , the null hypothesis is rejected. The test


supports the alternative hypothesis of higher points for player 1. The U -test
has the advantage of not being focused on the expectation . The two
samples are clearly different: the second player scores with much more
variability and his distribution of scores is clearly not symmetric and
normally distributed. Since the sample is small, and the assumption of a
normal distribution is likely not met, it makes sense to not use the t -test.
Moreover, because the distribution is skewed the mean may not be a
sensible measure of comparison. The two tests yield different conclusions in
this example which shows that one needs to be careful in framing the right
hypotheses. A drawback of the U -test is that it uses only ranks and not the
raw data: it thus uses less information than the t -test which would be
preferred when comparing means of a reasonably sized sample.
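The U-test can be reproduced with wilcox.test, using the scores as read off the rank table above (the ties produce a warning that the exact p-value cannot be computed):

player1 <- c(88, 91, 99, 99, 101, 104, 105, 108, 111, 112)
player2 <- c(6, 29, 40, 47, 47, 64, 87, 98, 261, 351)
wilcox.test(player1, player2, alternative = "greater")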

Solution to Exercise 10.7


Otto speculates that the probability of finding a bear of colour i is 0.2, i.e.
, and . This hypothesis can
be tested by using the goodness-of-fit test. The test statistic is

The null hypothesis cannot be rejected because .


While the small number of white bears might be disappointing, the test suggests
that there is not enough evidence to reject the hypothesis of equal probabilities.

Solution to Exercise 10.8

(a) To answer this question, we need to conduct the χ²-independence test. The test statistic t(x, y) is identical to Pearson's χ² statistic, introduced in Chap. 4 . In Exercise 4.4, we have calculated this statistic already, t(x, y) = 182.06, see p. 350 for the details. The null hypothesis of independence is rejected if t(x, y) exceeds the 95 % quantile of the χ²-distribution with (I − 1)(J − 1) degrees of freedom. With I = 2 (number of rows) and J = 4 (number of columns) we get c₃; 0.95 = 7.81 using Table C.3 (or qchisq(0.95,3) in R ). Since 182.06 > 7.81, we reject the null hypothesis of independence.

(b) The output refers to a test of homogeneity: the null hypothesis is that the
proportion of passengers rescued is identical for the different travel classes.
This hypothesis is rejected because p is smaller than . It is evident
that the proportions of rescued passengers in the first two classes (60 %,
43.9 %) are much higher than in the other classes (25 %, 23.8 %). One can
see that the test statistic (182.06) is identical to (a). This is not surprising:
the χ²-independence test and the test of homogeneity are technically identical, but the null hypotheses differ. In (a), we showed that “rescue status” and “travel class” are not independent; in (b), we have seen that the
conditional distributions of rescue status given travel class differ by travel
class, i.e. that the proportions of those rescued differ by the categories 1.
class/2. class/3. class/staff.

(c) The summarized table is as follows:


1. Class/2. Class 3. Class/Staff Total
Rescued 327 391 718
Not rescued 295 1215 1510
Total 622 1606 2228

Using ( 4.7 ) we get

The χ²-independence test and the test of homogeneity are technically identical: H₀ is rejected if the test statistic exceeds c₁; 0.95 = 3.84. Since the calculated test statistic is much larger than 3.84, the null hypothesis is rejected. As highlighted in (b) this has two
interpretations: (i) “rescue status” and “travel class” are not independent (independence test) and (ii) the conditional distributions of rescue status
given travel class differ by travel class (homogeneity test). The second
interpretation implies that the probability of being rescued differs by travel
class. The null hypothesis of the same probabilities of being rescued is also
rejected using the test of Fisher. Summarizing the data in a matrix and
applying the fisher.test command yields a p -value smaller than
.
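A sketch of these tests for the summarized table, using the counts given above:

rescue <- matrix(c(327, 295, 391, 1215), ncol = 2,
                 dimnames = list(c("Rescued", "Not rescued"),
                                 c("1. Class/2. Class", "3. Class/Staff")))
chisq.test(rescue)    # chi-square test of independence/homogeneity
fisher.test(rescue)   # exact test of Fisher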

Solution to Exercise 10.9

(a) The hypotheses are:

The pooled variance,

is needed to calculate the test statistic:


H₀ is rejected if the absolute value of the test statistic exceeds 2.02. Since the observed value is smaller, we do not reject the null hypothesis.

(b) The hypotheses are:

The pooled variance,

is needed to calculate the test statistic:

H₀ is rejected because the absolute value of the test statistic exceeds 2.02.

(c) In both (a) and (b), there exists a true difference in the mean. However,
only in (b) is the t -test able to detect the difference. This highlights that
smaller differences can only be detected if the sample size is sufficiently
large. However, if the sample size is very large, the test may well detect a difference that is so small that it is of no practical relevance.

Solution to Exercise 10.10

(a) After reading in and attaching the data, we can simply use the t.test
command to compare the expenditure of the two samples:
We see that the null hypothesis is not rejected because (also,
the confidence interval overlaps with “0”).

(b) A two-sample t -test and a U -test yield the same conclusion. The p -values,
obtained with

are 0.1946 and 0.145, respectively. Interestingly, the test statistic of the
two-sample t -test is almost identical to the one from the Welch test in (a) (
)—this indicates that the assumption of equal variances may not be
unreasonable.

(c) We can use the usual t.test command together with the option
alternative = ’greater’ to get the solution.

Here, p is much smaller than 0.001; hence, the null hypothesis can be
rejected. Women spend more on theatre visits than men.

(d) We deal with dependent (paired) data because different variables


(expenditure this year versus expenditure last year) are measured on the
same observations. Thus, we need to use the paired t -test—which we can
use in R by specifying the paired option:
Both the p -value (which is greater than 0.05) and the confidence interval
(overlapping with “0”) state that the null hypothesis should be kept. The
mean difference in expenditure (3.1 SFR) is not large enough to suggest
that this difference is not caused by chance. Visitors spend, on average,
about the same in the two years.
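Hedged sketches of the calls for (a)-(d); the variable names (expenditure, sex, expenditure_ly) are assumptions about the data set, not taken from the book:

t.test(expenditure ~ sex)                            # (a) Welch test (default: unequal variances)
t.test(expenditure ~ sex, var.equal = TRUE)          # (b) two-sample t-test
wilcox.test(expenditure ~ sex)                       # (b) U-test
t.test(expenditure ~ sex, alternative = "greater")   # (c) one-sided; direction depends on the factor coding
t.test(expenditure, expenditure_ly, paired = TRUE)   # (d) paired t-test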

Solution to Exercise 10.11

(a) We can use the one-sample t -test to answer both questions:

We cannot confirm that the mean delivery time is less than 30 min and that
the mean temperature is greater than 65 °C. This is not surprising: we
have already seen in Exercise 3.10 that the manager should be unsatisfied
with the performance of his company.
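A hedged sketch of the two one-sample tests; the variable names 'time' and 'temperature' are assumptions about the pizza data:

t.test(time, mu = 30, alternative = "less")
t.test(temperature, mu = 65, alternative = "greater")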
(b) We can use the exact binomial test to investigate and
. For the binom.test command, we need to know the
numbers of successes and failures, i.e. the number of deliveries where a
free wine should have been given to the customer. Applying the table
commands yields 229 and 1037 deliveries, respectively.

The null hypothesis cannot be rejected because and


because the confidence interval covers . We get the same results if
we use the variable “got_wine” instead of “free_wine”. While the test says
that we cannot exclude the possibility that the probability of receiving a
free wine is less than 15 %, the point estimate of 18 % suggests that the
manager still has to improve the timeliness of the deliveries or stop the
offer of free wine.
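A hedged sketch of the call for (b); the direction of the alternative follows the (elided) hypotheses above and may need to be adapted:

binom.test(c(229, 1037), p = 0.15, alternative = "less")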

(c) We first need to create a new categorical variable (using cut ) which
divides the temperatures into two parts: below and above 65 °C. Then we
can simply apply the test commands ( fisher.test , chisq.test ,
prop.test ) to the table of operator and temperature:
We know that the two tests lead to identical results. For both of them the p-value is 0.2283, which suggests that we should keep the null hypothesis. There is not enough evidence to support the claim that the proportion of hot pizzas differs by operator, i.e. that the two variables are not independent. The test of Fisher yields the same result ( ).
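A hedged sketch of these steps; the variable names 'temperature' and 'operator' are assumptions about the pizza data:

hot <- cut(temperature, breaks = c(-Inf, 65, Inf), labels = c("<=65", ">65"))
tab <- table(operator, hot)
chisq.test(tab)
prop.test(tab)
fisher.test(tab)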

(d) The null hypothesis is that the proportion of deliveries is the same for each
branch: . To test this hypothesis, we need a
goodness-of-fit test:

The null hypothesis of equal proportions is therefore not rejected (


).
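A hedged sketch for (d); 'branch' as the variable name is an assumption:

chisq.test(table(branch))   # by default tests equal probabilities for all categories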

(e) To compare three proportions we need to use the homogeneity test:

We can see that the proportions are almost identical and that the null
hypothesis is not rejected ( ).

To test this relationship, we use the χ²-independence test:


(f)
The null hypothesis of independence is rejected.

Solution to Exercise 10.12


To test the hypothesis we use the goodness-
of-fit test. The test statistic is

gets rejected if . Thus, the null hypothesis is not


rejected. We accept that all three authors took about one-third of the pictures.
Chapter 11
Solution to Exercise 11.1

(a) Calculating xᵢ − x̄ and yᵢ − ȳ, we obtain the following table needed for the estimation of α̂ and β̂:

Body mass index xᵢ   xᵢ − x̄   (xᵢ − x̄)²   Systolic blood pressure yᵢ   yᵢ − ȳ   (yᵢ − ȳ)²   (xᵢ − x̄)(yᵢ − ȳ)
26                   0.5       0.25        170                          10       100         5
23                   −2.5      6.25        150                          −10      100         25
27                   1.5       2.25        160                          0        0           0
28                   2.5       6.25        175                          15       225         37.5
24                   −1.5      2.25        155                          −5       25          7.5
25                   −0.5      0.25        150                          −10      100         5
Total 153            0         17.5        960                          0        550         80

With Σᵢ xᵢ = 153, it follows that x̄ = 153/6 = 25.5. Moreover, we get ȳ = 960/6 = 160 as well as Sxx = Σᵢ(xᵢ − x̄)² = 17.5 and Sxy = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) = 80. The parameter estimates are therefore

β̂ = Sxy/Sxx = 80/17.5 ≈ 4.57,   α̂ = ȳ − β̂ x̄ = 160 − 4.57 · 25.5 ≈ 43.4.
A one-unit increase in the BMI therefore relates to a 4.57 unit increase in


the blood pressure. The model suggests a positive association between BMI
and systolic blood pressure. It is impossible to have a BMI of 0; therefore,
cannot be interpreted meaningfully here.
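The estimates can be double-checked in R with the data from the table above:

bmi <- c(26, 23, 27, 28, 24, 25)
bp  <- c(170, 150, 160, 175, 155, 150)
lm(bp ~ bmi)   # intercept about 43.4, slope about 4.57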

(b) Using ( 11.14 ), we obtain R² as R² = 80²/(17.5 · 550) ≈ 0.66.
Thus 66 % of the data’s variability can be explained by the model. The
goodness of fit is good, but not perfect.

Solution to Exercise 11.2

(a) To estimate , we use the second equality from ( 11.6 ):

Calculating
leads
to

For children who spend no time on the internet at all, this model predicts
6.16 h of deep sleep. Each hour spent on the internet decreases the time in
deep sleep by 0.45 h which is 27 min.

(b) Using the results from (a) and yields:

About 45 % of the variation can be explained by the model. The fit of the
model to the data is neither very good nor very bad.

(c) After collecting the data in two vectors ( c() ), printing a summary of the
linear model ( summary(lm()) ) reproduces the results. A scatter plot
can be produced by plotting the two vectors against each other ( plot() ).
The regression line can be added with abline() :
The plot is displayed in Fig. B.20 .

(d) Treating X as a binary variable yields the following values:


0, 1, 0, 0, 0, 1, 1, 0, 1. We therefore have , and
. Since the Y -values stay the same we
calculate

Thus, those children who are on the internet for a long time (i.e. >1 h) sleep
on average 0.85 h (i.e. 51 min) less. If we change the coding of 1’s and 0’s,
will just have a different sign: . In this case, we can conclude
that children who spend less time on the internet sleep on average 0.85 h
longer than children who spend more time on the internet. This is the same
conclusion and highlights that the choice of coding does not affect the
interpretation of the results.

Fig. B.20 Scatter plot and regression line for the association of internet use and deep sleep

Solution to Exercise 11.3


(a) The correlation coefficient is

This indicates strong positive correlation: the higher the height, the higher
the weight. Since , we already know that the fit of a
linear regression model will be good (no matter whether height or weight
is treated as outcome). From ( 11.11 ), we also know that will be
positive.

(b) We know from (a) that and that . The least


squares estimates are therefore

Each centimetre difference in height therefore means a 1.129 kg


difference in weight. It is not possible to interpret meaningfully in this
example.

(c) The prediction is

(d)– The black dots in Fig. B.21 show the scatter plot of the data. There is
(g) clearly a positive association in that greater height implies greater weight.
This is also emphasized by the regression line estimated in (b). The two
additional points appear in dark grey in the plot. It is obvious that they do
not match the pattern observed in the original 17 data points. One may
therefore speculate that with the inclusion of the two new points will be
smaller. To estimate the new regression line we need
This yields

This shows that the two added points shrink the estimate from 1.129 to
0.28. The association becomes less clear. This is an insightful example
showing that least squares estimates are generally sensitive to outliers
which can potentially affect the results.

Fig. B.21 Scatter plot and regression line for both 17 and 19 observations

Solution to Exercise 11.4

(a) The point estimate of suggests a 0.077 % increase of hotel occupation for
each one degree increase in temperature. However, the null hypothesis of
cannot be rejected because . We therefore cannot
show an association between temperature and hotel occupation.
(b) The average hotel occupation is higher in Davos (7.9 %) and Polenca
(0.9 %) compared with Basel (reference category). However, these
differences are not significant. Both and
cannot be rejected. The model cannot show a significant difference in hotel
occupation between Davos/Polenca and Basel.

(c) The analysis of variance table tells us that the null hypothesis of equal
average temperatures in the three cities ( ) cannot be rejected.
Note that in this example the overall F -test would have given us the same
results.

(d) In the multivariate model, the main conclusions of (a) and (b) do not
change: testing never leads to the rejection of the null
hypothesis. We cannot show an association between temperature and hotel
occupation (given the city); and we cannot show an association between
city and hotel occupation (given the temperature).

(e) Stratifying the data yields considerably different results compared to (a)–
(c): In Davos, where tourists go for skiing, each increase of 1 °C relates to a drop in hotel occupation of 2.7 %. The estimate is also significantly different from zero ( ). In Polenca, a summer holiday destination, an increase of 1 °C implies an increase of hotel
occupation of almost 4 %. This estimate is also significantly different from
zero ( ). In Basel, a business destination, there is a
somewhat higher hotel occupation for higher temperatures ( );
however, the estimate is not significantly different from zero. While there
is no overall association between temperature and hotel occupation (see (a)
and (c)), there is an association between them if one looks at the different
cities separately. This suggests that an interaction between temperature and
city should be included in the model.

(f) The design matrix contains a column of 1’s (to model the intercept), the
temperature and two dummies for the categorical variable “city” because it
has three categories. The matrix also contains the interaction terms which
are both the product of temperature and Davos and temperature and
Polenca. The matrix has 36 rows because there are 36 observations: 12 for
each city.

(g) Both interaction terms are significantly different from zero (


and ). The estimate of temperature therefore differs by city,
and the estimate of city differs by temperature. For the reference city of
Basel, the association between temperature and hotel occupation is
estimated as 1.31; for Davos it is and for Polenca
. Note that these results are identical to (d) where we
fitted three different regressions—they are just summarized in a different
way.

(h) From (f) it follows that the point estimates for are 1.31 for Basel,
for Davos, and 3.97 for Polenca. Confidence intervals for these
estimates can be obtained via ( 11.29 ):

We calculate . With
(obtained via from the model output or from the second row and
second column of the covariance matrix),
, and also
we obtain:
The 95 % confidence intervals are therefore:

Solution to Exercise 11.5

(a) The missing value [1] can be calculated as

Since and , it is clear that the p -value from


[2] is smaller than 0.05. The exact p -value can be calculated in R via (1-
pt(2.019, 693))*2 which yields 0.0439. The command pt(2.019, 693) gives the probability value for the quantile of 2.019 (with 693 degrees of freedom): 0.978. Therefore, with probability (1 − 0.978) · 100 % ≈ 2.2 % a value lies to the right of 2.019 in the respective t-distribution, which gives, multiplied by two to account for a two-sided test, the p-value.
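The computation quoted above in R:

pt(2.019, 693)             # about 0.978
(1 - pt(2.019, 693)) * 2   # about 0.044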

(b)– The plot on the left shows that the residuals are certainly not normally
(c) distributed as required by the model assumptions. The dots do not
approximately match the bisecting line. There are too many high positive
residuals which means that we are likely dealing with a right-skewed
distribution of residuals. The plot on the right looks alright: no systematic
pattern can be seen; it is a random plot. The histogram of both theatre
expenditure and log(theatre expenditure) suggests that a log-
transformation may improve the model, see Fig. B.22 . Log-
transformations are often helpful when the outcome’s distribution is
skewed to the right.

(d) Since the outcome is log-transformed, we can apply the interpretation of a


log-linear model:
Each year’s increase in age yields an times
higher ( 0.38 %) expenditure on theatre visits. Therefore, a 10-year
age difference relates to an times higher
expenditure ( 3.8 %).
Women (gender 1) spend on average (given the other variables)
times more money on theatre visits.
Each 1000 SFR more yearly income relates to an
times higher expenditure on theatre visits. A difference in 10,000
SFR per year therefore amounts to an 8.8 % difference in
expenditure.
Each extra Swiss Franc spent on cultural activities is associated with
an times higher expenditure on theatre visits.
Except for theatre expenditure from the preceding year, all are
significantly different from zero.

(e) While in (b) the residuals were clearly not normally distributed, this
assumption seems to be fulfilled now: the QQ-plot shows dots which lie
approximately on the bisecting line. The fitted values versus residuals plot
remains a chaos plot. In conclusion, the log-transformation of the
outcome helped to improve the quality of the model.

Fig. B.22 Histogram of the outcome


Solution to Exercise 11.6

(a) The multivariate model is obtained by using the lm() command and
separating the covariates with the sign. Applying summary() to the
model returns the comprehensive summary.

The output shows that lower temperature, higher bills, and more ordered
pizzas increase the delivery times. The branch in the East is the fastest, and
so is the driver Domenico. While there are differences with respect to day,
discount customers, and the operator, they are not significant at the 5 %
level.
(b) The confidence intervals are calculated as: . We know
from the model output from (a) that there are 1248 degrees of freedom
(1266 observations − 18 estimated coefficients). The respective quantile
from the t -distribution is obtained with the qt() function. The
coefficients are accessed via the coefficients command (alternatively:
mp$coefficients ); the variances of the coefficients are either
accessed via the diagonal elements of the covariance matrix (
diag(vcov(mp)) ) or the model summary ( summary(mp)[[4]]
[,2] )—both of which are laborious. The summary of coefficients, lower
confidence limit (lcl), and upper confidence limit (ucl) may be summarized
in a matrix, e.g. via merging the individual columns with the cbind
command.
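A sketch of the interval computation described in (b); 'mp' is the fitted model from (a):

est <- coefficients(mp)
se  <- sqrt(diag(vcov(mp)))
q   <- qt(0.975, df = 1248)
cbind(estimate = est, lcl = est - q * se, ucl = est + q * se)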

(c) The variance is estimated as the residual sum of squares divided by the
degrees of freedom, see also ( 11.27 ). Applying the residuals
command to the model and using other basic operations yields an estimated
variance of 28.86936.

Taking the square root of the result yields which is also


reported in the model output from (a) under “Residual standard error”.
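The variance estimate described in (c), with 'mp' again denoting the model from (a):

sum(residuals(mp)^2) / (1266 - 18)         # about 28.87
sqrt(sum(residuals(mp)^2) / (1266 - 18))   # residual standard error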
(d) The sum of squares error is defined as SQ_Error = Σᵢ (yᵢ − ŷᵢ)². The total sum of squares is SQ_Total = Σᵢ (yᵢ − ȳ)². This can be easily calculated in R . The goodness of fit is then obtained as R² = 1 − SQ_Error/SQ_Total. Dividing SQ_Error by n − p − 1 = 1248 and SQ_Total by n − 1 = 1265 yields the adjusted R². This corresponds to the model output from (a).
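A hedged sketch of this calculation; 'mp' is the model from (a) and 'time' as the name of the outcome variable (the delivery time) is an assumption:

SQE <- sum(residuals(mp)^2)
SQT <- sum((time - mean(time))^2)
1 - SQE/SQT                                # R squared
1 - (SQE/(1266 - 18)) / (SQT/(1266 - 1))   # adjusted R squared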
(e) Applying stepAIC to the fitted model (with option “back” for backward
selection) executes the model selection by means of AIC.

The output shows that the full model has an AIC of 4275.15. The smallest
AIC is achieved by removing the operator variable from the model.
The reduced model has an AIC of 4273.37. Removing the discount
customer variable from the model yields an improved AIC (
).

The model selection procedure stops here as removing any variable would
only increase the AIC, not decrease it.
The final model, based on backward selection with AIC, includes day,
driver, branch, number of pizzas ordered, temperature, and bill.
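A sketch of the selection step; 'mps' as the name of the selected model is an assumption:

library(MASS)                              # stepAIC is part of the MASS package
mps <- stepAIC(mp, direction = "back")     # backward selection, as described above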

(f) Fitting the linear model with the variables obtained from (e) and obtaining
the summary of it yields an of 0.3092.

This is only marginally higher than the goodness of fit from the full model
( ). While the selected model is better than the model with
all variables, both, with respect to AIC and , the results are very close
and remind us of the possible instability of applying automated model
selection procedures.

(g) Both the normality assumption and heteroscedasticity can be checked by applying plot() to the fitted model. From the many graphs provided we
concentrate on the second and third of them:
Figure B.23 a shows that the residuals are approximately normally
distributed because the dots lie approximately on the bisecting line. There
are some smaller deviations at the tails but they are not severe. The plot of
the fitted values versus the square root of the absolute values of the
residuals shows no pattern; it is a random plot (Fig. B.23 b). There seems
to be no heteroscedasticity.
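A hedged sketch of the two diagnostic plots; 'mps' (the selected model) is an assumed object name:

plot(mps, which = 2)   # QQ-plot of the standardized residuals
plot(mps, which = 3)   # fitted values versus square root of absolute standardized residuals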

(h) Not all variables identified in (e) necessarily represent a “cause” of delayed or improved delivery times. It makes sense to speculate that
because many pizzas are being delivered (and need to be made!) the
delivery time increases. There might also be reasons why a certain driver is
improving the delivery time: maybe he does not care about red lights. This
could be investigated further given the results of the model above.
However, high temperature does not cause the delivery time to be shorter;
likely it is the other way around: the temperature is hotter because the
delivery time is shorter. However, all of these considerations remain
speculation. A regression model only exhibits associations. If there is a significant association, we know that, given an accepted error (e.g. 5 %), values of the outcome tend to be higher when values of the covariate are higher. This is useful, but it does not say whether the covariate caused the outcome or vice versa.

(i) To check whether it is worthwhile to add a polynomial term, we simply add the


squared temperature to the model. To make R understand that we apply a
transformation we need to use I() .
It can be seen that the null hypothesis is rejected. This
indicates that it is worthwhile to assume a quadratic relationship between
temperature and delivery time.
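A hedged sketch of this step; 'mps' and the variable name 'temperature' are assumptions:

mp2 <- update(mps, . ~ . + I(temperature^2))   # add the squared temperature
summary(mp2)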

(j) The prediction can be obtained by the predict command as follows:

The prediction is 36.5 min and therefore 0.8 min higher than the real
delivery time.
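A hedged sketch of the predict call; 'pizza' as the name of the data set and the use of the first row as the delivery in question are assumptions:

predict(mp2, newdata = pizza[1, ])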

Fig. B.23 Checking the model assumptions

Technical Appendix C
More details on Chap. 3
Proof of equation ( 3.27 ).
We obtain the following expressions for [ i ]–[ iii ]:

Since [ i ] is the within-class variance and [ ii ] is the between-class variance,


Eq. ( 3.27 ) holds.

More details on Chap. 7


Proof of Theorem 7.2.3 . Consider the interval with . From ( 7.12
) it follows that and therefore

Proof of Theorem 7.3.1 .

Proof of Theorem 7.4.1 . We define a discrete random variable Y as


(C.1)
The respective probabilities are and . The
definition of Y in ( C.1 ) implies that

since for Y takes the value and therefore . If


takes the value , and therefore . Using this
knowledge, we obtain

However, for Y we also have

which means that we can summarize the above findings in the following
inequality:

This equates to Tschebyschev’s inequality. Using , i.e.

we obtain the second formula of Tschebyschev’s inequality:

Proof of rule ( 7.30 ). For discrete variables, we have

Since and , we obtain , which is rule


( 7.30 ). In the continuous case, we have

Proof of rule ( 7.33 ). Using , we can write the variance of


bX as
Using ( 7.29 ), we get and . Therefore

Proof of for a perfect linear relationship. If with , we get

because , see p. 147. Then

and therefore

if . Similarly, if with we get .

More details on Chap. 8


Theorem of Large Numbers. To explain the Theorem of Large Numbers, we first need to define stochastic convergence .

Definition C.1
A sequence of random variables, , converges stochastically to 0, if for
any

(C.2)
holds.

This is equivalent to .

Theorem C.1
(Theorem of large numbers) Consider n i.i.d. random variables with
and . It holds that

(C.3)

This implies that converges stochastically to . As another motivation, recall


Definition 7.6.1 where we define random variables to be i.i.d.
(independently identically distributed) if all follow the same distribution and
are independent of each other. Under this assumption, we showed in ( 7.36 ) that
. It follows that the larger n , the smaller the variance. If we apply
Tschebyschev’s inequality (Theorem 7.4.1 ) to , we get the following equation
for :

(C.4)
This means that for each , the right-hand side of the above equation tends
to 1 as which gives a similar interpretation as the Theorem of Large
Numbers.
Central Limit Theorem. Let ( ) be n i.i.d. random variables
with and . If we consider the sum , we obtain
and . If we want to standardize we can
use Theorem 7.3.2 to obtain

(C.5)
i.e. it holds that and .

Theorem C.2
(Central Limit Theorem) Let ( ) be n i.i.d. random variables with
and . denotes the standardized sum of .
The CDF of is

where denotes the CDF of the standard normal distribution.

This theorem tells us that is approximately standard normal distributed if n is


large, i.e.

This is equivalent to saying that is approximately distributed if


n is large, i.e.
(C.6)

As a consequence is distributed for large n , i.e.

In summary, the Theorem of Large Numbers tells us that converges


stochastically to , whereas the Central Limit Theorem tells us that
converges to a -distribution as n tends to infinity.
PDF of the χ²-Distribution. The PDF of the χ²-distribution, with n degrees of freedom, is defined as

f(x) = \frac{x^{n/2-1}\exp(-x/2)}{\Gamma(n/2)\,2^{n/2}}, \quad x > 0.
(C.7)
Note that Γ(n) is the Gamma function, defined as Γ(n) = (n − 1)! for positive integers n and Γ(t) = ∫₀^∞ x^{t−1} exp(−x) dx otherwise.
PDF of the t-Distribution. The PDF of the t-distribution, with n degrees of freedom, is defined as

f(x) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)}\left(1+\frac{x^2}{n}\right)^{-(n+1)/2}, \quad -\infty < x < \infty.
(C.8)
Note that Γ(·) is the Gamma function, as defined above.
PDF of the F-Distribution. The PDF of the F-distribution, with n and m degrees of freedom, respectively, is defined as

f(x) = \frac{\Gamma\left(\frac{n+m}{2}\right)}{\Gamma\left(\frac{n}{2}\right)\Gamma\left(\frac{m}{2}\right)}\left(\frac{n}{m}\right)^{n/2} x^{n/2-1}\left(1+\frac{n}{m}x\right)^{-(n+m)/2}, \quad x > 0.
(C.9)
The PDF of the F-distribution with m and n degrees of freedom can be derived by interchanging the roles of m and n .

More details on Chap. 9


Another Example of Sufficiency. Let be a random sample from
where and are both unknown. We attempt to find a sufficient
statistic for and .

Here the joint density depends on through two statistics


and with . Thus
and are jointly sufficient for and . Therefore,
and are jointly sufficient for and as they are a one-to-
one function of and . They are the maximum likelihood estimates for
and .

More details on Chap. 10


Exact Test of Fisher. Similar to the approximate two-sample binomial test in Sect. 10.4.2 , we assume two samples following a binomial distribution with parameters p₁ and p₂, respectively.

For the sums of these random variables, we get X ∼ B(n₁; p₁) and Y ∼ B(n₂; p₂).
Let Z = X + Y. The Exact Test of Fisher uses the fact that the row marginal frequencies n₁ and n₂ in the following table

              Success   Failure        Total
Population A  X         n₁ − X         n₁
Population B  Z − X     n₂ − (Z − X)   n₂
Total         Z         n₁ + n₂ − Z    n₁ + n₂

are fixed by the sample sizes n₁ and n₂. Conditional on the total number of
successes (i.e. the column margins are assumed to be fixed), the only
remaining random variable is X (since the other three entries of the table are then
determined by the realization x of X and the margins). Under H₀: p_A = p_B = p, it can be shown that

P(X = x \mid Z = z) = \frac{\binom{n_1}{x}\binom{n_2}{z-x}}{\binom{n_1+n_2}{z}},

i.e. X, conditional on Z = z, follows a hypergeometric distribution.

The proof is straightforward using the idea of conditioning on Z = z and the assumption of independence of the two samples:

Note that in the equation above we use the fact that, under H₀, Z = X + Y is B(n₁ + n₂; p) distributed; see the additivity theorem for the binomial distribution, i.e. Theorem 8.1.1 .

Example C.1
Consider two competing lotteries A and B. Say we buy 10 tickets from each
lottery and test the hypothesis of equal winning probabilities. The data can be
summarized in a table:

Winning Not winning Total


Lottery A 1 24 25
Lottery B 7 18 25
Total 8 42 50

In R , we can use the command fisher.test to perform the test. Using


the example data and α = 0.05, we get

with output

For the example data and α = 0.05, the null hypothesis is rejected, since the p-value is lower than α. For the calculation of the p-value, the one-sided and
two-sided cases have to be distinguished. The idea is that while fixing the
margins at the observed values (i.e. 25, 25, 8, 42), we have to calculate the sum
of the probabilities of all tables which have lower probability than the observed
table. In R , one can use the functions dhyper and phyper for calculating
(cumulative) probabilities. For example, P(X = 1), the probability of the observed table, can be calculated as dhyper(1,25,25,8) ≈ 0.0224.
A more extreme table than the observed one is

0   25   25
8   17   25
8   42   50

with probability dhyper(0,25,25,8) ≈ 0.002, which is lower than P(X = 1) ≈ 0.0224. The sum is approximately 0.0244, which is the (left) one-sided p-value. In this case (not generally true!), the two-sided p-value is simply two times the one-sided value, i.e. approximately 0.0488.
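A sketch of the computations in Example C.1 in R:

lottery <- matrix(c(1, 7, 24, 18), ncol = 2,
                  dimnames = list(c("Lottery A", "Lottery B"),
                                  c("Winning", "Not winning")))
fisher.test(lottery)
dhyper(1, 25, 25, 8)                          # probability of the observed table
dhyper(1, 25, 25, 8) + dhyper(0, 25, 25, 8)   # one-sided p-value, about 0.0244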

Remark C.1
The two-sided version of the Exact Test of Fisher can also be used as a test of
independence of two binary variables. It is equivalent to the test of equality of
two proportions, see Example 10.8.2 .

One-Sample χ²-Test for Testing Hypotheses About the Variance. We assume a normal population, i.e. X ∼ N(μ, σ²), and an i.i.d. sample X₁, X₂, …, Xₙ distributed as X. We only introduce the test for two-sided hypotheses

H₀: σ² = σ₀²   versus   H₁: σ² ≠ σ₀².

The test statistic

T(X) = (n − 1) S²_X / σ₀²

follows a χ²-distribution with n − 1 degrees of freedom under H₀. The critical region is constructed by taking the α/2- and (1 − α/2)-quantile as critical values; i.e. H₀ is rejected if

t(x) < c_{n−1; α/2}   or if   t(x) > c_{n−1; 1−α/2},

where c_{n−1; α/2} and c_{n−1; 1−α/2} are the desired quantiles of a χ²-distribution. In R ,


the test can be called by the sigma.test function in the TeachingDemos
library or the varTest function in library EnvStats . Both functions also
return a confidence interval for the desired confidence level. Note that the test is biased. An unbiased level-α test would not take α/2 at each tail but two different tail probabilities α₁ and α₂ with α₁ + α₂ = α.
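A minimal sketch of the two-sided test computed by hand; the sample x and the value sigma0sq are placeholders (not data from the book):

x <- rnorm(25, mean = 0, sd = 2)          # placeholder sample
sigma0sq <- 4                             # hypothesized variance sigma_0^2 (assumption)
n <- length(x)
(n - 1) * var(x) / sigma0sq               # test statistic
qchisq(c(0.025, 0.975), df = n - 1)       # critical values for alpha = 0.05
# alternatively, sigma.test() from TeachingDemos or varTest() from EnvStats can be used, as noted above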
F-Test for Comparing Two Variances. Comparing variances can be of interest when comparing the variability, i.e. the “precision” of two industrial processes; or when comparing two medical treatments with respect to their reliability. Consider two populations characterized by two independent random variables X and Y which follow normal distributions:

X ∼ N(μ_X, σ²_X),   Y ∼ N(μ_Y, σ²_Y).
For now, we distinguish the following two hypotheses:

H₀: σ²_X = σ²_Y versus H₁: σ²_X ≠ σ²_Y,   and   H₀: σ²_X = σ²_Y versus H₁: σ²_X > σ²_Y.

The third hypothesis, with H₁: σ²_X < σ²_Y, is similar to the second hypothesis where X and Y are replaced with each other.
Test Statistic
Let X₁, X₂, …, X_{n₁} and Y₁, Y₂, …, Y_{n₂} be two independent random samples of size n₁ and n₂. The test statistic is defined as the ratio of the two sample variances

T(X, Y) = S²_X / S²_Y,
(C.10)
which is, under the null hypothesis, F-distributed with n₁ − 1 and n₂ − 1 degrees of freedom, see also Sect. 8.3.3 .
Critical Region
Two-Sided Case. The motivation behind the construction of the critical region for the two-sided case, H₀: σ²_X = σ²_Y vs. H₁: σ²_X ≠ σ²_Y, is that if the null hypothesis is true (i.e. the two variances are equal) then the test statistic ( C.10 ) would be 1; also, T(X, Y) > 0. Therefore, very small (but positive) and very large values of T(X, Y) should lead to a rejection of H₀. The critical region can then be written as K = [0, k₁) ∪ (k₂, ∞), where k₁ and k₂ are critical values such that

P(T(X, Y) < k₁ | H₀) = α/2   and   P(T(X, Y) > k₂ | H₀) = α/2.

Here k₁ and k₂ can be calculated from the quantile function of the F-distribution as

k₁ = f_{n₁−1, n₂−1; α/2},   k₂ = f_{n₁−1, n₂−1; 1−α/2}.
Example C.2
Let and . Using the qf command in R , we can
determine the critical values as:
The results are and .

Remark C.2
There is the following relationship between quantiles of the F-distribution:

f_{m, n; q} = 1 / f_{n, m; 1−q}.
One-Sided Case. In the one-sided case, the alternative hypothesis is always formulated in such a way that the variance in the numerator is greater than the variance in the denominator. The hypotheses are H₀: σ²_X = σ²_Y versus H₁: σ²_X > σ²_Y or H₀: σ²_X ≤ σ²_Y versus H₁: σ²_X > σ²_Y. This means it does not matter whether H₁: σ²_X > σ²_Y or H₁: σ²_X < σ²_Y; by constructing the test statistic in the correct way, we implicitly specify the hypothesis. The critical region K consists of the largest values of T(X, Y), i.e. K = (k, ∞), where k has to fulfil the condition

P(T(X, Y) > k | H₀) = α.

The resulting critical value is denoted by k = f_{n₁−1, n₂−1; 1−α}.


Observed Test Statistic
Using the sample variances, the realized test statistic t is calculated as

t(x, y) = s²_x / s²_y.
Test Decisions

Case                         Reject H₀, if
(a) H₁: σ²_X ≠ σ²_Y          t(x, y) < f_{n₁−1, n₂−1; α/2} or t(x, y) > f_{n₁−1, n₂−1; 1−α/2}
(b) H₁: σ²_X > σ²_Y          t(x, y) > f_{n₁−1, n₂−1; 1−α}

Remark C.3
We have tacitly assumed that the expected values and are unknown and
have to be estimated. However, this happens rarely, if ever, in practice. When
estimating the expected values by the arithmetic means, it would be appropriate
to increase the degrees of freedom from to and to .
Interestingly, standard software will not handle this case correctly.

Example C.3
A company is putting baked beans into cans. Two independent machines at two
sites are used. The filling weights are assumed to be normally distributed with
mean 1000 g. It is speculated that one machine is more precise than the other.
Two samples of the two machines give the following results:

Sample   n    Arithmetic mean   Sample variance
X        20   1000.49           72.38
Y        25   1000.26           45.42

With and the hypotheses

we get the following quantiles

The observed test statistic is

Therefore, is not rejected, since . We cannot reject the


hypothesis of equal variability of the two machines. In R , the F -test can be used
using the command var.test .
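A sketch of the calculation in Example C.3, assuming the tabulated values 72.38 and 45.42 are the sample variances:

72.38 / 45.42                           # observed test statistic, about 1.59
qf(0.95, df1 = 20 - 1, df2 = 25 - 1)    # one-sided critical value, about 2.04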

Remark C.4
For the t -test, we remarked that the assumption of normality is not crucial
because the test statistic is approximately normally distributed, even for
moderate sample sizes. However, the F -test relies heavily on the assumption of
normality. This is why alternative tests are often used, for example the Levene’s
test.
More details on Chap. 11
Obtaining the Least Squares Estimates in the Linear Model. The function S (
a , b ) describes our optimization problem of minimizing the residual sum of
squares:

Minimizing S ( a , b ) is achieved using the principle of maxima and minima


which involves taking the partial derivatives of S ( a , b ) with respect to both a
and b and setting them equal to 0. The partial derivatives are

(C.11)

(C.12)
Now we set ( C.11 ) and ( C.12 ) as equal to zero, respectively:

This equates to

Multiplying (I ) by yields

which gives us the solution for a :

Putting this solution into (II ) gives us

Using leads to

If we use
and

we eventually obtain the least squares estimate of b :

Remark C.5
To show that the above solutions really relate to a minimum, and not to a
maximum, we would need to look at all the second-order partial derivatives of S
( a , b ) and prove that the bordered Hessian matrix containing these derivatives
is always positive definite. We omit this proof however.

Variance Decomposition.
We start with the following equation:

If we square both sides we obtain

The last term on the right-hand side is

We therefore obtain

which equates to
The Relation between and .

We therefore obtain

The Least Squares Estimators are Unbiased.

Given that in the model is assumed to be fixed (i.e. non-stochastic and not
following any distribution), we obtain

Since it follows that and therefore

How to Obtain the Variance of the Least Squares Estimator. With the
same arguments as above (i.e is fixed and non-stochastic) and applying the
rule from the scalar case to matrices we obtain:

Maximum Likelihood Estimation in the Linear Model. The linear model


follows a normal distribution:

Therefore, the likelihood function of also follows a normal distribution:

The log-likelihood function is given by

To obtain the maximum likelihood estimates of and , one needs to


obtain the maxima of the above function using the principle of maxima and
minima that involves setting the partial derivatives equal to zero and finding the
solution:

We therefore have

which give us the ML estimates of and in the linear regression model.


Here, we also need to check the Hessian matrix of second-order partial
derivatives to show that we really found a minimum and not a maximum. We
omit this proof however.

Distribution Tables
See Tables C.1 , C.2 and C.3 .

Table C.1 CDF values for the standard normal distribution, . These values can also be obtained in R
by using the pnorm(p) command

z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
0.0 0.500000 0.503989 0.507978 0.511966 0.515953 0.519939 0.523922 0.527903 0.531881 0.535856
0.1 0.539828 0.543795 0.547758 0.551717 0.555670 0.559618 0.563559 0.567495 0.571424 0.575345
0.2 0.579260 0.583166 0.587064 0.590954 0.594835 0.598706 0.602568 0.606420 0.610261 0.614092
0.3 0.617911 0.621720 0.625516 0.629300 0.633072 0.636831 0.640576 0.644309 0.648027 0.651732
0.4 0.655422 0.659097 0.662757 0.666402 0.670031 0.673645 0.677242 0.680822 0.684386 0.687933
0.5 0.691462 0.694974 0.698468 0.701944 0.705401 0.708840 0.712260 0.715661 0.719043 0.722405
0.6 0.725747 0.729069 0.732371 0.735653 0.738914 0.742154 0.745373 0.748571 0.751748 0.754903
0.7 0.758036 0.761148 0.764238 0.767305 0.770350 0.773373 0.776373 0.779350 0.782305 0.785236
0.8 0.788145 0.791030 0.793892 0.796731 0.799546 0.802337 0.805105 0.807850 0.810570 0.813267
0.9 0.815940 0.818589 0.821214 0.823814 0.826391 0.828944 0.831472 0.833977 0.836457 0.838913
1.0 0.841345 0.843752 0.846136 0.848495 0.850830 0.853141 0.855428 0.857690 0.859929 0.862143
1.1 0.864334 0.866500 0.868643 0.870762 0.872857 0.874928 0.876976 0.879000 0.881000 0.882977
1.2 0.884930 0.886861 0.888768 0.890651 0.892512 0.894350 0.896165 0.897958 0.899727 0.901475
1.3 0.903200 0.904902 0.906582 0.908241 0.909877 0.911492 0.913085 0.914657 0.916207 0.917736
1.4 0.919243 0.920730 0.922196 0.923641 0.925066 0.926471 0.927855 0.929219 0.930563 0.931888
1.5 0.933193 0.934478 0.935745 0.936992 0.938220 0.939429 0.940620 0.941792 0.942947 0.944083
1.6 0.945201 0.946301 0.947384 0.948449 0.949497 0.950529 0.951543 0.952540 0.953521 0.954486
1.7 0.955435 0.956367 0.957284 0.958185 0.959070 0.959941 0.960796 0.961636 0.962462 0.963273
1.8 0.964070 0.964852 0.965620 0.966375 0.967116 0.967843 0.968557 0.969258 0.969946 0.970621
1.9 0.971283 0.971933 0.972571 0.973197 0.973810 0.974412 0.975002 0.975581 0.976148 0.976705
2.0 0.977250 0.977784 0.978308 0.978822 0.979325 0.979818 0.980301 0.980774 0.981237 0.981691
2.1 0.982136 0.982571 0.982997 0.983414 0.983823 0.984222 0.984614 0.984997 0.985371 0.985738
2.2 0.986097 0.986447 0.986791 0.987126 0.987455 0.987776 0.988089 0.988396 0.988696 0.988989

2.3 0.989276 0.989556 0.989830 0.990097 0.990358 0.990613 0.990863 0.991106 0.991344 0.991576
2.4 0.991802 0.992024 0.992240 0.992451 0.992656 0.992857 0.993053 0.993244 0.993431 0.993613
2.5 0.993790 0.993963 0.994132 0.994297 0.994457 0.994614 0.994766 0.994915 0.995060 0.995201
2.6 0.995339 0.995473 0.995604 0.995731 0.995855 0.995975 0.996093 0.996207 0.996319 0.996427
2.7 0.996533 0.996636 0.996736 0.996833 0.996928 0.997020 0.997110 0.997197 0.997282 0.997365
2.8 0.997445 0.997523 0.997599 0.997673 0.997744 0.997814 0.997882 0.997948 0.998012 0.998074
2.9 0.998134 0.998193 0.998250 0.998305 0.998359 0.998411 0.998462 0.998511 0.998559 0.998605
3.0 0.998650 0.998694 0.998736 0.998777 0.998817 0.998856 0.998893 0.998930 0.998965 0.998999

Table C.2 (1 − α) quantiles for the t-distribution. These values can also be obtained in R using the qt(p,df) command.

df

0.95 0.975 0.99 0.995


1 6.3138 12.706 31.821 63.657
2 2.9200 4.3027 6.9646 9.9248

3 2.3534 3.1824 4.5407 5.8409


4 2.1318 2.7764 3.7469 4.6041
5 2.0150 2.5706 3.3649 4.0321
6 1.9432 2.4469 3.1427 3.7074
7 1.8946 2.3646 2.9980 3.4995
8 1.8595 2.3060 2.8965 3.3554
9 1.8331 2.2622 2.8214 3.2498
10 1.8125 2.2281 2.7638 3.1693
11 1.7959 2.2010 2.7181 3.1058
12 1.7823 2.1788 2.6810 3.0545
13 1.7709 2.1604 2.6503 3.0123
14 1.7613 2.1448 2.6245 2.9768
15 1.7531 2.1314 2.6025 2.9467

16 1.7459 2.1199 2.5835 2.9208


17 1.7396 2.1098 2.5669 2.8982
18 1.7341 2.1009 2.5524 2.8784
19 1.7291 2.0930 2.5395 2.8609
20 1.7247 2.0860 2.5280 2.8453
30 1.6973 2.0423 2.4573 2.7500

40 1.6839 2.0211 2.4233 2.7045

50 1.6759 2.0086 2.4033 2.6778


60 1.6706 2.0003 2.3901 2.6603
70 1.6669 1.9944 2.3808 2.6479
80 1.6641 1.9901 2.3739 2.6387
90 1.6620 1.9867 2.3685 2.6316
100 1.6602 1.9840 2.3642 2.6259
200 1.6525 1.9719 2.3451 2.6006
300 1.6499 1.9679 2.3388 2.5923
400 1.6487 1.9659 2.3357 2.5882
500 1.6479 1.9647 2.3338 2.5857

Table C.3 Quantiles of the χ²-distribution. These values can also be obtained in R using the qchisq(p,df) command

df

0.01 0.025 0.05 0.95 0.975 0.99


1 0.0001 0.001 0.004 3.84 5.02 6.62

2 0.020 0.051 0.103 5.99 7.38 9.21


3 0.115 0.216 0.352 7.81 9.35 11.3

4 0.297 0.484 0.711 9.49 11.1 13.3


5 0.554 0.831 1.15 11.1 12.8 15.1
6 0.872 1.24 1.64 12.6 14.4 16.8
7 1.24 1.69 2.17 14.1 16.0 18.5
8 1.65 2.18 2.73 15.5 17.5 20.1
9 2.09 2.70 3.33 16.9 19.0 21.7
10 2.56 3.25 3.94 18.3 20.5 23.2
11 3.05 3.82 4.57 19.7 21.9 24.7
12 3.57 4.40 5.23 21.0 23.3 26.2
13 4.11 5.01 5.89 22.4 24.7 27.7
14 4.66 5.63 6.57 23.7 26.1 29.1
15 5.23 6.26 7.26 25.0 27.5 30.6
16 5.81 6.91 7.96 26.3 28.8 32.0
17 6.41 7.56 8.67 27.6 30.2 33.4
18 7.01 8.23 9.39 28.9 31.5 34.8
19 7.63 8.91 10.1 30.1 32.9 36.2
20 8.26 9.59 10.9 31.4 34.2 37.6

25 11.5 13.1 14.6 37.7 40.6 44.3


30 15.0 16.8 18.5 43.8 47.0 50.9
40 22.2 24.4 26.5 55.8 59.3 63.7
50 29.7 32.4 34.8 67.5 71.4 76.2
60 37.5 40.5 43.2 79.1 83.3 88.4
70 45.4 48.8 51.7 90.5 95.0 100.4
80 53.5 57.2 60.4 101.9 106.6 112.3
90 61.8 65.6 69.1 113.1 118.1 124.1
100 70.1 74.2 77.9 124.3 129.6 135.8

Quantiles of the F-Distribution. These quantiles can be obtained in R


using the qf(p,df1,df2) command.

Visual Summaries D
Descriptive Data Analysis
Summary of Tests for Continuous and Ordinal Variables
Summary of Tests for Nominal Variables
References
Adler, J. (2012). R in a Nutshell . Boston: O’Reilly.

Albert, J., & Rizzo, M. (2012). R by Example . New York: Springer.



Bock, J. (1997). Bestimmung des Stichprobenumfangs . Munich: Oldenbourg Verlag. (in German).

Casella, G., & Berger, R. (2002). Statistical inference . Boston, MA: Cengage Learning.

Chow, S., Wang, H., & Shao, J. (2007). Sample size calculations in clinical research . London: Chapman
and Hall.

Crawley, M. (2013). The R book . London: Wiley.



Dalgaard, P. (2008). Introductory statistics with R . Berlin: Springer.

Everitt, B., & Hothorn, T. (2011). An introduction to applied multivariate analysis with R . New York:
Springer.

Groves, R., Fowler, F., Couper, M., Lepkowski, J., Singer, E., & Tourangeau, R. (2009). Survey
methodology. Wiley series in survey methodology . Hoboken, NJ: Wiley.

Hernan, M., & Robins, J. (2017). Causal inference . Boca Raton: Chapman and Hall/CRC.

Hyndman, R. J., & Fan, Y. (1996). Sample quantiles in statistical packages. American Statistician , 50 ,
361–365.

Kauermann, G., & Küchenhoff, H. (2011). Stichproben - Methoden und praktische Umsetzung in R .
Heidelberg: Springer. (in German).

Ligges, U. (2008). Programmieren in R . Heidelberg: Springer. (in German).



R Core Team (2016). R: A language and environment for statistical computing . Vienna, Austria: R
Foundation for Statistical Computing. http://www.R-project.org/ .

Young, G., & Smith, R. (2005). Essentials of statistical inference . Cambridge: Cambridge University Press.

Index
A
Absolute
deviation
mean deviation
median deviation
Additivity theorem
Akaike's information criterion (AIC)
Analysis of variance
ANOVA
Arithmetic mean
properties
weighted
Association

B
Backward selection
Bar chart
Behrens-Fisher problem
Bernoulli distribution
Bias
Binomial
coefficient
distribution
Bivariate random variables
Box plot

C
Calculation rules
CDF
expectation
normal random variables
probabilities
variance
Causation
CDF
calculation rules
joint
quantile
quartile
Central limit theorem
Central tendency
Certain event
χ²
distribution
variance test
χ²-goodness-of-fit test
χ²-independence test
χ²-test of homogeneity
Coding
dummy
effect
Coefficient
binomial
of variation
regression
Combinations
with order
with replacement
without order
without replacement
Combinatorics
Complementary event
Composite event
Conditional
distribution
frequency distribution
probability
relative frequency distribution
Confidence
bound
interval
interval for ; known
interval for ; unknown
interval for
interval for the odds ratio
level
Consistency
Contingency
coefficient
table
Continuous variable
Convergence
stochastic
Correlation coefficient
of Bravais–Pearson
of Spearman
product moment
Covariance
Covariate
Cramer's V
Cross tabulation, see contingency table
Cumulative
distribution function
frequency, see frequency, cumulative
marginal distribution

D
Data
matrix
observation
set
transformation
unit
Decomposition
complete
Degenerate distribution
Density
Design matrix
Dispersion
absolute deviation
absolute mean deviation
absolute median deviation
mean squared error
measure
range
standard deviation
Distribution

Bernoulli
Binomial

conditional
conditional frequency
conditional relative frequency
continuous
cumulative marginal
degenerate
exponential
F
Gauss
geometric
hypergeometric
independent and identical
joint relative frequency
marginal
marginal frequency
marginal relative frequency
multinomial
normal
Poisson
standard
Student
t
uniform discrete
Duality
Dummy variable

E
Efficiency
Elementary event
Empirical cumulative distribution function (ECDF)
Epitools, see R packages
Error
type I
type II
Estimation
interval
least squares
maximum likelihood
method of moments
nonparametric
parametric
Event
additive theorem
certain
composite
disjoint
elementary
impossible
simple
sure
theorem of additivity
Expectation
calculation rules
Expected frequencies
Experiment
Laplace
random
Exponential distribution

F
Factorial function, see function, factorial
F-distribution
Fisher
exact test
Foreign, see R packages
Frequency
absolute
cumulative
expected
relative
table
F-Test
Function
cumulative distribution, see CDF
empirical cumulative distribution, see ECDF
factorial
joint cumulative distribution
joint probability distribution
probability mass, see PMF
step

G
Gamma
of Goodman and Kruskal
Gauss test
one-sample
two-sample
Generalized method of moments
Geometric distribution
Ggplot2, see R packages
Gini coefficient
standardized
Goodman and Kruskal's γ
Goodness of fit
adjusted measure
measure
test
Graph
bar chart
box plot
histogram
kernel density plot
Lorenz curve
pie chart
QQ-plot
random plot
scatter plot
Growth
factor
rate

H
Heteroscedasticity
Histogram
Homoscedasticity
Hypergeometric distribution
Hypothesis
alternative
linear
null
one-sided
two-sided

I
i.i.d.
Impossible event
Independence
pairwise
random variables
stochastic
Ineq, see R packages
Inequality
Tschebyschev
Inference
least squares
maximum likelihood
method of moments
Interaction
Intercept
Interquartile range
Interval estimation
J
Joint
cumulative distribution function
frequency distribution
probability distribution function
relative frequency distribution

K
Kernel density plot
Kolmogorov–Smirnov test

L
Laplace
experiment
probability
Lattice, see R packages
Least squares
Life time
Likelihood
Linear
hypotheses
Linear model
residuals
Linear regression
interaction
Line of equality
Location parameter
Log-linear model
Lorenz curve

M
Mann–Whitney U-Test
Marginal
distribution
frequency distribution
relative frequency distribution
MASS, see R packages
Matrix
covariance
design
Maximum likelihood estimation (MLE)
Mean
arithmetic
properties
weighted arithmetic
Mean squared error (MSE)
Measure
dispersion
symmetric
Measure of association
coefficient
contingency coefficient
correlation coefficient
Cramer's V
odds ratio
rank correlation coefficient
relative risk
Memorylessness
Method of moments
Model
fitted regression model
fitted value
linear
log-linear
nonlinear
Multinomial distribution
Multiple linear regression
Multiplication theorem of probability
Multivariate
Mvtnorm, see R packages

N
Namibia
Newton–Raphson
Nominal variable
Normal distribution

O
Observation
Odds ratio
One-sample problem
Ordered
set
values
Ordinal variable
Outcome

P
Parameter
location
regression
space
PDF
joint
Percentile
Permutation
without replacement
with replacement
Pie chart
Plot
kernel density
QQ
random
scatter
trumpet
Poisson distribution
Polynomial regression
Population
Power
Probability
calculation rules
conditional
density function
Laplace
mass function
posterior
prior
Probability theory
axioms
p-value

Q
QQ-plot
Quantile
Quartile
Quintile

R
R
adjusted
Random variables
bivariate
continuous
discrete
i.i.d
independence
standardization
Range
Realization
Real stories
Reference category
Regression
line
linear
multiple linear
polynomial
Regressor
Relationship
Relative risk
Residuals
standardized
Response
R packages
compositions
epitools
foreign
ggplot2
ineq
lattice
MASS
mvtnorm
ryouready
TeachingDemos
vcd

S
Sample
estimate
pooled variance
space
variance
Sampling
without replacement
Scale
absolute
continuous
interval
nominal
ordinal
ratio
Scatter plot
Set
ordered
unordered
Significance level
Simple event
Slope
Standard deviation
Standard error
Standardization
Standard normal distribution
Statistic
Step function
Stochastic convergence
Stuart's
Sufficiency
Sure event

T
Table
contingency
frequency

of Stuart
T-distribution
Test
ANOVA
Binomial
for variance
of homogeneity
χ²-goodness of fit
χ²-independence
duality
equivalence
F
Fisher
Friedman
Kolmogorov–Smirnov
Kruskal–Wallis
Mann–Whitney U
McNemar
Mood
one-sample Gauss test
one-sample t-test
one-sided
overall F
paired t-test
sign
significance
two-sample binomial
two-sample Gauss test
two-sample t-test
two-sided
U
Welch
Wilcoxon–Mann–Whitney
Wilcoxon rank sum
Theorem
additivity
additivity of variables
additivity of disjoint events
Bayes
central limit
i.i.d.
large numbers
law of total probability
multiplication for probabilities
Neyman–Fisher Factorization
PDF
standardization
Student
Tschebyschev
variance decomposition
Tie
Transformation
T-test
one-sample
paired
two-sample
Two-sample problem

U
Uniform distribution
continuous
discrete
Unit
Unordered set
U test

V
Variable
binary
bivariate random
categorical
continuous
dependent
discrete
dummy
grouped
independent
nominal
ordinal
random
response
standardized
Variance
additivity theorem
between classes
calculation rules
decomposition
dispersion
pooled
within classes
Vcd, see R packages
W
Welch test
Whiskers
Wilcoxon–Mann–Whitney test
