Collection of exercises - Basic statistics
Basic statistics: binomial and normal distribution, sampling
distributions, hypothesis testing, confidence intervals, t-test,
proportion test, chi-square test …
Exercises in statistics
General rules:
- Always use 5% significance levels and 95% confidence levels if nothing else is said.
- Always state your conclusions in words that are connected to the question's topic.
A. Descriptive statistics
Exercise A.1
For each of the following variables, indicate the scale on which it is measured and, if relevant, whether
it is discrete or continuous.
a. Cholesterol in serum
b. Receiver of social benefit (yes or no)
c. Number of biological children
d. Highest educational level (secondary school, high school, university)
e. Scores on an exam (minimum 0, maximum 40)
f. Income
g. Importance of sustainability aspects when choosing a product (not at all important, not
important, important, very important)
Exercise A.2
In two study programs at a university a course evaluation is conducted. One of the questions is: “Are
you satisfied with the quality of your study program?” In the economics program 60 students were
asked and 42 answered the evaluation: 24 answered that they were satisfied. In the education
program 50 students were asked. The number of respondents was 45 and 30 said they were
satisfied.
b. Teachers in the economics program said that the percentages in a. could not be compared, as
fewer economics students answered the evaluation. They also claim that the economics students
could be at least as satisfied as the students in the other program if the number of
respondents had been higher. Are they right?
Exercise A.3
In a switchboard the number of incoming telephone calls is counted during 30 time periods of 2
minutes each. The result is the following:
6 3 6 4 5 2 7 3 5 4 5 3 5 2 4 3 1 5 2 4 5 4 5 3 2 7 2 5 4 6
Exercise A.4
The number of coliforms (on a log-transformed scale) was counted in milk samples.
3.9 5.3 6.1 4.9 9.1 2.8 3.5 3.2 2.6 5.9
a. Compute the mean value and the variance for the sample.
b. Compute the median in the sample.
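As a way to check hand calculations for this exercise, the summary statistics can be sketched in a few lines of Python using only the standard library (the course itself works in R, so this is an illustration of the arithmetic rather than the expected solution format). The variable names are chosen here for illustration.

```python
import statistics

# Log-transformed coliform counts from the exercise
counts = [3.9, 5.3, 6.1, 4.9, 9.1, 2.8, 3.5, 3.2, 2.6, 5.9]

mean = statistics.mean(counts)          # arithmetic mean
variance = statistics.variance(counts)  # sample variance (divides by n - 1)
median = statistics.median(counts)      # mean of the two middle values, since n is even

print(round(mean, 2), round(variance, 3), round(median, 2))
```

Note that `statistics.variance` is the sample variance; `statistics.pvariance` would divide by n instead.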
Exercise A.5
On the course homepage you will find the file cordblood.txt, which contains data from serological
blood samples from the umbilical cord at 369 births in Sweden in 2007. The main purpose of the
study was to investigate the immunity against four common diseases in Sweden: measles, parotitis,
rubella and chicken pox. The variables in the file are the following
a. Read the file into R. Check that all variables have the right format: numerical (continuous) or
factors (categorical). Choose one of the antibody variables to continue.
b. Make a histogram and a boxplot for the chosen antibody variable. Does the distribution of
the variable appear to be symmetric or skewed?
c. Calculate mean, median, variance and interquartile range for the antibody variable. Is it more
appropriate to use mean and variance or median and interquartile range in this case? Why?
d. Make a scatter plot for the antibody variable against the age of the mother. What do you
observe?
e. Make a scatter plot with two antibody variables against each other.
f. Produce boxplots of the antibody variable split up into observations from different hospitals.
Exercise B.1
In a package of tomato seeds 80% germinate. Assume that one seed is selected randomly from the
package.
Exercise B.2
In a package with tomato seeds 85% germinate. Assume that you randomly select 8 seeds from the
package.
Exercise B.3
Among Swedish businesses 72% have a sole proprietor (only one person owns and works in the
business), 24% have between 1 and 9 employees and the remaining 4% have more employees. To
test a questionnaire a pilot study is made. The questionnaire is sent out to 10 randomly selected
businesses in Sweden [1].
a. How high is the probability that all 10 businesses have a sole proprietor?
b. How high is the probability that at most 8 businesses have a sole proprietor?
c. How many big businesses (with more than 9 employees) can you on average expect in such a
random sample of 10 businesses?
d. How many big businesses (with more than 9 employees) can you on average expect in a
random sample of 100 businesses?
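The questions in this exercise follow the binomial model. A minimal Python sketch of that model is given here, assuming the 10 businesses are drawn independently from a very large register. The helper `binom_pmf` is defined for illustration and is not part of any library.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n = 10
p_sole = 0.72  # share of businesses with a sole proprietor
p_big = 0.04   # share of businesses with more than 9 employees

# a. all 10 businesses have a sole proprietor
p_all = binom_pmf(10, n, p_sole)

# b. at most 8 have a sole proprietor: complement of "9 or 10"
p_at_most_8 = 1 - binom_pmf(9, n, p_sole) - binom_pmf(10, n, p_sole)

# c. and d. expected number of big businesses is n * p
expected_big_10 = 10 * p_big
expected_big_100 = 100 * p_big
```

Using the complement in b. avoids summing nine separate terms.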
[1] Using Statistics Sweden's business register as the sampling frame, https://www.scb.se/en/services/statistics-swedens-business-register/
Exercise B.4
The probability that a certain type of bulb lasts more than 1000 hours is 0.9. Three such bulbs are
connected to a lighting system, but work independently of each other. Determine the probability
that after 1000 hours:
Exercise B.5
The normally distributed random variable X has a mean of 50 and a standard deviation of 10.
Compute
a. P(X ≤ 65)
b. P(X < 25)
c. P(X > 35)
d. P(X > 70)
e. P(40 < X < 60)
If X is N(0,1), determine
f. P(X ≤ 1.82)
g. P(X ≤ -0.35)
h. P(-1.2 < X < 0.5)
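Normal probabilities of this kind are usually read from a table after standardising, but they can also be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library and shows the pattern for a few of the items only; the remaining ones follow the same three forms (left tail, right tail, interval).

```python
from statistics import NormalDist

X = NormalDist(mu=50, sigma=10)

p_a = X.cdf(65)              # a. P(X <= 65), left tail
p_c = 1 - X.cdf(35)          # c. P(X > 35), right tail via the complement
p_e = X.cdf(60) - X.cdf(40)  # e. P(40 < X < 60), difference of two left tails

Z = NormalDist()             # standard normal N(0, 1)
p_h = Z.cdf(0.5) - Z.cdf(-1.2)  # h. P(-1.2 < Z < 0.5)

print(round(p_a, 4), round(p_c, 4), round(p_e, 4), round(p_h, 4))
```

For a continuous distribution, P(X < x) and P(X ≤ x) are equal, so the strictness of the inequality does not matter here.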
Exercise B.6
To study the costs of sick leave for small businesses we define the population to be all small
businesses (fewer than 9 employees) in Sweden. The variable of interest is the cost for sick leave (in
thousands of crowns, tkr) during a certain period. We believe that the average cost (µ) is 60 tkr per
business and that the standard deviation (𝜎) is 16 tkr. The random variable “cost of sick leave” is
denoted X.
a. An independent random sample (simple random sample, SRS) of n = 400 businesses is
selected from a very large population and the sample mean 𝑥̅ is computed. What can be said
about the distribution of 𝑥̅ (i.e. the sampling distribution) according to the central limit
theorem?
b. What is the expected value (average value) and the standard deviation in the sampling
distribution for 𝑥̅ ?
c. How high is the probability that the observed sample mean lies between 58.43 and 61.57?
d. If the assumptions we made about the average cost (µ) and the standard deviation (i.e. 60 and
16) were wrong: how would that change the answers in a-c?
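The sampling distribution in b. and c. can be sketched numerically as follows, assuming the central limit theorem applies so that the sample mean is approximately normal with standard deviation σ/√n. This is a Python illustration of the arithmetic; the names are chosen for this sketch.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 60, 16, 400

# Standard deviation of the sample mean (standard error)
se = sigma / sqrt(n)

# Approximate sampling distribution of the sample mean
xbar = NormalDist(mu=mu, sigma=se)

# c. probability that the sample mean falls between 58.43 and 61.57
p_between = xbar.cdf(61.57) - xbar.cdf(58.43)

print(se, round(p_between, 3))
```

The interval limits lie about 1.96 standard errors on either side of µ, which is why the probability comes out close to 0.95.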
Exercise B.7
A business sells 100g-packages of salted nuts. We assume that the package weights are distributed
according to N(100, 1.5), i.e. a normal distribution with expected value 100 gram and a variance of
1.5 gram².
b. How high is the chance that a randomly selected package from the box weighs less than 99
gram?
c. If the business owner selects two packages at random: how high is the probability that both
weigh less than 99 gram?
d. In a box with 1200 packages: how many of the packages are expected to weigh less than 99
gram?
Exercise B.8
The business selling salted nuts wants to control the production process with random tests. Packages
are stored in boxes with 1200 packages [2]. The control process looks like this: From each box 10
packages are collected at random and their average weight 𝑥̅ is computed. If 𝑥̅ deviates from the
expected population mean µ (100 g) by more than 1 gram the entire box is removed and the
packaging process is adjusted.
a. How high is the probability that 𝑥̅ deviates by more than 1 gram from 100 gram if the individual
packages can be assumed to be independent and the weights are distributed according
to N(100, 1.5)?
b. How high is the probability that 𝑥̅ deviates by more than 1 gram from 100 gram if the
individual packages can be assumed to be independent, but the weights are distributed
according to N(99.5, 1.5)?
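The rejection probability of this control rule can be sketched in Python as below. The sketch assumes, as in Exercise B.7, that the second parameter of N(100, 1.5) is the variance, and it ignores any finite population correction. The function name `prob_box_rejected` is made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist

def prob_box_rejected(true_mean, variance=1.5, n=10, target=100.0, tol=1.0):
    """P(|sample mean - target| > tol) when weights are N(true_mean, variance)."""
    se = sqrt(variance / n)                   # standard error of the sample mean
    xbar = NormalDist(mu=true_mean, sigma=se)
    below = xbar.cdf(target - tol)            # sample mean under 99 g
    above = 1 - xbar.cdf(target + tol)        # sample mean over 101 g
    return below + above

p_a = prob_box_rejected(100.0)  # process on target
p_b = prob_box_rejected(99.5)   # process mean shifted down by 0.5 g

print(round(p_a, 4), round(p_b, 4))
```

Keeping the two tails separate matters in b., because the distribution is no longer centred on the target and the tails are unequal.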
Exercise B.9
Beets are filled in cans with 1000 gram per can. The standard deviation for the filled cans is usually
about 8 gram.
a. Approximately what proportion of the cans weigh at least 1010 gram?
b. If you randomly select a can, how high is the probability that it weighs less than 995 gram?
Exercise B.10
The height of a randomly selected child of age 4 can be assumed to be normally distributed with a
mean of 100 cm and a standard deviation of 7 cm.
[2] The box size is not directly used in computations here, but can be useful to determine if the sample is a
small or large part of the population (which here is the 1200 packages in the box). If the sample constitutes a
large part of the population, a ‘finite population correction’ can be useful.
b. In a survey a sample of 100 children is taken. How high is the probability that the mean
height of these children lies between 99 and 102 cm?
Exercise B.11
From a week of production of cans with beets an SRS of 125 cans is taken. If their mean weight
deviates from 1000 gram by more than 3 gram the production process is stopped and checked.
a. If it is true that the population of cans on average weighs 1000 gram and the standard
deviation is 8 gram, how high is the probability of getting a sample mean for the 125 cans that
deviates by more than 3 gram from 1000 g?
b. If, in fact, the population of cans on average weighs 998 gram with a standard deviation of 16
gram, how high is the probability that the sample mean deviates by more than 3 gram from
1000 g?
Exercise B.12
A dog breed suffers from a hereditary (passed on by defective genes) disease XY with a probability of
0.15. During one week 10 dogs of this breed come to your clinic.
a. What is the probability that none of these dogs have the disease?
b. What is the probability that at least 2 dogs have the disease?
c. After the week is over you look at your patient list and realize that 5 of the dogs come
from the same kennel/breeder/family. Does that influence the computations you made in a)
and b)? Why? (Do not make any new calculations).
Exercise B.13
The prevalence of M. paratuberculosis infection, i.e. the risk that a randomly selected macropod is
infected, is believed to be around 2% (Mycobacterium paratuberculosis and kangaroos). Assume that
a random sample of 30 macropods is examined.
a. What is the probability that exactly one macropod will be infected?
b. What is the probability that at least one macropod will be infected?
c. Assume that all 30 macropods were examined, how many infected animals would you on
average expect in the sample?
If we want to compute probabilities for large groups, we usually use a normal approximation to the
distribution above.
d. Compute the probability of observing at least 20 infected macropods among 600 if the
infection rate is again 2%.
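A possible numerical sketch for this exercise, in Python, is given below. It assumes independent animals, and for d. it applies a continuity correction (using 19.5 instead of 20), which is a choice made here; the course may or may not use it.

```python
from math import comb, sqrt
from statistics import NormalDist

p = 0.02  # assumed prevalence
n = 30

# a. exactly one infected animal: binomial probability
p_exactly_one = comb(n, 1) * p * (1 - p) ** (n - 1)

# b. at least one infected animal: complement of "none infected"
p_at_least_one = 1 - (1 - p) ** n

# c. expected number of infected animals
expected_infected = n * p

# d. normal approximation to Bin(600, 0.02)
n_large = 600
mu = n_large * p
sigma = sqrt(n_large * p * (1 - p))
p_at_least_20 = 1 - NormalDist(mu, sigma).cdf(19.5)  # continuity-corrected

print(round(p_exactly_one, 3), round(p_at_least_one, 3), round(p_at_least_20, 4))
```

Without the continuity correction the approximate probability in d. is somewhat smaller, so it is worth stating which version is used.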
Exercise B.14
In the population of clients of an online clothing store, 20% are parents of young children. The
business wants to send out a questionnaire to a sample of clients to investigate the interest in using
a new app, which would simplify online orders. In a first trial the store sends out 7 questionnaires to
test questions and questionnaire routines.
a. How high is the probability that at least 4 of the clients that get the questionnaire are
parents of young children?
b. How high is the probability that none of the clients that get the questionnaire are parents of
young children?
Assume that the sampling frame used is the clients’ register of the clothing store (i.e. the clients that
have accepted to become a member).
e. In which way does this sample frame lead to coverage errors?
Exercise C.1
The cost of sick leave (in thousands of crowns, tkr) during a certain time period for a population of
small businesses in Sweden is investigated. A sample of n = 1400 businesses is drawn from this very
large population. The mean cost for sick leave in the sample 𝑥̅ is 61.5 tkr. Standard deviation is 26
tkr.
Exercise C.2
Measurements of pH in a lake are made on the same day. From earlier studies it is known that the
uncertainty [3] of the measurement can be quantified with a standard deviation of 0.05. The estimate
can be assumed to be unbiased. Therefore the uncertainty of the estimates can be described as an
error term with mean 0 and standard deviation 0.05.
a. Assume that 4 measurements are made and that the mean value is computed for these.
Compute the standard deviation for this mean value, the so-called standard error of the
mean (SEM).
b. Now the researcher conducting the study has received additional funds and wants to increase the
number of measurements to ensure that the standard error of the mean (SEM) is not larger
than 0.01. How many measurements must be made?
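The relation SEM = σ/√n used in this exercise can be sketched in Python as follows. The sketch assumes σ = 0.05, the value given in the first sentence of the exercise, and uses exact fractions so that the required sample size is not affected by floating-point rounding.

```python
from fractions import Fraction
from math import ceil, sqrt

sigma = Fraction("0.05")  # assumed measurement standard deviation

# a. standard error of the mean for n = 4 measurements
sem_4 = float(sigma) / sqrt(4)

# b. smallest n with sigma / sqrt(n) <= target, i.e. n >= (sigma / target)^2
target = Fraction("0.01")
n_needed = ceil((sigma / target) ** 2)

print(sem_4, n_needed)
```

Because n enters through a square root, cutting the SEM by a given factor requires increasing n by the square of that factor.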
Exercise C.3
A survey is made among Swedish 12-year-olds to determine how much they watch TV during an
average week.
a. An SRS of 500 12-year-olds is drawn. The mean number of hours watched is 15.9 hours
and the standard deviation is 9.5. Compute a 95% confidence interval for µ.
[3] This uncertainty can include spatial variability, instrumental/analytical errors and possibly observer
uncertainty, i.e. differences that arise when different persons collect the samples.
b. How many observations are needed in order to get a confidence interval that is at most 1
(hour) wide?
c. To use the arithmetic mean is one way to summarise the data. Suggest some others.
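For parts a. and b., the interval and the required sample size can be sketched in Python as below. The sketch uses the normal quantile instead of the t quantile, which is an assumption that is reasonable here because n = 500 is large, and it treats the sample standard deviation as a planning value for b.

```python
from math import ceil, sqrt
from statistics import NormalDist

n, mean, sd = 500, 15.9, 9.5
z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval

# a. confidence interval: mean +/- z * sd / sqrt(n)
half_width = z * sd / sqrt(n)
lower, upper = mean - half_width, mean + half_width

# b. total width 2 * z * sd / sqrt(n) <= 1  gives  n >= (2 * z * sd / 1)^2
max_width = 1.0
n_needed = ceil((2 * z * sd / max_width) ** 2)

print(round(lower, 2), round(upper, 2), n_needed)
```

Note that "at most 1 hour wide" is read here as the full width of the interval, not the half-width.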
Exercise C.4
A store wants to know how much first-year students spend on clothes during the first month
of the term. A random sample of 9 students gives the following expenses in crowns:
a. Compute a 95% confidence interval for µ, the average expense for clothes in the population
(= among first-year students).
During the next year more time and money are put into this study and a sample of 50 first-year students
is used. The sample mean is 366 crowns and the standard deviation 120 crowns.
Exercise C.5
The number of coliforms (on a log-transformed scale) was counted in milk samples.
3.9 5.3
6.1 4.9
9.1 2.8
3.5 3.2
2.6 5.9
a. Compute a 95% confidence interval for the mean of (log-transformed) counts of coliforms
assuming normal distribution using R. Interpret the results.
b. Outliers can be a problem in statistical analysis. A value of 9.1 is observed here, which is much
larger than the other observations. What would the effect of removing this observation be on
the analysis? Consider first and then redo the confidence interval with the observation 9.1
removed [4] (use R).
[4] Remember that in reality we should never remove observations unless we have a good reason to do so,
e.g. that the measurement instrument failed or a person made an error in measuring. Also, if a value is
completely unreasonable it can be removed, e.g. a weight of -3.
D. Confidence intervals for proportions
Exercise D.1
In a report from Public Health England on Campylobacter contamination in fresh whole UK-produced
chilled chickens the following table is given. The sample size is given in the first column and number
of positive samples in the other columns labelled n.
a. Verify the computed confidence interval for the proportion of free-range chickens that show
more than 1000 cfu of Campylobacter spp. Use the normal approximation to do this (in the
article the Clopper-Pearson exact method is used, which means that results can deviate a little).
b. Use the exact binomial test in R to verify the same confidence interval as in a.
Exercise D.2
Researchers believe that breeding increases the prevalence of an innate disease ZZ in dogs. A study
was made 30 years ago and the prevalence was estimated to be 0.06, but no other information on
the study design or data collection was available. In a new study 479 dogs are randomly selected; 34
dogs tested positive for the ZZ disease.
a. Compute a confidence interval for the proportion of dogs that have disease ZZ based on the
data from the new study.
b. Can the confidence interval be used to determine if there is a significant difference between
the prevalence now and 30 years ago? How?
An insurance company wants to estimate the prevalence of the disease with a higher accuracy, i.e.
lower the uncertainty of the estimate. For this they decide to make a new study and compute a 95%
confidence interval.
c. If they want the confidence interval to be no wider than 0.01: How many dogs do they
need to examine?
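A sketch of the normal-approximation interval for a proportion, and of the sample-size calculation, is given here in Python. Two choices in it are assumptions made for illustration: the width 0.01 is read as the full width of the interval, and the new estimate is used as the planning value for the prevalence (using 0.5 instead would give a more conservative, much larger n).

```python
from math import ceil, sqrt
from statistics import NormalDist

n, positives = 479, 34
p_hat = positives / n
z = NormalDist().inv_cdf(0.975)

# a. Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
se = sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - z * se, p_hat + z * se)

# b. does the interval cover the old prevalence estimate?
covers_old_value = ci[0] <= 0.06 <= ci[1]

# c. full width 2 * z * sqrt(p(1-p)/n) <= 0.01, solved for n
max_width = 0.01
n_needed = ceil((2 * z / max_width) ** 2 * p_hat * (1 - p_hat))

print(round(ci[0], 3), round(ci[1], 3), covers_old_value, n_needed)
```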
E. Tests for mean values
Exercise E.1
The threshold value for mercury in fish is 0.5 mg/kg. In a certain lake researchers want to
investigate if the mercury level in European perch lies significantly above this threshold value. For
this 23 fish are analysed and a mean of 0.65 and a standard deviation of 0.15 are recorded.
b. Determine if the average mercury concentration lies significantly above the threshold value [5].
c. How high must the observed mean in the sample be in order to reject the null hypothesis?
d. How does the sample size influence the conclusions drawn in b.? Suppose that an industry is
responsible for sampling in the lake. How can they benefit from statistical uncertainty in this
case to avoid taking measures?
Exercise E.2
The weight of packages of salted nuts is controlled by the producing business. The expected weight
of an individual package is 100 gram. A box contains 1200 packages of which 10 are randomly
selected. The sample gives a mean weight of 99 gram and a standard deviation of 1.7 gram. The
package weights can be assumed to be normally distributed.
a. Test if the package weights on average deviate significantly from 100 gram. Use a 5%
significance level.
b. How would the analysis and the results change if instead a sample of 30 packages is used?
c. How high is the risk for the error of first kind (type I error) in a and b, respectively?
d. What is the error of second kind (type II error)? Will it change if the sample size changes?
Justify your answer; do not make any computations.
Exercise E.3
The cost of sick leave is investigated with a sample of 1400 small businesses. The mean cost in the
sample is 𝑥̅ = 61.5 and the standard deviation is 26.
a. Test the hypothesis that the population mean is 60 tkr against the alternative hypothesis
that it deviates from 60. Use a 5 % significance level.
[5] A test like the one in b. is sometimes called a benefit-of-doubt test, as we do not reject the null
hypothesis unless the mean lies significantly above the threshold value.
One-Sample T
Test of μ = 60 vs ≠ 60
N     Mean    StDev   SE Mean  95% CI            T     P
1400  61.500  26.000  0.695    (60.137, 62.863)  2.16  0.031
Identify the p-value. How can you use the p-value to reach the same conclusion as in a.?
c. How would the computations in a) change if you want to test the alternative hypothesis that
the cost of sick leave is larger than 60 tkr?
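The numbers in the printed one-sample output can be cross-checked with a short Python sketch. It uses the normal distribution in place of the t distribution with 1399 degrees of freedom, an approximation assumed here because the sample is very large, so the last digits of the interval can differ slightly from the printed ones.

```python
from math import sqrt
from statistics import NormalDist

n, mean, sd, mu0 = 1400, 61.5, 26.0, 60.0

se = sd / sqrt(n)            # "SE Mean" in the output
t_stat = (mean - mu0) / se   # "T" in the output

Z = NormalDist()
p_two_sided = 2 * (1 - Z.cdf(abs(t_stat)))  # alternative: mean differs from 60
p_one_sided = 1 - Z.cdf(t_stat)             # alternative: mean larger than 60

z = Z.inv_cdf(0.975)
ci = (mean - z * se, mean + z * se)

print(round(se, 3), round(t_stat, 2), round(p_two_sided, 3), round(p_one_sided, 3))
```

The one-sided p-value is half of the two-sided one here because the observed mean lies on the side of the alternative.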
Exercise F.1
A prototype of an app is published by an online store and a survey is made among its customers. The
clients are asked if the app would simplify their online shopping and the following alternatives can be
chosen from: ‘Strongly agree’, ‘Agree’, ‘Disagree’, ‘Strongly disagree’. The company has decided that
they will only go further with the development of the app if at least 60% of the clients answer
‘Strongly agree’ on this question.
a. The questionnaire is sent out to 185 clients and 113 answer ‘Strongly agree’. Test on a 5 %
significance level if the proportion of clients that answer ‘Strongly agree’ lies over 60%.
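A one-sided test of a proportion with the normal approximation could be sketched in Python like this. The sketch assumes that all 185 clients respond and that the standard error is computed under the null hypothesis p = 0.6; the function name is invented for the illustration.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test_greater(successes, n, p0):
    """z statistic and one-sided p-value for H0: p = p0 against H1: p > p0."""
    p_hat = successes / n
    se0 = sqrt(p0 * (1 - p0) / n)  # standard error under H0
    z = (p_hat - p0) / se0
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

z_stat, p_value = proportion_z_test_greater(113, 185, 0.60)
reject_h0 = p_value < 0.05

print(round(z_stat, 2), round(p_value, 3), reject_h0)
```

The same function applies to other one-sided proportion questions in this collection by changing the three arguments.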
Exercise F.2
In a middle-sized city a census was made 10 years ago concerning smoking habits of 18-year-olds. It
was found that 30 % smoked regularly. This year the study is followed up using a survey to see if the
proportion of smokers has changed. Among 2500 18-year-olds 400 were selected randomly. 96 of
these smoked regularly.
a. Test on a 5 % significance level if the proportion of persons smoking regularly has changed
compared to 10 years ago.
b. Compute a 95% confidence interval for the proportion of 18-year-olds that smoke now. Can
this interval be used to conduct the test in a?
Exercise F.3
Researchers believe that breeding increases the prevalence of an innate disease ZZ in dogs. A study
was made 30 years ago and the prevalence was estimated to be 0.06, but no other information on
the study design or data collection was available. In a new study 479 dogs are randomly selected; 34
dogs tested positive for the ZZ disease.
a. Is the prevalence of this innate disease significantly different now compared to 30 years ago?
G. Test for means in two samples
Exercise G.1
In a large municipality the goal of a survey is to investigate the number of sick leave days for men and
women. Two independent samples are drawn, one containing men and one women. For each of the
samples the number of sick days over a long period was registered. The following results were
obtained:
a. Test the hypothesis that the average number of sick leave days is the same for men and
women. Use a two-sided alternative hypothesis.
b. In this study the sample was drawn randomly among all men and women employed by the
municipality. Name some problems that can arise with this approach. Suggest another way
to conduct the survey.
Exercise G.2
In a study comparing fishing nets two different products are studied. During different months these
two nets are used for fishing in the same area and the same time interval and the amount of fish
caught is registered. The catch is given in average catch per day (in kg) for eight different months:
a. Test on a 5 % significance level if there is a difference in the average catch for the two fishing
nets.
b. The test you conduct in a. should be a test for paired samples. Why? Why is a paired test
more efficient than comparing the means over all eight months with each other, i.e. assuming
independent samples?
Exercise G.3
To compare fish sizes for different years some researchers have sampled Atlantic Cod during 1985 (15
fish) and again 2005 (17 fish). In both years the length of the fish was noted (among other variables).
The results for the study are given below as length in cm:
1985          2005
120  105      115   96
148   83       63   93
 70  152      126  110
123  184      170   92
154  102       73  109
118  179       65  154
 98  134       58  113
167           120   97
              125
∑ 𝑥𝑖 = 1937    ∑ 𝑥𝑖 = 1779
a. Compute mean and median for Atlantic cod for the two years, respectively.
b. Determine if the length of Atlantic cod differs significantly for these two years. Assume that
the variances are equal and that the data is normally distributed.
c. The Wilcoxon rank sum test could also be conducted and would give the results below.
Interpret the results:
d. What is the difference between conducting a t-test and a Wilcoxon rank sum test? In which
situations would you choose one over the other?
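The descriptive part of this exercise and the pooled t statistic can be sketched in Python as below. The two lists are read off the table in the exercise (15 values for 1985, 17 for 2005) and checked against the printed sums. The p-value is left out because the Python standard library has no t distribution, so the statistic would still be compared with a t table using the degrees of freedom computed at the end.

```python
import math
import statistics

# Lengths in cm, read column by column from the table in the exercise
cod_1985 = [120, 148, 70, 123, 154, 118, 98, 167, 105, 83, 152, 184, 102, 179, 134]
cod_2005 = [115, 63, 126, 170, 73, 65, 58, 120, 125, 96, 93, 110, 92, 109, 154, 113, 97]

# Sanity check against the printed column sums
assert sum(cod_1985) == 1937 and sum(cod_2005) == 1779

n1, n2 = len(cod_1985), len(cod_2005)
mean1, mean2 = statistics.mean(cod_1985), statistics.mean(cod_2005)
median1, median2 = statistics.median(cod_1985), statistics.median(cod_2005)

# Pooled variance and two-sample t statistic (equal variances assumed)
var1, var2 = statistics.variance(cod_1985), statistics.variance(cod_2005)
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
t_stat = (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

print(round(mean1, 1), median1, round(mean2, 1), median2, round(t_stat, 2), df)
```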
Exercise G.4
The effect of a treatment to lower blood pressure in cats is examined. The blood pressure of 8 cats is
measured on day 1. During day 2 to day 14 the cats receive amlodipine (a medicine) once daily. The
blood pressure is again measured on day 15 in the study. The following results are noted and
observations can be assumed to be normally distributed:
The goal of the medication is not only a decrease in blood pressure but to get a healthy level of the
blood pressure.
b. Determine a 95% confidence interval for the blood pressure on day 15. Interpret the results.
c. Assume that a healthy blood pressure for cats lies in the interval 130 to 160. Is the goal of a
healthy blood pressure level achieved with this medication? No computations are needed;
discuss and justify your answer from the data and prior computations.
Exercise G.5
The weight of eggs (in gram) is studied on a farm and for this a sample of 23 eggs is taken. Egg weights
can be assumed to be normally distributed.
50 43 39 46
58 62 45 48
51 37 68 58
56 41 42 52
54 68 57 43
49 58 53
a. Compute the median and the 25th and 75th percentiles for the egg weights.
b. Compute a 99% confidence interval for the mean egg weight. Interpret the interval.
c. After this study was done the farm started with a new diet for chickens in hope that the eggs
might increase in size and weight. 2 years later a new study was made and 25 eggs were
measured. It resulted in a mean egg weight of 55.65 and a standard deviation of 8.73. Compare
the mean egg weights from the two studies using an appropriate statistical test at the 5%
significance level.
d. In this assignment, it was given that we can assume weights to be normally distributed. If that
information was not given: How would you approach the problem?
Exercise G.6
For the cordblood.txt data we want to conduct hypothesis tests to see whether or not male and
female babies have the same level of measles antibodies. The distribution of this data is rather
skewed, i.e. not normal. On the other hand we have quite many observations and the distribution of
the mean should be approximately normal according to the Central limit theorem. We have here (at
least) 3 different options (a-c):
a. Run a two sample t-test in R assuming that the mean value is normally distributed due to the
large sample size
b. Run a two sample t-test on log-transformed data. Observations will after log-transformation
be close to a normal distribution
c. Run a Wilcoxon rank sum test, which uses the ranks of data and does not require normal
distribution.
d. Compare the results from the three approaches. Do you prefer one of these? Why?
e. Are there any significant differences in measles antibodies for male and female babies? How
high is the difference?
f. Check if the distribution of the log-transformed number of antibodies for measles is closer to a
normal distribution than that of the original data (use histograms, boxplots or qq-plots).
Comment.
Exercise G.7
Use again the cordblood.txt data.
a. Create age classes for the age of the mother with approx. 5 year intervals starting from 15
(use e.g. the case_when statement in mutate)
b. Test if there is a significant difference in Measles antibody levels for babies from mothers
aged 20-24 compared to mothers aged 35-40. (Choose either log-transformed data or non-
parametric tests as you like). Draw conclusions from the test.
Exercise H.1
In a survey targeted at people who want to stop smoking the most common nicotine
replacements are studied: nicotine patches and nicotine gum. The following results were obtained
6 months after the start of the study:
a. Determine with a hypothesis test on the 5 % significance level if there is a difference in the
results of the two methods.
Exercise H.2
In the WHO report Insecticide-treated nets and malaria prevalence, Papua New Guinea, 2008–2014
the prevalence of malaria is studied during different time periods and in villages below 1600 m in
altitude. The following table is presented [6]:
[6] Before presenting the data, age standardisation was conducted to make the different surveys comparable. Find
some more details on this here https://en.wikipedia.org/wiki/Age_adjustment in case you are interested.
Assume that the observations made are independent.
a. Test if the prevalence of malaria of all species is significantly different when comparing 2008-
2009 and 2010-2011. Observe that you might need to determine the number of cases from
the percentage to make all computations.
b. In the report it is stated that “Nationally, in villages below 1600 m in altitude, the age-
standardized prevalence of malaria, as diagnosed by light microscopy, decreased significantly
from 11.1% (95% confidence interval, CI: 8.5–14.3) in 2008–2009 to 5.1% (95% CI: 3.6–7.4) in
2010–2011 (P < 0.001) …”. Use R to conduct the test in a. and determine the p-value from
your output. Is it similar to the one given in the article?
c. Try to verify the confidence interval of the prevalence of malaria during 2010-2011. Use R or
hand calculations. Hint: You might not get the same results here.
d. When describing the background of the data it is stated: “In both surveys, five villages were
randomly selected from each of the country’s 20 provinces – organized in four regions –
using a list of villages identified in the 2000 national census – the most up-to-date. Not all
provinces or selected villages could be included because of problems with access and
security”. How does this information influence the choice of method and results in a.,b. and
c.?
Exercise H.3
In a large study two different climatic areas are examined to determine the proportion of diseased
trees. We found that in
a. Determine if the proportion of diseased trees is significantly different in the two forests.
c. When you meet the researcher responsible for collecting the data you ask what the sampling
design looked like. This is the answer you get: “We had a list with all forests in climatic area
A, from this we sampled one forest and collected samples in this forest from 532 trees.
Similarly for climatic area B.” Is this a good study? Why/why not? Suggest a better way of
collecting data for the study. What if you need to consider keeping within a (tight) budget –
would you change the design?
I. Chi-square tests and contingency tables
Exercise I.1
In the WHO report Low-risk planned caesarean versus planned vaginal delivery at term: early and
late infantile outcomes on risks of neonatal morbidity in Iran, the following table with characteristics
of the mothers is presented:
a. Verify the conducted test for the maternal education level. Use a chi-square test.
b. Are the conditions for a chi-square test in a. fulfilled?
c. Conduct both the chi-square test and the Fisher’s exact test in R and check how results are
affected. Which test would you choose?
Exercise I.2
280 random inhabitants are chosen in a city and asked about traffic disturbance. The following
results were collected:
a. Test the hypothesis that the opinion about traffic disturbance is dependent on age. Use a 5%
significance level.
b. Compute a 95% confidence interval for the proportion of inhabitants that are not disturbed.
Exercise I.3
In a study of trees 4 different climatic areas were investigated and it is noted how affected the trees
are on a scale with 3 levels (healthy, slightly affected, diseased)
a. Test if there is an association between the climatic areas and the health status of the trees
with an appropriate test. Use R.
b. How many degrees of freedom does the appropriate chi-square distribution have?
Exercise J.1
In the report ’Prevalence and genetic diversity of Campylobacter spp. in environmental water
samples from a 100-square-kilometer predominantly dairy farming area’ you can find the following
table 2 (only a part is shown).
a. Consider the first test, which analyses whether the type of water source influences the presence/absence of
campylobacter. Note that you might have to compute counts from the available information
before you can make the necessary computations for the test. Conduct an appropriate test
to determine if there is a relation between water source and presence/absence of
campylobacter. Draw conclusions.
b. Redo the test in a. using R and verify the p-value given in the table above.
Exercise J.2
In an industry soup cans are produced. The weight of the cans, including soup, should be normally
distributed with a standard deviation of 12 gram and a mean of 475. A sample of 25 cans is taken and
their mean is computed to be 481 grams.
a. Test at a 5% significance level if the mean weight of the cans deviates significantly from the
goal weight of 475 grams.
b. What would the answer to a. be if the sample only consisted of 10 cans, but both mean and
standard deviation are unchanged?
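Because the standard deviation is stated as known in this exercise, a z-test is a natural reading. A minimal Python sketch under that assumption is given below, with a helper function named here only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_sided(sample_mean, mu0, sigma, n):
    """z statistic and two-sided p-value for a mean with known sigma."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z_25, p_25 = z_test_two_sided(481, 475, 12, 25)  # a. sample of 25 cans
z_10, p_10 = z_test_two_sided(481, 475, 12, 10)  # b. sample of 10 cans

print(round(z_25, 2), round(p_25, 4), round(z_10, 2), round(p_10, 4))
```

The same observed difference of 6 gram gives a smaller test statistic when n is smaller, because the standard error of the mean grows.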
Exercise J.3
A market research institute makes a survey to investigate if ’systembolaget’ should be abolished. A
sample of 1000 persons is drawn from the Swedish population over 18 years. The following
statement is presented:
Also the respondent’s gender is noted and the result is the following:
A year later the same question is asked in a survey but several alternatives were allowed as answers.
The following results are obtained:
Exercise J.4
Two production plates of sausages (A and B) are compared regarding the moisture contents (in %) [7].
For plate A 30 sausages are sampled, but for plate B only 25 could be measured. The following results
were obtained:
∑ x_A = 2067    ∑ x_B = 1902
a. Test on the 1 % significance level if there are any differences in moisture contents between
production plate A and B.
b. Compute a 99 % confidence interval for the average moisture contents of sausages on plate
B.
c. The moisture content we analyse is a % value. We have earlier learned that proportions are
analysed by specific proportion tests, not using mean values. Why do we use mean values
here?
Exercise J.5
A veterinarian wants to investigate if the method of insemination influences the gender of the foal. We
collect the following data:
a. Test if the proportion of fillies is significantly different from 0.5 when natural mating is used.
b. Test if there is a relation between the method of insemination and the gender of the foal.
Exercise J.6
A company that sells cell phones has a market share of 22.8%. By advertising and launching
new products, the company hopes to increase its market share. After a month they evaluate
the advertising strategy and find that 4130 of a total of 17500 new subscriptions were bought from
the company.
a. Test if there was a significant increase in the market share after advertising.
b. Compute a 99% confidence interval for the market share.
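The computations for this exercise can be sketched in a few lines. The following Python sketch (the course itself uses R) assumes a one-sided, large-sample z-test for a proportion with the standard error taken under the null hypothesis, and a Wald interval for the confidence interval. Both are common textbook choices, not the only possible ones.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.228            # market share before advertising
x, n = 4130, 17500    # subscriptions bought from the company, total subscriptions

p_hat = x / n

# Part a: one-sided z-test of H0: p = p0 against H1: p > p0.
se_null = sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se_null
p_value = 1 - NormalDist().cdf(z)

# Part b: 99% Wald confidence interval around the observed proportion.
z_crit = NormalDist().inv_cdf(0.995)
se_hat = sqrt(p_hat * (1 - p_hat) / n)
ci_low = p_hat - z_crit * se_hat
ci_high = p_hat + z_crit * se_hat

print(f"p_hat = {p_hat:.4f}, z = {z:.3f}, one-sided p-value = {p_value:.4f}")
print(f"99% CI: ({ci_low:.4f}, {ci_high:.4f})")
```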
⁷ https://blog.kett.com/bid/362219/moisture-content-vs-water-activity-use-both-to-optimize-food-safety-and-quality
Exercise J.7
From a large population of animals that have a specific disease it is known that 60% of all animals
recover without treatment. The remaining animals must be treated once or several times. The
probabilities of recovery are given below for each number of treatments.
a. A random sample of n=4 animals is drawn. How high is the probability that none of the four
animals needs treatment?
b. A random sample of n=12 animals is drawn. How high is the probability that none of these 12
animals needs treatment?
c. How high is the probability that two or more of the 12 animals need treatment?
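Parts a. to c. only use the 60% recovery rate, so they can be checked with a binomial model. The following Python sketch assumes that the animals are independent and that "needs treatment" is the complement of "recovers without treatment", giving a treatment probability of 0.4. The helper binom_pmf is written for this illustration.

```python
from math import comb

p_treat = 0.40  # assumed: 1 - 0.60, probability that an animal needs treatment


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p_a = binom_pmf(0, 4, p_treat)     # a. none of 4 animals needs treatment
p_b = binom_pmf(0, 12, p_treat)    # b. none of 12 animals needs treatment
p_c = 1 - binom_pmf(0, 12, p_treat) - binom_pmf(1, 12, p_treat)  # c. two or more of 12

print(f"a. {p_a:.4f}  b. {p_b:.5f}  c. {p_c:.4f}")
```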
Exercise J.8
Is there a relation between age and the consumption of meat? In a research study 550 persons were
asked if they eat meat or not:
a. Test at the 5% significance level if there is a relation between age and meat consumption.
b. Compute a 95% confidence interval for the proportion of persons 18-25 years old that do not
eat meat.
Exercise J.9
In a telephone switchboard incoming calls were registered during 39 time periods, each 5 minutes
long. These are the results:
12 16 24 13 8 15 22 18 14 10
18 26 17 12 11 17 21 18 16 5
10 17 22 15 9 13 25 14 18 14
7 12 25 22 16 21 15 22 17
In an attempt to draw more conclusions from the data, all time periods were classified as A or B,
where A is night time and weekends and B is daytime on working days. The information is given
below in the same order as the data above.
A B A A A B B A B A
A B A A B B B A B A
B B A A A B B B B B
A A A B B A A A B
c. Redo the visualisation of the data, now split up into the two categories A and B. Make one
visually appealing graph.
d. Again compute mean value and variance, now split up into the two categories.
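The split into the two categories in d. can be sketched as follows. This is a minimal Python illustration (the course itself uses R). The data and the labels are copied row by row from the listings above, and the variance computed is the sample variance with divisor n - 1.

```python
from statistics import mean, variance

# Number of calls per 5-minute period, in the order given in the exercise.
calls = [
    12, 16, 24, 13, 8, 15, 22, 18, 14, 10,
    18, 26, 17, 12, 11, 17, 21, 18, 16, 5,
    10, 17, 22, 15, 9, 13, 25, 14, 18, 14,
    7, 12, 25, 22, 16, 21, 15, 22, 17,
]

# Category of each period (A = night time and weekends, B = daytime on working days).
labels = "ABAAABBABA" "ABAABBBABA" "BBAAABBBBB" "AAABBAAAB"

# Collect the counts belonging to each category.
groups = {"A": [], "B": []}
for count, label in zip(calls, labels):
    groups[label].append(count)

for label, values in groups.items():
    print(
        f"{label}: n = {len(values)}, "
        f"mean = {mean(values):.3f}, variance = {variance(values):.3f}"
    )
```

The same groups dictionary can be passed on to a plotting library for the split visualisation in c.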
Exercise J.10
Sia and her supervisor plan an upcoming study on blood values of dogs. The aim is to understand
the effect of two different diets in combination with two very specific training programs. The
supervisor explains that there are four dogs available for the study and that they will therefore
randomise one dog to each combination of diet and training. They will use t-tests to compare results
for different combinations with each other.
a. Which arguments should Sia use to explain to her supervisor that it is not possible to make
statistical decisions based on one observation per combination?
After hearing Sia’s arguments the supervisor is still sceptical. “Well”, he says, “let’s just run the
computations in R and we can discuss later how to describe them”.
b. How should Sia explain that it is mathematically impossible to do the calculations and,
therefore, no computations can be made in R?
“Ok”, the supervisor says. “Here is what we can do: Just use the four samples that we take from the
dogs and rerun the blood analysis a couple of times. Then we have additional observations and we
can run the analysis.”
c. What are such replicates called? Which variation can be quantified by them and which
variation cannot? Should Sia be satisfied with this? Would it help to make additional blood
tests at different time points for each of the four dogs to gain replicates?
Exercise J.11
A producer of bags wants to know how much weight they can carry. 100 bags are selected at
random from the production and the carried weight (x, in kg) is noted when the bag bursts. The
following sums were computed from the data:
Σx = 2 028    Σx² = 42 798
a. Compute a 95% confidence interval for the weight the bags can carry.
b. The normal distribution plays an important role when computing the interval in a. What needs to
be normally distributed and how can that be checked or verified?
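The interval in a. can be computed directly from the two sums. The following Python sketch uses only the standard library, so it takes the critical value from the normal distribution. With n = 100 the t quantile with 99 degrees of freedom is only slightly larger (about 1.98 instead of 1.96), which is the value a hand computation or R would normally use.

```python
from math import sqrt
from statistics import NormalDist

n = 100
sum_x = 2028       # sum of the observed weights
sum_x2 = 42798     # sum of the squared weights

mean_x = sum_x / n
# Sample variance from the sums: (sum of squares - (sum)^2 / n) / (n - 1)
var_x = (sum_x2 - sum_x**2 / n) / (n - 1)
std_error = sqrt(var_x / n)

# Assumption: normal quantile used in place of the t quantile with 99 df.
crit = NormalDist().inv_cdf(0.975)
ci_low = mean_x - crit * std_error
ci_high = mean_x + crit * std_error

print(f"mean = {mean_x:.2f} kg, s = {sqrt(var_x):.3f} kg")
print(f"approximate 95% CI: ({ci_low:.2f}, {ci_high:.2f}) kg")
```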
Exercise J.12
a. A data set is collected in the following way. 30 different forests with Scots pine are
selected. They lie very far from each other and are randomly spread over Sweden. From each
forest you randomly select 20 trees to observe the needle density. Assume that you would
like to make a confidence interval for the mean needle density of Scots pines in Sweden.
How would you approach the problem? Briefly describe the different steps you would take
in the analysis, but do not do any computations.
b. When checking the properties of a data set you get the following plot. What is it used for
and which conclusion would you draw from this specific plot?
Exercise J.13
When fishing salmon in a large lake the probability of catching a fish that is longer than 1.5 meters is
0.35. If Peter and Sally are out fishing at this lake and catch 6 fish in total, what is the probability that
Another day Peter and Sally are fishing with a net on a boat on the same lake. They catch 40 fish.
c. What is the probability that at least 15 are 1.5 meters or longer? Use a normal approximation.
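The normal approximation in c. can be sketched as follows. This Python illustration assumes that the 40 fish are independent with the same probability 0.35, and it applies a continuity correction, which is a common but optional refinement. The exact binomial probability is computed alongside for comparison.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 40, 0.35

# Normal approximation to Bin(n, p): same mean and standard deviation.
mu = n * p
sigma = sqrt(n * p * (1 - p))

# P(X >= 15) with continuity correction: evaluate the normal tail at 14.5.
p_normal = 1 - NormalDist(mu, sigma).cdf(14.5)

# Exact binomial probability for comparison.
p_exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(15, n + 1))

print(f"normal approximation: {p_normal:.4f}")
print(f"exact binomial:       {p_exact:.4f}")
```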
Exercise J.14
A questionnaire was sent out to 5400 owners of agricultural farms. Among other questions the
following two questions are asked:
“How important are sustainability/environmental topics in your work?” with answers: Very Important
• Important • Moderately Important • Slightly Important • Not Important
“How often do you spray pesticides?” with answers: Never • Less frequent than my adviser’s
recommendation • According to my adviser’s recommendation • More frequent than my adviser’s
recommendation
a. Which test would be appropriate for analysing the collected data? If there are
several options describe the difference between them and recommend one. What would be
the null hypothesis of the recommended test?
b. What are the assumptions of the test you recommended in a.? List all of them, even if you have
already commented on them in a.
Exercise J.15
The average weight of eggs is compared for two breeds. For the first breed (AG) 15 eggs were chosen
randomly and for the second breed (BH) 17 eggs were chosen. See R output below.
a. Is there a significant difference between the mean egg weights for the two breeds?
Justify your answer.
b. Write down the null and alternative hypotheses of the test.
Results from R:
Two Sample t-test
data: AG and BH
t = 1.8168, df = 30, p-value = 0.07926
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.460199 42.099415
sample estimates:
mean of AG mean of BH
124.4667 104.6471
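To see how the numbers in this output fit together, the pieces can be recombined by hand. The following Python sketch only uses values printed in the output above and derives the standard error and the critical t value that R must have used. It is a consistency check, not a re-analysis, since the raw egg weights are not given.

```python
# Values copied from the R output above.
t_stat = 1.8168
ci_low, ci_high = -2.460199, 42.099415
mean_ag, mean_bh = 124.4667, 104.6471

# The confidence interval is centred on the difference in sample means.
diff = mean_ag - mean_bh
ci_center = (ci_low + ci_high) / 2
half_width = (ci_high - ci_low) / 2

# t = diff / standard error, and half-width = critical t value * standard error.
std_error = diff / t_stat
t_crit = half_width / std_error

print(f"difference in means: {diff:.4f} (CI centre: {ci_center:.4f})")
print(f"implied standard error: {std_error:.3f}")
print(f"implied critical t value (df = 30): {t_crit:.3f}")
```

The implied critical value can be compared with the 97.5% quantile of the t distribution with 30 degrees of freedom from a table.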
Exercise J.16
In a stable we have a large group of sheep. Of these sheep 5% have the disease GHG, which is hard to
diagnose. For an experiment you would prefer healthy sheep, but cannot afford to test the sheep.
Instead you simply select sheep at random and hope they are healthy.
a. If you need 5 sheep: How large is the probability that at least two sheep are sick?
b. If you need 5 sheep: How large is the probability that more than two sheep are sick?
c. If you need 10 sheep: How large is the probability that all sheep are healthy?
d. If you need 70 sheep: How large is the probability that more than 8 sheep are sick? Use an
appropriate normal approximation.
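The four probabilities can be checked with a short script. The following Python sketch assumes that the group is large enough for the number of sick sheep in a sample to be treated as binomial with p = 0.05. For d. it uses a normal approximation with continuity correction, as one possible reading of "appropriate". The helper binom_pmf is written for this illustration.

```python
from math import comb, sqrt
from statistics import NormalDist

p_sick = 0.05


def binom_pmf(k, n, p):
    """Probability of exactly k sick sheep among n independently chosen sheep."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# a. and b.: 5 sheep
p_a = 1 - sum(binom_pmf(k, 5, p_sick) for k in range(2))   # at least two sick
p_b = 1 - sum(binom_pmf(k, 5, p_sick) for k in range(3))   # more than two sick

# c.: 10 sheep, all healthy
p_c = binom_pmf(0, 10, p_sick)

# d.: 70 sheep, more than 8 sick, i.e. P(X >= 9), evaluated at 8.5.
n_d = 70
mu = n_d * p_sick
sigma = sqrt(n_d * p_sick * (1 - p_sick))
p_d = 1 - NormalDist(mu, sigma).cdf(8.5)

print(f"a. {p_a:.5f}  b. {p_b:.6f}  c. {p_c:.4f}  d. {p_d:.4f}")
```

Note that the expected number of sick sheep in d. is only 3.5, so the quality of the normal approximation is itself worth discussing.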
Datasets in this collection of exercises
Cordblood.txt