Chapter 1 Introduction
Basic Biostatistics
Wullo S. (MPH)
7/9/21 1
Chapter One
1.1 Introduction to Biostatistics
Definition and classification of Biostatistics
Classification of Biostatistics
Descriptive biostatistics
A statistical method concerned with the collection,
organization, summarization, and analysis of data from a
sample of a population.
Inferential biostatistics
A statistical method concerned with drawing
conclusions (inferences) about a particular population by
selecting and measuring a random sample from that population.
Biostatistics has two branches:
Descriptive statistics: collecting, organizing, summarizing, and presenting data.
Inferential statistics: making inferences, hypothesis testing, determining relationships, and making predictions.
Descriptive Biostatistics
Inferential Biostatistics
1.2 Stages in statistical investigation
There are five stages or steps in any statistical investigation.
1. Collection of data
The process of obtaining measurements or counts.
2. Organization of data
Includes editing, classifying, and tabulating the data
collected.
3. Presentation of data
Gives an overall view of what the data actually look like.
Facilitates further statistical analysis.
Can be done in the form of tables, graphs, or diagrams.
4. Analysis of data
The aim is to dig out useful information for decision making.
It involves extracting relevant information from the data
(such as the mean, median, mode, range, and variance).
5. Interpretation of data
Concerned with drawing conclusions from the data
collected and analyzed; and giving meaning to analysis
results.
It is a difficult task that requires a high degree of skill and
experience.
1.3 Definition of Some Basic terms
Sampling: The process or method of sample selection from the
population.
Sample size: The number of elements or observations to be
included in the sample.
Variable: A characteristic or attribute that can assume different
values in different persons, places, or things.
Some examples of variables include:
Diastolic blood pressure,
Heart rate,
Height,
Weight
Data: Refers to a collection of facts, values, observations, or
measurements that the variables can assume.
Uses of statistics:
Limitations of statistics
Statistics deals only with aggregates of facts, not with individual data
items.
Statistical data are only approximately, not mathematically,
correct.
Statistics can easily be misused and should therefore be used
by experts.
1.5 Types of Variables and Measurement Scales
A variable is a characteristic or attribute that can assume
different values in different persons, places, or things.
Examples :
age,
diastolic blood pressure,
heart rate,
the height of adult males,
the weights of preschool children,
gender of Biostatistics students,
marital status of instructors at University of Gondar,
ethnic group of patients
A. Depending on the characteristic of the measurement, a variable can be:
Qualitative (categorical) variable
A variable or characteristic which cannot be measured in
quantitative form but can only be identified by name or category,
for instance place of birth, ethnic group, type of drug, stage of
breast cancer (I, II, III, or IV), or degree of pain (minimal, moderate,
severe, or unbearable).
The categories should be clear cut, not overlapping, and cover all the
possibilities. For example, sex (male or female), vital status (alive or
dead), disease stage (depends on disease), ever smoked (yes or no).
Quantitative (numerical) variable
A variable that can be measured and expressed numerically.
Example: survival time, systolic blood pressure, number of
children in a family, height, age, body mass index.
Quantitative variables can be of two types: discrete and continuous.
Discrete Variables
Have a set of possible values that is either finite or
countably infinite.
The values of a discrete variable are usually whole
numbers.
Numerical discrete data occur when the observations are
integers that correspond to a count of some sort.
Some common examples are:
Number of pregnancies,
The number of bacteria colonies on a plate,
The number of cells within a prescribed area upon microscopic
examination,
The number of heart beats within a specified time interval,
A mother’s number of births (parity) and
pregnancies,
The number of episodes of illness a patient experiences during
some time period, etc.
Continuous Variables
Observations are not restricted to take on certain numerical
values: Often measurements (e.g., height, weight, age).
Continuous data are used to report a measurement of the
individual that can take on any value within an acceptable
range.
Nominal Scale
Other examples:
Sex, Marital status, Geographic location, Ethnic group, Brand choice,
Social status, Days of the week (months), Seasons, Types of restaurants, Religion,
Job type: executive, technical, clerical
Categories can be given numeric codes, for example one category coded as "0" and the other coded as "1".
Ordinal Scale
Level of measurement which classifies data into categories that can be
ranked. However, precise differences between the ranks do not exist.
Interval Scales
• Level of measurement which classifies data that can be ranked
and differences are meaningful. However, there is no meaningful
zero, so ratios are meaningless.
• All arithmetic operations except division are applicable.
• Relational operations are also possible.
Examples:
IQ
Temperature in °F.
Interval Scale
Numerically equal distances on the scale represent equal values in
the characteristic being measured. An interval scale contains all the
information of an ordinal scale, but it also allows you to compare the
differences between objects.
An interval scale assumes that the measurements are made in equal units,
i.e. gaps between whole numbers on the scale are equal
(e.g. the Fahrenheit and Celsius temperature scales).
An interval scale does not have a true zero,
e.g. a temperature of "zero" does not mean that there
is no temperature; it is just an arbitrary zero point.
Permissible statistics: counts/frequencies, mode, median,
mean, and standard deviation.
Ratio Scales
Primary Scales of Measurement
Nominal scale example: numbers assigned to runners (e.g. 4, 81, 9).
Example variables:
Gender, Grade (A, B, C, D, and F), Rating scale (poor, good, excellent),
Eye colour, Political affiliation, Religious affiliation,
Ranking of tennis players, Major field, Nationality,
Height, Weight, Time, Age, IQ, Temperature, Salary
University of Gondar
College of medicine and health science
Department of Epidemiology and Biostatistics
Wullo S. (MPH)
Reading assignment
Methods of data collection
• Having collected and edited the data, the next important step is to
organize it.
• The process of arranging data into classes or categories according to
similarities is called classification.
• Classification is a preliminary step that prepares the ground for proper
presentation of data.
• The presentation of data is broadly classified into the following two
categories:
• Tabular presentation
• Diagrammatic and graphic presentation
Tabular presentation of data
• Frequency distribution: is the organization of raw
data in table form using classes and frequencies.
• Frequency: is the number of values in a specific class
of the distribution
• Raw data: recorded information in its original
collected form, whether counts or
measurements.
Frequency distribution (F.D.)…
Categorical F.D…
Ungrouped FD for Discrete Variables
Discrete/Ungrouped FD…
No. of children        2  3  4  5  6  7  8  Total
No. of families (f)    5  7  8  4  1  2  3  30
Continuous/grouped F.D
Continuous/grouped F.D…
Example: Consider the following FD on wages of 100
workers in a factory.
Wage (CI)   Freq.   CB's
40-44       6       39.5-44.5
45-49       9       44.5-49.5
50-54       15      49.5-54.5
55-59       17      54.5-59.5
60-64       20      59.5-64.5
65-69       13      64.5-69.5
70-74       12      69.5-74.5
75-79       8       74.5-79.5
Continuous/grouped F.D…
The lowest and height values that can be included in a class such
The lower class limit of the first class should be the smallest
Add the size of a class on the lower class limit to obtain the
To find the upper limit of the first class, subtract U from
the lower limit of the second class. Then add the class width to
this upper limit to find the rest of the upper limits, or use
UCL = LCL + (W - 1) when U = 1.
Continuous/grouped F.D…
4. Determine the Class boundaries
– Let U =LCL of a class – UCL of preceding class. Add half of
this difference (U/2) to all upper class limits to get the upper
class boundaries (UCBs), and subtract (U/2) from all lower
class limits to get the lower class boundaries (LCBs).
– UCBi = UCLi +U/2
– LCBi = LCLi – U/2
44 50 79 63 66 54 56 70 56 63
60 87 60 70 59 60 62 88 71 53
56 65 74 80 51 83 69 77 69 50
58 42 43 85 43 75 55 60 58 49
72 67 55 77 48 45 61 47 44 61
Solution:
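Step 1: Find the highest and the lowest values: H = 88, L = 42.
Step 2: Find the range: R = H - L = 88 - 42 = 46.
Step 3: Select the number of classes using Sturges' formula:
k = 1 + 3.322 log n = 1 + 3.322 log 50 = 6.64 ≈ 7.
Step 4: Find the class width: W = R / k = 46 / 7 = 6.57 ≈ 7 (rounding up).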
Step 5: Select the starting observation as lowest class limit (this is
usually the lowest observation). Add the width to that observation
to get the lower limit of the next class. Keep adding until there are
7 classes.
42, 49, 56, 63, 70, 77, 84 are the lower class limits.
Step 6: Find the upper class limits; e.g. the first upper class limit = 49 - U
= 49 - 1 = 48.
48, 55, 62, 69, 76, 83, 90 are the upper class limits.
So combining step 5 and step 6, one can construct the following classes.
Class limits
42-48
49-55
56-62
63-69
70-76
77-83
84-90
Step 7: Find the class boundaries by subtracting 0.5 from each lower class limit and
adding 0.5 to each upper class limit:
LCBi = LCLi - U/2 and UCBi = UCLi + U/2
Example: For class 1, LCB1 = 42 - 0.5 = 41.5 and
UCB1 = 48 + 0.5 = 48.5
Then continue adding W on both boundaries to obtain the rest boundaries. By
doing so one can obtain the following classes.
Class boundary
41.5 – 48.5
48.5 – 55.5
55.5 – 62.5
62.5 – 69.5
69.5 – 76.5
76.5 – 83.5
83.5 – 90.5
Step 8: Tally the data.
Step 9: Write the numeric values for the tallies in the
frequency column.
Step 10: Find cumulative frequency.
Step 11: Find relative frequency and /or relative
cumulative frequency.
The complete frequency distribution follows
Total 50 1
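The steps above can be sketched in Python as a check on the hand calculation. This is a minimal illustration, not the course's own procedure: the variable names are hypothetical, and it assumes Sturges' formula and the class width are both rounded up and that U = 1, as in the worked example.

```python
import math

# The 50 observations from the worked example.
data = [
    44, 50, 79, 63, 66, 54, 56, 70, 56, 63,
    60, 87, 60, 70, 59, 60, 62, 88, 71, 53,
    56, 65, 74, 80, 51, 83, 69, 77, 69, 50,
    58, 42, 43, 85, 43, 75, 55, 60, 58, 49,
    72, 67, 55, 77, 48, 45, 61, 47, 44, 61,
]

n = len(data)
low, high = min(data), max(data)
k = math.ceil(1 + 3.322 * math.log10(n))  # Sturges' formula, rounded up
width = math.ceil((high - low) / k)       # class width, rounded up
unit = 1                                  # U: gap between consecutive class limits

rows = []
cumulative = 0
for i in range(k):
    lcl = low + i * width                 # lower class limit
    ucl = lcl + width - unit              # upper class limit
    freq = sum(lcl <= x <= ucl for x in data)
    cumulative += freq
    rows.append((lcl, ucl, lcl - unit / 2, ucl + unit / 2, freq, cumulative, freq / n))

frequencies = [row[4] for row in rows]
for lcl, ucl, lcb, ucb, f, cf, rf in rows:
    print(f"{lcl}-{ucl}  {lcb}-{ucb}  f={f}  cf={cf}  rf={rf:.2f}")
```

Each printed row gives the class limits, class boundaries, frequency, cumulative frequency, and relative frequency, which correspond to steps 5 to 11.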
Continuous/grouped F.D…
57, 53, 65, 55, 50, 45, 64, 52, 16, 46,
42, 63, 33, 64, 53, 25, 54, 35, 48, 55,
70, 47, 39, 58, 52, 36, 65, 75, 26, 20,
55, 60, 83, 61, 45, 63, 49, 42, 35, 18,
51, 45, 42, 65, 39, 59, 45, 41, 30, 40.
Solution:
i. Using Sturges' rule, the number of classes is:
k = 1 + 3.322 log 50 = 6.64 ≈ 7.
ii. Range = highest value – lowest value
= 83 – 16 = 67.
iii. Class width: w = Range / k = 67 / 7 = 9.57 ≈ 10.
iv. Since the smallest value is 16, LCL1 is 16 and
UCL1 is 25, and the frequency distribution is:
Ages Freq.
16-25 4
26-35 5
36-45 12
46-55 14
56-65 12
66-75 2
76-85 1
Total 50
Example: The class marks and class boundaries of the
above Example are:
CL Freq. CM CB
16-25 4 20.5 15.5-25.5
26-35 5 30.5 25.5-35.5
36-45 12 40.5 35.5-45.5
46-55 14 50.5 45.5-55.5
56-65 12 60.5 55.5-65.5
66-75 2 70.5 65.5-75.5
76-85 1 80.5 75.5-85.5
Total 50
Cumulative frequency distributions
A cumulative frequency distribution tells us how often the values fall below or above
a given class boundary. There are two types of CFD: the "less than" type and the "more than" type.
Example: For the data in the above Example, both
cumulative frequency distributions are given below:
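As a rough sketch, both cumulative frequency distributions can be computed from the class frequencies of the age example with running totals. The names `less_than` and `more_than` are illustrative only.

```python
from itertools import accumulate

# Class limits and frequencies from the age example above.
classes = ["16-25", "26-35", "36-45", "46-55", "56-65", "66-75", "76-85"]
freqs = [4, 5, 12, 14, 12, 2, 1]

# "Less than" type: running total starting from the first class.
less_than = list(accumulate(freqs))
# "More than" type: running total starting from the last class.
more_than = list(accumulate(reversed(freqs)))[::-1]

for c, f, lt, mt in zip(classes, freqs, less_than, more_than):
    print(f"{c:>6} {f:>3} {lt:>3} {mt:>3}")
```

The "less than" totals are read against the upper class boundaries and the "more than" totals against the lower class boundaries.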
Diagrammatic and Graphical Methods of Data
Presentation
A F.D can be presented graphically or diagrammatically.
Advantages
• To understand the information easily.
• To make the data attractive.
• To make comparisons of items easy.
• To draw attention of the observer.
The purpose of graphs and diagrams is not to provide exact and
detailed information, but simple comparisons. Any further
information shall rather be obtained from the original data.
2.2.2 Diagrammatic Presentation of Data
• The box shows the distance between the first and the third
quartiles,
Illustration of a box-plot using the ages of 15 patients:
35 35 36 37 37 38 42 43 43 44 45 48 48 51 55
Min = 35, Q1 = 37, Q2 = 43, Q3 = 48, Max = 55
A box-plot indicating the distribution of blood lead level of
individuals by sex
Histogram
Example*: histogram of the blood glucose level data used in the grouped frequency distribution example above.
[Figure: histogram. Horizontal axis: blood glucose level, class boundaries 41.5 – 48.5 through 83.5 – 90.5. Vertical axis: frequency.]
Frequency Polygon
A line graph of class frequencies plotted against class marks.
To draw a frequency polygon, we connect the midpoints of
the tops of the histogram bars by straight lines.
[Figure: frequency polygon. Horizontal axis: class marks (blood glucose level), 38 to 94. Vertical axis: frequency (number of patients).]
Ogive (cumulative frequency polygon)
• A graph showing the cumulative frequency (less than or more than
type) plotted against upper or lower class boundaries respectively.
• That is class boundaries are plotted along the horizontal axis and
the corresponding cumulative frequencies are plotted along the
vertical axis.
• The points are joined by a free hand curve.
• Example: Draw an ogive curve(less than type) for the above data.
(Example *)
[Figure: ogive graph (cumulative "less than" type). Horizontal axis: class boundaries, 41.5 to 90.5. Vertical axis: cumulative frequency.]
University of Gondar
College of Medicine and Health Sciences
Institute of Public Health
Department of Epidemiology and Biostatistics
Learning outcomes
• After completing this chapter, a student will be able to:
• List and calculate measures of central tendency
• List and calculate measures of dispersion
• List and calculate measures of shape
A good average should possess the following properties:
Example
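• Consider the data on the birth weights (in kilograms) of 10 newborn
children at University of Gondar Hospital:
2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88,
2.43.
Then the average birth weight can be computed as:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{2.51 + 3.01 + \dots + 2.43}{10} = \frac{25.72}{10} = 2.572 \text{ kg} \]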
Arithmetic mean cont…
When the data are given in the
form of a frequency distribution, i.e. there are k
distinct values such that a value Xi has a
frequency fi (i = 1, 2, ..., k), the arithmetic
mean is given by
\[ \bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} \]
Example
Solution:
\[ \bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{18 \times 2 + 19 \times 1 + 20 \times 4 + \dots + 29 \times 12}{2 + 1 + 4 + \dots + 12} = \frac{3180}{130} \approx 24.46 \]
Exercise
• Consider the following frequency distribution
table
Data value 10 20 30 40 50 60 70 80 90 100 110
Frequency 3 5 6 8 10 12 15 10 10 12 5
Mean for Grouped Data
Example: calculate the mean for the grouped
distribution table given below:
Class Frequency
6-10 35
11-15 23
16-20 15
21-25 12
26-30 9
31-35 6
• Solution
Class   Class mark (Xm)   Frequency (fi)   fi × Xm
6-10 8 35 280
11-15 13 23 299
16-20 18 15 270
21-25 23 12 276
26-30 28 9 252
31-35 33 6 198
Total 100 1,575
• Therefore, the mean is Σ(fi × Xm) / Σfi = 1,575 / 100 = 15.75.
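The grouped-mean calculation in the table can be reproduced with a short Python sketch. The variable names are illustrative, and the class mark is taken as the midpoint of the class limits, as in the table.

```python
# Class limits and frequencies from the grouped example above.
classes = [(6, 10), (11, 15), (16, 20), (21, 25), (26, 30), (31, 35)]
freqs = [35, 23, 15, 12, 9, 6]

# Class mark = midpoint of the lower and upper class limits.
marks = [(lower + upper) / 2 for lower, upper in classes]

total_f = sum(freqs)                                   # sum of fi
total_fx = sum(f * m for f, m in zip(freqs, marks))    # sum of fi * Xm
grouped_mean = total_fx / total_f

print(total_f, total_fx, grouped_mean)
```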
Properties of the arithmetic mean
– The mean can be used as a summary measure for both discrete
and continuous data, in general however, it is not appropriate for
either nominal or ordinal data.
– For a given set of data there is one and only one arithmetic mean.
Median
• An alternative measure of central location, perhaps
second in popularity to the arithmetic mean.
• Suppose there are n observations in a sample.
• If these observations are ordered from smallest to
largest, then the median is defined as follows:
• The median, is a value such that at least half of the
observations are less than or equal to median and
at least half of the observations are greater than or
equal to median .
• The median is the midpoint of the data array.
Ungrouped data
• If the number of observations is odd, the median is defined
as the [(n+1)/2]th observation.
• If the number of observations is even, the median is the
average of the two middle values, i.e. the (n/2)th and [(n/2)+1]th observations.
• To find the median of a data set:
– Arrange the data in ascending order.
– Find the middle observation of this ordered data.
Example 1 (n is even): 19, 20, 20, 21, 22, 24, 27, 27,
27, 34
• Here n = 10, so the median is the average of the 5th and 6th values: (22 + 24)/2 = 23
Example 2
The number of children with asthma during a specific year in
seven local districts clinic is shown. Find the median for this
data set.
253, 125, 328, 417, 201, 70, 90
Solution:
First we must arrange the data in ascending order
70, 90, 125, 201, 253, 328, 417
Therefore, the fourth observation is the median of the data, i.e.
the value 201 is the median value
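The two rules for ungrouped data can be written as one small function. This is a sketch with a hypothetical helper name, `median_ungrouped`, applied to the two examples above.

```python
def median_ungrouped(values):
    """Median of ungrouped data using the odd/even position rules."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the [(n + 1) / 2]th observation (index mid, counting from 0).
        return ordered[mid]
    # Even n: average of the (n/2)th and [(n/2) + 1]th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


example_1 = [19, 20, 20, 21, 22, 24, 27, 27, 27, 34]
example_2 = [253, 125, 328, 417, 201, 70, 90]
print(median_ungrouped(example_1), median_ungrouped(example_2))
```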
Exercise
• The waiting times for a first job for a
selected sample of nine people with
different fields of specialization are given
below.
Median cont…
Median for grouped data.
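If the data are given in the form of a continuous frequency
distribution, the median is defined as:
\[ \tilde{x} = L_{med} + \frac{W}{f_{med}} \left( \frac{n}{2} - C \right) \]
where L_med is the lower class boundary of the median class, W is the class width, f_med is the frequency of the median class, n is the total number of observations, and C is the cumulative frequency of the class preceding the median class.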
Merits and Demerits of Median
Merits:
• Median is a positional average and hence not influenced by extreme observations.
Mode for Grouped data
In grouped data, we usually refer to the modal class, class
with highest frequency. If a single value for the mode of
grouped data must be specified, it is taken as:
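\[ \text{Mode} = L + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times w \]
where L is the lower class boundary of the modal class, w is the class width, Δ1 is the frequency of the modal class minus the frequency of the preceding class, and Δ2 is the frequency of the modal class minus the frequency of the following class.
Class   Frequency
15-19 6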
20-24 19
25-29 50
30-34 57
35-39 48
40-44 27
45-49 21
Total 228
The Mode…
Solution: By inspection (simply looking at the
frequencies), the mode lies in the fourth class, where L
= 29.5, fmod = 57, f1 = 50, f2 = 48, w = 5, and
Δ1 = 57 - 50 = 7, Δ2 = 57 - 48 = 9.
Hence Mode = 29.5 + [7 / (7 + 9)] × 5 ≈ 31.69.
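The grouped-mode formula can be checked with a few lines of Python. The function name `grouped_mode` and its parameter names are hypothetical; the inputs are the values read off the frequency table in the solution above.

```python
def grouped_mode(lower_boundary, f_mod, f_prev, f_next, width):
    """Mode of grouped data: L + d1 / (d1 + d2) * w."""
    d1 = f_mod - f_prev   # modal frequency minus preceding class frequency
    d2 = f_mod - f_next   # modal frequency minus following class frequency
    return lower_boundary + d1 / (d1 + d2) * width


# Modal class 30-34: L = 29.5, fmod = 57, f1 = 50, f2 = 48, w = 5.
mode = grouped_mode(29.5, 57, 50, 48, 5)
print(round(mode, 2))
```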
Merits and Demerits of Mode
Merits:
It is not affected by extreme observations.
Easy to calculate and simple to understand.
It can be calculated for distribution with open end class.
Demerits:
It is not rigidly defined.
It is not based on all observations.
It is not suitable for further mathematical treatment.
It is not stable average, i.e. it is affected by fluctuations of sampling
to some extent.
Often its value is not unique.
Quartiles
B. Even:
For grouped data:
\[ Q_i = L_{Q_i} + \frac{W}{f_{Q_i}} \left( \frac{iN}{4} - C \right), \quad i = 1, 2, 3 \]
Measure of variation/dispersion
Definition:
Measure of variation cont…
A good measure of variation should possess the following properties:
• It should be easy to compute and understand.
• It should be based on all observations.
• It should be uniquely defined.
• It should be capable of further algebraic treatment.
• It should be affected as little as possible by extreme values.
Measure of variation Cont…
Absolute and relative measures
Measures of dispersion may be either absolute or relative
1. Absolute measures of dispersion (AMD): An absolute
measure is expressed in the same units as the original data,
such as kilograms or tonnes.
• These measures are suitable for comparing the variability in two
distributions having variables expressed in the same units and of the
same average size.
• These measures are not suitable for comparing the variability in two
distributions having variables expressed in different units.
Measure of variation cont…
Various measures of dispersions are in use. The most commonly used measures of dispersions are;
Range cont…
• It is based upon two extreme cases in the entire distribution,
the range may be considerably changed if either of the
extreme cases happens to drop out, while the removal of
any other case would not affect it at all.
• It wastes information: it takes no account of the rest of the data.
Quartiles and Inter-quartile Range, Percentiles
Ages of 15 patients, in ascending order:
35 35 36 37 37 38 42 43 43 44 45 48 48 51 55
Q1 = 37, Q2 = 43, Q3 = 48
• IQR = Q3 - Q1 = 48 - 37 = 11
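As an illustrative check, the quartiles of the 15 ages can be computed with Python's standard `statistics` module. The choice of `method="exclusive"` is an assumption made here because it places the quartiles at the (n + 1)/4, 2(n + 1)/4 and 3(n + 1)/4 positions; other quartile conventions can give slightly different values.

```python
import statistics

ages = [35, 35, 36, 37, 37, 38, 42, 43, 43, 44, 45, 48, 48, 51, 55]

# Quartiles at positions (n + 1) * i / 4 for i = 1, 2, 3.
q1, q2, q3 = statistics.quantiles(ages, n=4, method="exclusive")

iqr = q3 - q1                   # inter-quartile range
qd = iqr / 2                    # quartile deviation
cqd = (q3 - q1) / (q3 + q1)     # coefficient of quartile deviation

print(q1, q2, q3, iqr, qd, round(cqd, 3))
```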
Quartile deviation (QD)
The range expresses the extreme variability of
observations of a variable.
The quartile deviation is half of the interquartile range: QD = (Q3 - Q1) / 2.
Coefficient of quartile deviation (CQD)
\[ CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1} \]
Variance and Standard Deviation
• Variance measures how far, on average, scores
deviate or differ from the mean.
Variance
I.e. The sample variance, denoted by s2 , of a set of n
Standard deviation
• The variance has a drawback: because the
deviations are squared, its units are also squared.
To get back to the original unit of measurement,
we take the square root of the variance.
Standard cont…
• Consider the following three datasets
Next, subtract the mean from each value and square the result.
Then sum up all the squared values.
Exercise
• The areas of sprayable surfaces with DDT from a sample of 15
houses are measured as follows (in m²):
101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145
Example 2
• Find the variance and the standard deviation for
the frequency distribution of the data set given below.
Class   Frequency   Midpoint
5.5 – 10.5 1 8
10.5 – 15.5 2 13
15.5 – 20.5 3 18
20.5 – 25.5 5 23
25.5 – 30.5 4 28
30.5 – 35.5 3 33
35.5 – 40.5 2 38
• Solution
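One way to work through this example is sketched below in Python. It assumes the sample variance with divisor n - 1; if the course uses divisor n instead, the last step changes accordingly. The variable names are illustrative.

```python
import math

# Midpoints and frequencies from the table in Example 2.
midpoints = [8, 13, 18, 23, 28, 33, 38]
freqs = [1, 2, 3, 5, 4, 3, 2]

n = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n

# Sum of squared deviations from the mean, weighted by frequency.
ss = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints))

variance = ss / (n - 1)   # assumption: sample variance, divisor n - 1
sd = math.sqrt(variance)

print(n, mean, ss, round(variance, 2), round(sd, 2))
```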
Special properties of standard deviation /variance
\[ S_p = \sqrt{\frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}} \]
Coefficient of variation
• The standard deviation is an absolute measure of deviation of
observations around their mean and is expressed with the same
unit of the data.
• Due to this nature of the standard deviation it is not directly used
for comparison purposes with respect to variability.
• Coefficient of variation, is often used for this purpose
• The coefficient of variation (CV) is defined by:
\[ CV = \frac{S}{\bar{x}} \times 100\% \]
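A minimal Python sketch of the coefficient of variation follows. The helper name `coefficient_of_variation` is hypothetical, and the two data sets are reused from earlier examples (patient ages in years and birth weights in kilograms) only to show that CV allows a comparison across different units; the sample standard deviation (divisor n - 1) is assumed.

```python
import statistics


def coefficient_of_variation(values):
    """CV = (standard deviation / mean) * 100, as a percentage."""
    return statistics.stdev(values) / statistics.mean(values) * 100


ages_years = [35, 35, 36, 37, 37, 38, 42, 43, 43, 44, 45, 48, 48, 51, 55]
weights_kg = [2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43]

cv_ages = coefficient_of_variation(ages_years)
cv_weights = coefficient_of_variation(weights_kg)

# CV is unit-free: expressing the weights in grams gives the same CV.
cv_weights_g = coefficient_of_variation([w * 1000 for w in weights_kg])

print(round(cv_ages, 1), round(cv_weights, 1), round(cv_weights_g, 1))
```

The data set with the larger CV is the more variable one relative to its own mean.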
Examples:
1. An analysis of the monthly wages paid (in Birr) to
workers in two firms A and B belonging to the same
pharmaceutical industry gives the following results
Coefficient of variation cont…
Exercise
2. A meteorologist interested in the consistency of
temperatures in three cities during a given week collected
the following data. The temperatures for the five days of
the week in the three cities were
City 1: 25 24 23 26 17
City 2: 22 21 24 22 20
City 3: 32 27 35 24 28
Which city has the most consistent temperature, based
on these data?
When to use the coefficient of variation
• When comparison groups have very different means (CV
is suitable as it expresses the standard deviation relative to
its corresponding mean)
• When different units of measurements are involved,
e.g. group 1 unit is mm, and group 2 unit is gm (CV is
suitable for comparison as it is unit-free)
• In such cases, standard deviation should not be used for
comparison
Exercise
1. Based on the data set given below:
a. Calculate the mean, median, and mode.
b. Calculate the variance, standard deviation, and coefficient of
variation.
15, 7, 13, 9, 10, 11
2. Calculate the variance and standard deviation for the data set
given below:
5, 17, 12, 10, 8
Standard Score
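If X is a measurement from a distribution
with mean \bar{X} and standard
deviation S, then its value in standard units is
\[ Z = \frac{X - \bar{X}}{S} \]
Moments
• The r-th moment about the mean for ungrouped data is given by:
\[ m_r = \frac{\sum (X_i - \bar{X})^r}{n}, \quad r = 0, 1, 2, \dots \]
• for continuous grouped data it is given by:
\[ m_r = \frac{\sum f_i (X_i - \bar{X})^r}{n} \]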
Example:
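For the values 2, 3, and 7 (mean 4):
\[ m_1 = \frac{\sum (X_i - \bar{X})}{n} = \frac{(2-4) + (3-4) + (7-4)}{3} = 0 \]
\[ m_2 = \frac{\sum (X_i - \bar{X})^2}{n} = \frac{(2-4)^2 + (3-4)^2 + (7-4)^2}{3} = 4.67 \]
\[ m_3 = \frac{\sum (X_i - \bar{X})^3}{n} = \frac{(2-4)^3 + (3-4)^3 + (7-4)^3}{3} = 6 \]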
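The moments in this example can be verified with a short Python sketch. The function name `central_moment` is illustrative; it uses divisor n, as in the example.

```python
def central_moment(values, r):
    """r-th moment about the mean: sum((x - mean) ** r) / n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n


values = [2, 3, 7]
m1 = central_moment(values, 1)
m2 = central_moment(values, 2)
m3 = central_moment(values, 3)

print(m1, round(m2, 2), m3)
```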
Measures of shape
a. Skewness
• Skewness is the degree of asymmetry or
departure from symmetry of a distribution.
• A skewed frequency distribution is one that is not
symmetrical.
• Skewness is concerned with the shape of the curve not size.
• If the frequency curve (smoothed frequency polygon) of a
distribution has a longer tail to the right of the central
maximum than to the left, the distribution is said to be
skewed to the right or said to have positive Skewness.
Concept of skewness
• The skewness of a distribution is defined as
the lack of symmetry.
• In a symmetrical distribution, mean, median,
and mode are equal to each other.
Skewness
• If it has a longer tail to the left of the central
maximum than to the right, it is said to be
skewed to the left or said to have negative
Skewness.
• For moderately skewed distribution, the
following relation holds among the three
commonly used measures of central
tendency.
Mean - Mode = 3 × (Mean - Median)
Measures of Skewness
Karl Pearson's coefficient of skewness (Sk):
\[ S_k = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} \quad \text{or} \quad S_k = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}} \]
Remarks Related with Skewness
In a negatively skewed distribution, smaller
observations are less frequent than larger
observations i.e. the majority of the observations
have a value above an average
Kurtosis
• Kurtosis is the degree of peakedness of a distribution, usually taken
relative to a normal distribution.
The peakedness of a distribution can be classified into three types:
• Leptokurtic:
- A distribution having a relatively high peak
- A large number of observations have the same values
• Mesokurtic:
- Normal peak
- The curve is properly peaked
• Platykurtic:
- Flat topped
- A large number of observations have low frequency and are
spread across the middle interval.
Measures of kurtosis
\[ \beta_2 = \frac{m_4}{m_2^2} \]
Learning outcomes
After studying this chapter, the student will be able to:
4.1 Define basic terms in probability
4.2 Describe set theory and probability
4.3 Identify types of probability
4.4 Identify types of random variable and probability distribution
4.5 List common probability distributions and their
properties
PROBABILITY CONCEPTS
Introduction
Probability lays the foundation for statistical inference
This chapter provides a brief overview of the probability
concepts necessary for understanding topics covered in
the chapters that follow
It also provides a context for understanding the probability
distributions used in statistical inference
Basic Terms of Probability
• Probability can be defined as the chance of an event
occurring.
• Probability experiment: a process that leads to well-
defined results, or an action through which specific
results/outcomes (counts, measurements, or responses)
are obtained, but whose result cannot be predicted in advance.
Example:
• Tossing a coin and observing the face showing up is a
probability experiment.
• Outcome: It is the result of a single trial in a probability
experiment. It is also called simple event.
Example: the outcome of the sex of a newborn from a
mother in delivery room is either Male or female
Basic concepts con'td….
Exercise
Find the sample space for the gender of the children
if a family has three children. Use B for boy and G
for girl
And also find:
a. The probability of obtaining at least two girls in a
family?
b. The probability of getting at most two boys in a family?
c. The probability of getting one boy and two girls in a
family?
Types of probability
1. Classical (or theoretical) probability
It is used when each outcome in a sample space is
equally likely to occur.
That is, if an experiment has n equally likely outcomes,
then each possible outcome has probability 1/n
of occurring. Equivalently, the probability of an event E is:
P(E) = (number of outcomes in E) / (total number of outcomes in the sample space)
Types of probability cont…
2. Empirical (or statistical) probability: is based on
observations obtained from experiments /a large
number of trials or from historical data.
Example:
• A medical doctor found that out of 100,000 patients
who visited the hospital, 50 were cancer cases. What
is the probability that a patient to be examined will be
positive for cancer?
P(+ve for cancer) = 50/100,000 = 0.0005
Example 2
In a sample of 50 people, 21 had type O blood, 22 had type A
blood, 5 had type B blood, and 2 had type AB blood. Set up
a frequency distribution and find the following probabilities
a. A person has type O blood
b. A person has type A or type B blood
c. A person has neither type A nor type O blood
d. A person does not have type AB blood
Solution
Blood type Frequency
A 22
B 5
AB 2
O 21
Total 50
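Using the frequency table, the four empirical probabilities can be computed as relative frequencies. The sketch below uses exact fractions; the helper name `prob` is hypothetical.

```python
from fractions import Fraction

# Frequency distribution of blood types from the sample of 50 people.
freq = {"A": 22, "B": 5, "AB": 2, "O": 21}
total = sum(freq.values())


def prob(*blood_types):
    """Empirical probability of having any one of the given blood types."""
    return Fraction(sum(freq[t] for t in blood_types), total)


p_o = prob("O")                         # a. type O
p_a_or_b = prob("A", "B")               # b. type A or type B
p_neither_a_nor_o = 1 - prob("A", "O")  # c. neither type A nor type O
p_not_ab = 1 - prob("AB")               # d. not type AB

print(p_o, p_a_or_b, p_neither_a_nor_o, p_not_ab)
```

The blood types are mutually exclusive, so the probabilities for "A or B" simply add.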
• Union of events: The union of two events A and B, denoted
by A ∪ B, consists of all outcomes that are in A, in B, or in
both A and B.
If A and B are two events, then
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
If A and B are mutually exclusive, then P(A ∩ B) = 0 and
P(A ∪ B) = P(A) + P(B)
Example
In a hospital unit there are 8 nurses and 5 physicians; 7 nurses
and 3 physicians are females. If a staff member is selected at random,
find the probability that the subject is:
a. a nurse or a male
b. a physician or a female
Solution
Staff       Male   Female   Total
Physician   2      3        5
Nurse       1      7        8
Total       3      10       13
a. P(N ∪ M) = P(N) + P(M) − P(N ∩ M)
= 8/13 + 3/13 − 1/13
= 10/13
≈ 0.769
b. P(Ph ∪ F) = P(Ph) + P(F) − P(Ph ∩ F)
= 5/13 + 10/13 − 3/13
= 12/13
≈ 0.923
3. Subjective probability
• In this view, probability is treated as a quantifiable level of belief
ranging from 0 (complete disbelief) to 1 (complete belief)
Conditional probability
P(A|B) = P(A ∩ B) / P(B)
where P(B) ≠ 0
Special case: When events A and B are independent, then:
P(A|B) = P(A)
P(A ∩ B) = P(A)P(B)
Example
• In a certain high school class, consisting of 60 girls and 40
boys, it is observed that 24 girls and 16 boys wear
eyeglasses. If a student is picked at random from this class,
the probability that the student wears eyeglasses, P(E), is
40/100, or 0.4.
a. What is the probability that a student picked at random wears
eyeglasses, given that the student is a boy?
Solution: P(E | boy) = P(E ∩ boy) / P(boy) = (16/100) / (40/100) = 16/40 = 0.4
Factorial
For any positive integer n, n factorial, denoted
n!, is defined as:
n! = n × (n-1) × (n-2) × ... × 3 × 2 × 1
e.g. 3! = 3 × 2 × 1 = 6
5! = 5 × 4 × 3 × 2 × 1 = 120
Permutation rules
• Permutation: the number of possible
permutations is the number of different orders in
which particular events can occur. The number of
possible permutations of r objects taken from n is
P(n, r) = n! / (n - r)!
Example: 6P2 = 6! / (6 - 2)! = 6 × 5 = 30
Combination
The number of ways r objects can be chosen from a set of n
objects without considering the order of selection is
called the number of combinations of n objects taken r
at a time, denoted by
C(n, r) = n! / [r! (n - r)!]
C(8, 6) = 8! / (6! 2!) = 28
C(8, 0) = 8! / (0! 8!) = 1
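These counting rules are available in Python's standard `math` module (`math.perm` and `math.comb` require Python 3.8 or later), which gives a quick way to check hand calculations:

```python
import math

fact_3 = math.factorial(3)   # 3! = 3 x 2 x 1
fact_5 = math.factorial(5)   # 5! = 5 x 4 x 3 x 2 x 1

perm_6_2 = math.perm(6, 2)   # ordered selections of 2 from 6: 6! / 4!
comb_8_6 = math.comb(8, 6)   # unordered selections of 6 from 8: 8! / (6! 2!)
comb_8_0 = math.comb(8, 0)   # there is exactly one way to choose nothing

print(fact_3, fact_5, perm_6_2, comb_8_6, comb_8_0)
```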
Probability Distribution
• Discrete random variables: have a finite number of possible
values or an infinite number of values that can be counted
– The word counted means that they can be enumerated
using the numbers 1, 2, 3, etc
– Variables that can assume all values in the interval
between any two given values are called continuous
variables
– Continuous random variables can assume an infinite
number of values and can be decimal and fractional
values
Examples of discrete random variable:
• Toss a coin n times and count the number of heads.
• Number of car accidents per week.
• Number of defective items in a given company.
• Number of bacteria per two cubic centimeters of water.
Examples of continuous random variables:
• Height of students at a certain college.
• Mark of a student.
• Lifetime of a certain disease.
• Length of time required to complete a given training.
The probability distribution of a discrete random variable is a
table, graph, formula, or other device used to specify all
possible values of a random variable along with their
respective probabilities
Example:
Consider the experiment of tossing a coin three times. Let X be
the number of heads. Construct the probability distribution
of X
X 0 1 2 3
P(x) 1/8 3/8 3/8 1/8
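The table above can be derived by listing the sample space. A minimal Python sketch, assuming a fair coin so that all eight outcomes are equally likely (variable names are illustrative):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for three tosses: HHH, HHT, ..., TTT (8 outcomes).
outcomes = list(product("HT", repeat=3))

# X = number of heads in each outcome.
counts = Counter(outcome.count("H") for outcome in outcomes)

# P(X = x) = (number of outcomes with x heads) / (total outcomes).
distribution = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in distribution.items():
    print(x, p)
```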
Example 2:
Construct a probability distribution for rolling a single die.
Solution
Since the sample space is 1, 2, 3, 4, 5, 6 and each outcome
has a probability of 1/6, the distribution is:
X 1 2 3 4 5 6
P(x) 1/6 1/6 1/6 1/6 1/6 1/6
Two requirements for probability distribution
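• The sum of the probabilities of all events in the sample
space must be equal to 1, i.e. ΣP(x) = 1.
• The probability of each event in the sample space must be between 0 and 1, inclusive, i.e. 0 ≤ P(x) ≤ 1.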
Properties of continuous probability distribution
1.
07/09/2021 198
Introduction to expectation
Definition: the expected value (also known as the
mean) of a random variable is a measure of the
center location for the random variable.
1. Discrete R.V
$E(X) = X_1P(X_1) + X_2P(X_2) + \dots + X_nP(X_n) = \sum_{i=1}^{n} X_i\,P(X_i)$
2. Continuous R.V
$E(X) = \int_a^b x\,f(x)\,dx$
07/09/2021 199
Variance of a probability distribution
• The expected value of X is its mean
Mean of X = E(X)
• The variance of X is given by:
Variance of X = $Var(X) = E(X^2) - (E(X))^2$
where
$E(X^2) = \sum_{i=1}^{n} X_i^2\,P(X_i)$ if X is discrete
$E(X^2) = \int_x x^2 f(x)\,dx$ if X is continuous
07/09/2021 200
Example
Let X be a continuous R.V with density
$f(x) = \begin{cases} \frac{1}{2}x, & 0 < x < 2 \\ 0, & \text{otherwise} \end{cases}$
Then find
a) P (1<x<1.5)
b) E(x)
c) Var(x)
d) E (3x 2 2 x)
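Parts (a) to (c) can be checked numerically. This is a minimal sketch, assuming the density is f(x) = x/2 on 0 < x < 2 and zero elsewhere; the midpoint-rule helper `integrate` is an illustrative assumption, not the method used in the slides. Analytically the answers are 0.3125, 4/3 and 2/9.

```python
def f(x):
    """Assumed density from the example: f(x) = x/2 on (0, 2), 0 elsewhere."""
    return 0.5 * x if 0 < x < 2 else 0.0

def integrate(g, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return sum(g(a + (k + 0.5) * h) for k in range(steps)) * h

# a) P(1 < X < 1.5)
prob = integrate(f, 1.0, 1.5)
# b) E(X) = integral of x f(x) dx
mean = integrate(lambda x: x * f(x), 0.0, 2.0)
# c) Var(X) = E(X^2) - (E(X))^2
second_moment = integrate(lambda x: x * x * f(x), 0.0, 2.0)
variance = second_moment - mean ** 2

print(prob, mean, variance)
```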
07/09/2021 201
Discrete Probability Distribution
07/09/2021 202
1. Binomial Distribution
07/09/2021 203
Binomial distribution Cont..
Definition: The outcomes of a binomial experiment and the
corresponding probabilities of these outcomes are called a
binomial distribution.
If the probability of success on an individual trial is p,
then the binomial probability is defined by:
$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \dots, n$
Where:
– x = the number of successes
– p = probability of success
– n = the number of trials
– 1 − p = probability of failure
07/09/2021 204
• When using the binomial formula to solve problems, we have to
identify three things:
The number of trials (n)
The probability of a success on any one trial (p) and
The number of successes of interest (x)
07/09/2021 205
Example: Suppose that an examination consists of six true-or-false
questions, and assume that a student has no
knowledge of the subject matter. The probability that the
student will guess the correct answer to the first question is
30%. Likewise, the probability of guessing each of the
remaining questions correctly is also 30%.
a) What is the probability of getting exactly three correct
answers?
b) What is the probability of getting exactly two correct
answers?
c) What is the probability of getting at most two correct
answers?
d) What is the probability of getting less than five correct
answers?
e) Find the expected value and standard deviation.
07/09/2021 206
•Solution:
a.
b. 07/09/2021 207
•c.
d.
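The answers to parts (a) to (e) of the examination example can be computed directly from the binomial formula. This is a minimal sketch, assuming n = 6 and p = 0.30 as stated in the example; the helper name `binom_pmf` is illustrative.

```python
from math import comb, sqrt

def binom_pmf(x, n, p):
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 6, 0.30

exactly_three = binom_pmf(3, n, p)                             # a) P(X = 3)
exactly_two = binom_pmf(2, n, p)                               # b) P(X = 2)
at_most_two = sum(binom_pmf(x, n, p) for x in range(0, 3))     # c) P(X <= 2)
less_than_five = sum(binom_pmf(x, n, p) for x in range(0, 5))  # d) P(X < 5)
expected = n * p                                               # e) E(X) = np
std_dev = sqrt(n * p * (1 - p))                                # e) SD = sqrt(np(1 - p))

print(exactly_three, exactly_two, at_most_two, less_than_five, expected, std_dev)
```

The same function can be reused for the exercises on smoking during pregnancy (n = 10, p = 0.14) and allergy relief (n = 10, p = 0.80).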
07/09/2021 208
Exercise
1. Suppose 14 percent of mothers admitted to smoking one or
more cigarettes per day during pregnancy. If a random
sample of size 10 is selected from this population, what is
the probability that it will contain exactly four mothers who
admitted to smoking during pregnancy?
2. Suppose that 80% of adults with allergies report
symptomatic relief with a specific medication. If the
medication is given to 10 new patients with allergies, what is
the probability that it is effective in exactly seven? assume
that the replications are independent.
07/09/2021 209
2.Poisson distribution
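The probability distribution of a Poisson random variable X
representing the number of successes occurring in a given time
interval or a specified region of space is given by the formula:
$P(X = k) = \frac{e^{-\lambda}\lambda^{k}}{k!}, \quad k = 0, 1, 2, \dots$
Where
• k = number of successes in the interval
• λ = average number of successes per unit time
• e ≈ 2.71828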
07/09/2021 211
Example:
In a study of drug-induced anaphylaxis among patients taking
rocuronium bromide as part of their anesthesia, the
occurrence of anaphylaxis followed a Poisson distribution
with λ =12 incidents per year in Norway. Find the probability
that in the next year, among patients receiving rocuronium,
a. exactly three will experience anaphylaxis.
b. At least two will experience anaphylaxis
c. At most two experience anaphylaxis
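The three probabilities asked for can be evaluated from the Poisson formula. This is a minimal sketch, assuming λ = 12 as given in the example; the helper name `poisson_pmf` is illustrative.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return exp(-lam) * lam**k / factorial(k)

lam = 12

exactly_three = poisson_pmf(3, lam)                           # a) P(X = 3)
at_least_two = 1 - poisson_pmf(0, lam) - poisson_pmf(1, lam)  # b) P(X >= 2)
at_most_two = sum(poisson_pmf(k, lam) for k in range(3))      # c) P(X <= 2)

print(exactly_three, at_least_two, at_most_two)
```

Part (b) uses the complement rule, P(X ≥ 2) = 1 − P(X = 0) − P(X = 1), which avoids summing an infinite series.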
07/09/2021 212
•
Solution:
a.
b.
07/09/2021 213
Exercise
In a certain population an average of 13 new cases of
esophageal cancer are diagnosed each year. If the annual
incidence of esophageal cancer follows a Poisson
distribution, find the probability that in a given year the
number of newly diagnosed cases of esophageal cancer will
be:
A. Exactly 10 cases
B. At least three cases
C. No more than 3
D. Between nine and 12, inclusive
E. Fewer than two
07/09/2021 214
CONTINUOUS PROBABILITY DISTRIBUTIONS
07/09/2021 215
Normal distribution
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$
where
• X is a normal random variable,
• μ is the mean
• σ is the standard deviation
• π is approximately 3.14159, and e is approximately 2.71828.
• The random variable X in the normal equation is called the
normal random variable.
07/09/2021 216
Characteristics of Normal Distribution
• It links frequency distribution to probability distribution
• Has a Bell Shape Curve and is Symmetric
• It is Symmetric around the mean: Two halves of the
curve are the same (mirror images)
• Hence Mean = Median=mode
• The total area under the curve is 1 (or 100%)
• Normal Distribution has the same shape as Standard
Normal Distribution.
07/09/2021 217
Normal Curve
• The graph of the normal distribution depends on two factors:
the mean and the standard deviation.
• The mean of the distribution determines the location of the center of the
graph, and the standard deviation determines the height and width of the
graph.
• When the standard deviation is large, the curve is short and wide; when
the standard deviation is small, the curve is tall and narrow.
• All normal distributions look like a symmetric, bell-shaped curve.
07/09/2021 218
Standard Normal Distribution
• It makes life a lot easier for us if we standardize our normal
curve, with a mean of zero and a standard deviation of 1
unit.
• We can transform all the observations of any normal random
variable X with mean μ and variance σ² to a new set of
observations of another normal random variable Z with mean
0 and variance 1 using the following transformation:
$Z = \frac{X - \mu}{\sigma}$
07/09/2021 219
• About 95% of the area under the curve falls within 2
standard deviations of the mean
• About 99.7% of the area under the curve falls within 3
standard deviations of the mean
• A graph of this standardized (mean 0 and variance 1) normal
curve is given in Graph:
07/09/2021 220
Probability and Normal Distributions
07/09/2021 222
Table of normal distribution
• Example 1: Suppose we want to compute the area
under the standard normal curve to the left of z = 1.45.
• This area is the probability that Z is less than 1.45.
• The probability can be read from the normal table by combining
the value 1.4 in the first column with 0.05 in the first
row.
• The shaded area in the diagram is the area to the left of
z = 1.45, that is, all values below 1.45 standard deviations
above the mean.
• The area of this shaded portion is 0.9265 (or 92.65% of the
total area under the curve).
07/09/2021 223
07/09/2021 224
Example:
Find the area to the left of z = 2.06
Solution
Step 1: Draw the figure
07/09/2021 225
Step2: We are looking for the area under the standard normal
distribution to the left of z = 2.06, It is 0.9803. Hence, 98.03%
of the area is less than z = 2.06.
07/09/2021 226
Find the area between z = 1.68 and z =-1.37.
Solution
Step 1: Draw the figure as shown.
Step 2 Since the area desired is between two given z values, look up
the areas
corresponding to the two z values and subtract the smaller area from the
larger area. (Do not subtract the z values.) The area for z=1.68 is 0.9535,
and the area for z= -1.37 is 0.0853. The area between the two z values is
0.9535 - 0.0853 = 0.8682 or 86.82%
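The table look-ups in these examples can be reproduced without a printed table. This is a minimal sketch, assuming the standard normal cumulative distribution function is computed from the error function in Python's `math` module; the name `phi` is illustrative.

```python
from math import erf, sqrt

def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

left_of_145 = phi(1.45)             # area to the left of z = 1.45
left_of_206 = phi(2.06)             # area to the left of z = 2.06
between = phi(1.68) - phi(-1.37)    # area between z = -1.37 and z = 1.68

print(round(left_of_145, 4), round(left_of_206, 4), round(between, 4))
```

As in the worked solution, the area between two z values is found by subtracting the smaller cumulative area from the larger one, not by subtracting the z values.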
07/09/2021 227
Example:
For subject A, a 27-year-old female, the ammonia concentration
in parts per billion (ppb) followed a normal distribution over 30
days with mean 491 and standard deviation 119.What is the
probability that on a random day, the subject’s ammonia
concentration is between 292 and 649 ppb?
Solution:
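We find the z value corresponding to an x of 292 by
$z = \frac{292 - 491}{119} = -1.67$
Similarly, for x = 649, $z = \frac{649 - 491}{119} = 1.33$.
From the table, the area to the left of z = −1.67 is 0.0475 and
the area to the left of z = 1.33 is 0.9082.
07/09/2021 228
The area desired is the difference between these, 0.9082
− 0.0475 = 0.8607.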
Exercise:
1. For another subject (a 29-year-old male), the acetone levels
were normally distributed with a mean of 870 and a standard
deviation of 211 ppb. Find the probability that on a given day
the subject’s acetone level is:
a. Between 600 and 1000 ppb
b. Over 900 ppb
c. Under 500 ppb
d. Between 900 and 1100 ppb
07/09/2021 229
2. If the total cholesterol values for a certain population are
approximately normally distributed with a mean of 200
mg/100 ml and a standard deviation of 20 mg/100 ml, find the
probability that an individual picked at random from this
population will have a cholesterol value:
a. Between 180 and 200 mg/100 ml
b. Greater than 225 mg/100 ml
c. Less than 150 mg/100 ml
d. Between 190 and 210 mg/100 ml
07/09/2021 230
Student t-distribution
• It is often the case that one wants to calculate the size
of sample needed to obtain a certain level of confidence
in survey results.
• Unfortunately, this calculation requires prior knowledge
of the population standard deviation σ.
• Realistically, σ is unknown
• Often a preliminary sample will be conducted so that a
reasonable estimate of this critical population parameter
can be made.
• If such a preliminary sample is not made, but
confidence intervals for the population mean are to be
constructed using an unknown σ, then the distribution
known as the Student t distribution can be used.
07/09/2021 231
Student’s t-distribution cont…
• Suppose we have a simple random sample of size n
drawn from a Normal population with mean μ and
standard deviation σ. Let us denote the sample mean
by $\bar{x}$ and sample standard deviation by s, then the
quantity:
$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$
has a t distribution with n − 1 degrees of freedom.
07/09/2021 232
Some properties of t-distribution are;
The t distribution shares some characteristics of the normal
distribution and differs from it in others. The t distribution is
similar to the standard normal distribution in these ways:
1. It is bell-shaped.
2. It is symmetric about the mean.
3. The mean, median, and mode are equal to 0 and are located
at the center of the distribution.
4. It converges to the standard normal distribution as the sample size gets
large.
5. The curve never touches the x axis.
07/09/2021 233
The t distribution differs from the standard normal distribution in
the following ways:
The variance is greater than 1.
The t distribution is actually a family of curves based on
the concept of degrees of freedom, which is related to
sample size.
As the sample size increases, the t distribution
approaches the standard normal distribution
07/09/2021 234
Thank you very much!!!
07/09/2021 235
University of Gondar
College of medicine and health science
Department of Epidemiology and Biostatistics
University
of Gondar, Ethiopia
236
Objectives
• After completing this session you will be able to
perform:
– Parameter estimations
– Point estimate
– Confidence interval
– Hypothesis testing
– Z-test
– T-test
– Analysis of variance
237
Introduction # 1
• Inferential statistics is the process of generalizing or
drawing conclusions about the target
population on the basis of results obtained
from a sample.
238
Introduction #2
• Before beginning statistical analyses, it is essential to
examine the distribution of the variable for:
– skewness (tails),
– kurtosis (peaked or flat distribution),
– spread (range of the values), and
– outliers (data values separated from the rest of the data).
• Information about each of these characteristics
determines which statistical analyses to choose and how the
results can be accurately explained and interpreted.
239
Sampling Distribution
240
Sampling distribution .......
•
243
Sampling Distribution ..........
•
244
Standard deviation and Standard error
249
Point Estimation
• $p = \frac{x}{n}$
250
Example
•
Some BLUE estimators
252
Interval Estimation
• However, the value of the sample statistic will vary from
sample to sample; therefore, simply obtaining an
estimate of the single value of the parameter is not
generally acceptable.
– We also need a measure of how precise our estimate is likely
to be.
– We need to take into account the sample-to-sample variation
of the statistic.
• A confidence interval defines an interval within which
the true population parameter is likely to fall (interval
estimate).
253
Confidence Intervals…
255
Confidence interval ……
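$\left[\bar{x} - z_{\alpha/2}\cdot\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\cdot\frac{\sigma}{\sqrt{n}}\right]$
$\left[p - z_{\alpha/2}\cdot\sqrt{p(1-p)/n},\ p + z_{\alpha/2}\cdot\sqrt{p(1-p)/n}\right]$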
256
Interval estimation
257
258
259
260
Confidence intervals…
• The 95% confidence interval is calculated in such a way that, under the
conditions assumed for underlying distribution, the interval will contain true
population parameter 95% of the time.
• Loosely speaking, you might interpret a 95% confidence interval as one which
you are 95% confident contains the true parameter.
• 90% CI is narrower than 95% CI since we are only 90% certain that the interval
includes the population parameter.
• On the other hand 99% CI will be wider than 95% CI; the extra width meaning
that we can be more certain that the interval will contain the population
parameter. But to obtain a higher confidence from the same sample, we must
be willing to accept a larger margin of error (a wider interval).
261
Confidence intervals…
•
263
Confidence interval for a single mean
CI = $\bar{x} \pm z_{\alpha/2}\cdot\frac{\sigma}{\sqrt{n}}$
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
265
Confidence interval ……
df t0.100 t0.050 t0.025 t0.010 t0.005
9 1.383 1.833 2.262 2.821 3.250
10 1.372 1.812 2.228 2.764 3.169
11 1.363 1.796 2.201 2.718 3.106
12 1.356 1.782 2.179 2.681 3.055
13 1.350 1.771 2.160 2.650 3.012
14 1.345 1.761 2.145 2.624 2.977
15 1.341 1.753 2.131 2.602 2.947
16 1.337 1.746 2.120 2.583 2.921
17 1.333 1.740 2.110 2.567 2.898
18 1.330 1.734 2.101 2.552 2.878
19 1.328 1.729 2.093 2.539 2.861
20 1.325 1.725 2.086 2.528 2.845
21 1.323 1.721 2.080 2.518 2.831
22 1.321 1.717 2.074 2.508 2.819
23 1.319 1.714 2.069 2.500 2.807
24 1.318 1.711 2.064 2.492 2.797
25 1.316 1.708 2.060 2.485 2.787
26 1.315 1.706 2.056 2.479 2.779
27 1.314 1.703 2.052 2.473 2.771
28 1.313 1.701 2.048 2.467 2.763
29 1.311 1.699 2.045 2.462 2.756
30 1.310 1.697 2.042 2.457 2.750
40 1.303 1.684 2.021 2.423 2.704
60 1.296 1.671 2.000 2.390 2.660
120 1.289 1.658 1.980 2.358 2.617
∞ 1.282 1.645 1.960 2.326 2.576
[Figure: t density f(t) for df = 10, with ±1.372 and ±2.228 marked on the t axis; the area in each tail beyond ±2.228 is 0.025]
Whenever σ is not known (and the population is assumed normal), the correct distribution to use is
the t distribution with n-1 degrees of freedom.
Note, however, that for large degrees of freedom, the t distribution is approximated well by the Z
distribution.
267
Point and Interval Estimation of the Population Proportion (p)
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{0.295}{16} = 0.01844$
df t0.100 t0.050 t0.025 t0.010 t0.005
1 3.078 6.314 12.706 31.821 63.657
. . . . . .
13 1.350 1.771 2.160 2.650 3.012
14 1.345 1.761 2.145 2.624 2.977
15 1.341 1.753 2.131 2.602 2.947
. . . . . .
The critical value of t for df = (n -1) = (15 -1) = 14 and a right-tail area of 0.025 is:
$t_{0.025} = 2.145$
The corresponding confidence interval or interval estimate is:
$\bar{x} \pm t_{0.025}\,\frac{s}{\sqrt{n}} = 10.37 \pm 2.145 \cdot \frac{3.5}{\sqrt{15}} = 10.37 \pm 1.94 = [8.43,\ 12.31]$
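The interval [8.43, 12.31] can be reproduced step by step. This is a minimal sketch, assuming a sample mean of 10.37, a sample standard deviation of 3.5 (the value consistent with the stated margin of 1.94), n = 15, and the tabulated critical value t = 2.145 for 14 degrees of freedom; the function name is illustrative.

```python
from math import sqrt

def t_confidence_interval(mean, s, n, t_crit):
    """Confidence interval for a mean: mean +/- t_crit * s / sqrt(n)."""
    margin = t_crit * s / sqrt(n)
    return mean - margin, mean + margin

# t_crit is read from the t table (df = 14, right-tail area 0.025).
lower, upper = t_confidence_interval(mean=10.37, s=3.5, n=15, t_crit=2.145)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

The critical value is passed in rather than computed because the Python standard library has no t quantile function; it has to come from a table or a statistics package.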
272
Example 3:
273
Example 4:
In a sample of 400 people who were questioned regarding their participation in sports,
160 said that they did participate. Construct a 98% confidence interval for P, the
proportion of people in the population who participate in sports.
Solution:
Let X be the number of people who participate in sports.
X = 160, n = 400, α = 0.02. Hence
$z_{\alpha/2} = z_{0.01} = 2.33$
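The remaining arithmetic for this example can be sketched as follows, assuming the usual large-sample interval for a proportion, p ± z·sqrt(p(1 − p)/n), with X = 160, n = 400 and z = 2.33; the function name `proportion_ci` is illustrative.

```python
from math import sqrt

def proportion_ci(x, n, z_crit):
    """Large-sample confidence interval for a population proportion."""
    p_hat = x / n
    margin = z_crit * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - margin, p_hat + margin

p_hat, lower, upper = proportion_ci(x=160, n=400, z_crit=2.33)
print(f"p = {p_hat:.2f}, 98% CI: ({lower:.4f}, {upper:.4f})")
```

Under these assumptions the sample proportion is 0.40 and the 98% interval is roughly 0.343 to 0.457.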
Introduction
– Researchers are interested in answering many types of
questions. For example, A physician might want to
know whether a new medication will lower a person’s
blood pressure.
276
Idea of hypothesis testing
277
type of Hypotheses
• Null hypothesis (represented by H0) is the statement about the value of the
population parameter. That is, the null hypothesis postulates that ‘there is no
difference between factor and outcome’ or ‘there is no intervention effect’.
• Alternative hypothesis (represented by HA) states the ‘opposing’ view that
‘there is a difference between factor and outcome’ or ‘there is an intervention
effect’.
278
Methods of hypothesis testing
1. Identify the null hypothesis H0 and the alternate hypothesis HA.
2. Choose α. The value should be small, usually less than 10%. It is important to
consider the consequences of both types of errors.
3. Select the test statistic and determine its value from the sample data. This value
is called the observed value of the test statistic. Remember that a t statistic is
usually appropriate for a small number of samples; for a larger number of samples, a
z statistic can work well if data are normally distributed.
4. Compare the observed value of the statistic to the critical value obtained for the
chosen α.
5. Make a decision.
6. Conclusion
280
Test Statistics
Test statistic = $\frac{\text{Observed value} - \text{Hypothesized value}}{\text{Standard error}}$
• The known distributions are the Normal distribution, Student’s t distribution,
Chi-square distribution ….
281
Critical value
• The critical value separates the critical region from the noncritical region
for a given level of significance
282
Decision making
• Accept or Reject the null hypothesis
• There are 2 types of errors
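[Figure: rejection regions and critical value(s) for each type of test]
One tailed test (left): H0: μ = μ0, H1: μ < μ0; rejection region of area α in the left tail
One tailed test (right): H0: μ = μ0, H1: μ > μ0; rejection region of area α in the right tail
Two tailed test: H0: μ = μ0, H1: μ ≠ μ0; rejection regions of area α/2 in each tail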
Hypothesis testing about a Population mean
(μ)
1. $H_0: \mu = \mu_0$
$H_A: \mu \ne \mu_0$
$z_{cal} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
$z_{tabulated} = z_{\alpha/2}$ for two tailed test
289
One tailed tests
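2. $H_0: \mu = \mu_0 \ (\mu \le \mu_0)$
$H_A: \mu > \mu_0$
$z_{cal} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$, $z_{tabulated} = z_{\alpha}$ for one tailed test
Decision: if $z_{cal} > z_{tab}$, reject $H_0$;
if $z_{cal} \le z_{tab}$, do not reject $H_0$
3. $H_0: \mu = \mu_0 \ (\mu \ge \mu_0)$
$H_A: \mu < \mu_0$
Decision: if $z_{cal} < -z_{tab}$, reject $H_0$;
if $z_{cal} \ge -z_{tab}$, do not reject $H_0$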
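The decision rules for a one-sample z test can be sketched in code. This is a minimal sketch under stated assumptions: the population standard deviation σ is known, the right-tailed test rejects when z exceeds the tabulated value, and the left-tailed test rejects when z falls below its negative. The function name `z_test` and the numbers in the usage lines are hypothetical and do not come from the slides.

```python
from math import sqrt

def z_test(xbar, mu0, sigma, n, z_tab, tail="two"):
    """Return (z_cal, reject_H0) for a one-sample z test with known sigma."""
    z_cal = (xbar - mu0) / (sigma / sqrt(n))
    if tail == "two":        # HA: mu != mu0, z_tab = z_(alpha/2)
        reject = abs(z_cal) > z_tab
    elif tail == "right":    # HA: mu > mu0, z_tab = z_alpha
        reject = z_cal > z_tab
    elif tail == "left":     # HA: mu < mu0, z_tab = z_alpha
        reject = z_cal < -z_tab
    else:
        raise ValueError("tail must be 'two', 'right' or 'left'")
    return z_cal, reject

# Hypothetical data: sample mean 52, hypothesized mean 50, sigma 8, n = 64.
two_sided = z_test(52, 50, 8, 64, z_tab=1.96, tail="two")
left_sided = z_test(52, 50, 8, 64, z_tab=1.645, tail="left")
print(two_sided, left_sided)
```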
290
The P- Value
291
P-value……
• A p-value is the probability of getting the observed
difference, or one more extreme, in the sample purely
by chance from a population where the true
difference is zero.
• When the p-value is less than 0.05, we often say that the
result is statistically significant.
295
Hypothesis testing for single population mean
298
Hypothesis testing for single proportions
• If the sample size is small (if np<5 and n(1-p)<5) then use student’s
t- statistic for the tabulated value of the test statistic.
301
Statistical Inference Based on
Two Samples
302
Statistical Inferences Based on Two Samples
•
Comparing Two Population Means;
Independent Samples, Vars Known cont’d…
• $\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
Comparing Two Population Means;
Ind’t Samples, Vars Known cont’d…
$z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$ ; where Do = (µ1 – µ2)o
Hypothesis testing for two sample means
$z_{cal} = \frac{(\bar{x} - \bar{y}) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
307
Hypothesis …
308
Example 8:
$H_0: \mu_1 - \mu_2 = 0$
$H_A: \mu_1 - \mu_2 \ne 0$
309
SOLUTION
( x y ) ( 1 2 ) ( 4.3 3.4) 0
z cal
2
2
2.9 2 3.5 2
1
2
n1 n2 12 15
1.6 1.6
5.33
1.5178 1.23
z z 0.025 1.96
2
310
Comparing Two Population Means;
case-2: Independent Samples, Variances Unknown
•
Comparing Two Population Means;
Ind’t Samples, Vars Unknown cont’d…
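$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
• $(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}(n_1 + n_2 - 2)\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$
$t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$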
Comparing Two Population Means;
Ind’t Samples, Vars Unknown cont’d…
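• $z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
$t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$ with $df = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$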
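The two unknown-variance cases can be compared side by side in code. This is a minimal sketch, assuming the pooled-variance t statistic for equal population variances and the unequal-variance t statistic with approximate degrees of freedom; the function names and the numbers in the usage lines are hypothetical and do not come from the slides.

```python
from math import sqrt

def pooled_variance(s1_sq, n1, s2_sq, n2):
    """Pooled estimate of a common variance from two sample variances."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

def pooled_t(x1, x2, s1_sq, n1, s2_sq, n2, d0=0.0):
    """t statistic assuming equal variances; df = n1 + n2 - 2."""
    sp_sq = pooled_variance(s1_sq, n1, s2_sq, n2)
    return (x1 - x2 - d0) / sqrt(sp_sq * (1 / n1 + 1 / n2))

def unequal_var_t(x1, x2, s1_sq, n1, s2_sq, n2, d0=0.0):
    """t statistic and approximate df when variances are not assumed equal."""
    v1, v2 = s1_sq / n1, s2_sq / n2
    t = (x1 - x2 - d0) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# Hypothetical summary data: means 20 and 18, variances 4 and 9, n = 10 and 15.
t_pooled = pooled_t(20, 18, 4, 10, 9, 15)
t_unequal, df_unequal = unequal_var_t(20, 18, 4, 10, 9, 15)
print(t_pooled, t_unequal, df_unequal)
```

The degrees of freedom from the unequal-variance formula are usually not a whole number; a common convention is to round down before consulting the t table.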
Comparing Two Population Means;
Case-3: Paired/matched/repeated sampling
•
Paired sampling cont’d…
•
Paired sampling cont’d…
•
Hypothesis testing for two proportions
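$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$
$z = \frac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sigma_{\hat{p}_1 - \hat{p}_2}}$
where $D_0 = (P_1 - P_2)_0$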
Hypothesis testing for two proportions
$z_{cal} = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}}$
323
Small sample size
•
324
Comparing Two Population Proportions cont’d…
• Solution: The estimation (CI) for the difference of population proportions should be
formed using the following formula (for a 95% confidence interval):
A.
Where ≈ 0.005.
$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} = 0.019 \pm 1.96 \times 0.0035$
ANOVA
327
Introduction
Here in the case of two independent sample t-test, we
have one continuous dependent variable (interval/ratio
data) and;
331
Analysis of variance cont…
The ANOVA uses data from all groups at a time to estimate standard
errors, which can increase the power of the analysis
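Assumptions of One Way ANOVA
One way model: Data are deviations from treatment means, $A_i$s:
$X_{ij} = \mu + A_i + \varepsilon_{ij}$
[Figure: observations plotted around treatment means A1 and A2] Sum of vertical deviations squared = SSe
Within SS = $\sum_{j=1}^{a}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2 = \sum_i\sum_j x_{ij}^2 - \sum_j\left(\sum_i x_{ij}\right)^2/n$ = SSW
Between SS = $\sum_{j=1}^{a}\sum_{i=1}^{n}(\bar{x}_j - \bar{x})^2 = \sum_j\left(\sum_i x_{ij}\right)^2/n - \left(\sum_i\sum_j x_{ij}\right)^2/(na)$ = SSB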
This is assuming each of the ‘a’ groups has equal size, ‘n’.
337
Data of one way ANOVA
Groups/variable
G-1 G-2 G-3 ….. G-a
X11 X12 X13 ….. X1a
X21 X22 X23 ….. X2a
Participants
Computational formula
$T = \sum_i\sum_j x_{ij}^2$; Correction Factor = CF = $\left(\sum_i\sum_j x_{ij}\right)^2/(na) = T_{..}^2/(na)$
$A = \sum_j\left(\sum_i x_{ij}\right)^2/n = \sum_j T_{.j}^2/n$ if the group (cell) sizes are equal, or
$A = \sum_j\left(\sum_i x_{ij}\right)^2/n_j = \sum_j T_{.j}^2/n_j$ if the group sizes are unequal
Source of variation    df      SS             MS            F
Between groups         a-1     SSB = A - CF   SSB/(a-1)     MSB/MSW
Within groups          na-a    SSW = T - A    SSW/(na-a)
Total                  na-1    SST = T - CF
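The computational formulas and the ANOVA table can be traced in a few lines of code. This is a minimal sketch, assuming a groups of equal size n as in the formulas above; the function name `one_way_anova` and the data in the usage lines are hypothetical.

```python
def one_way_anova(groups):
    """One-way ANOVA table entries for equal-sized groups (list of lists)."""
    a = len(groups)          # number of groups
    n = len(groups[0])       # observations per group
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes equal group sizes")

    T = sum(x * x for g in groups for x in g)      # sum of squared observations
    grand_total = sum(sum(g) for g in groups)
    CF = grand_total**2 / (n * a)                  # correction factor
    A = sum(sum(g) ** 2 for g in groups) / n       # sum of squared group totals / n

    ssb, ssw, sst = A - CF, T - A, T - CF
    msb = ssb / (a - 1)
    msw = ssw / (n * a - a)
    return {"SSB": ssb, "SSW": ssw, "SST": sst, "MSB": msb, "MSW": msw, "F": msb / msw}

# Hypothetical data: three groups of three observations each.
table = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(table)
```

The F value is then compared with the tabulated F for (a − 1, na − a) degrees of freedom; the look-up itself is left out because it needs an F table or a statistics package.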
Total 21 55232
Since the P-value is less than 0.05, the null hypothesis is rejected
Pair-wise comparisons of group means post hoc tests or multiple comparisons
Least Significant Difference (LSD) is the most liberal of the post hoc
tests and has a high Type I error rate. It is simply multiple t-tests.
That is, it takes into account sample size as well as the number
of tests being performed.
ANOVA is a parametric test, examining whether the means differ between 2 or more populations.
It generates a test statistic F, which can be thought of as a signal-to-noise ratio. Thus large values of F indicate a high degree of pattern within the data and imply rejection of H0.
354
University of Gondar
College of medicine and health science
Department of Epidemiology and Biostatistics
• December, 2018
University
of Gondar, Ethiopia
355
Sampling
It is not easy to collect all the information
about population and also it is not possible to
study the characteristics of the entire
population (finite or infinite) due to time factor,
cost factor and other constraints.
Thus we need a sample.
Sample is a finite subset of statistical
individuals in a population and the number of
individuals in a sample is called the sample size.
07/09/2021 Wullo S.(MPH) 356
Sample Information
Population
07/09/2021 357
Common terms used in sampling
• Population: it is the collection of all items
of interest.
• Sampling: It is the method by which we
select a sample from the population
• Reference population (or target
population): the population of interest to
whom the researchers would like to make
generalizations.
07/09/2021 361
Error in Sampling cont---
2. Non Sampling Error (Measurement Error)
It is a type of systematic error in the design or conduct
of a sampling procedure which results in distortion of
the sample, so that it is no longer representative of the
reference population.
We can eliminate or reduce the non-sampling error
(bias) by careful design of the sampling procedure and
not by increasing the sample size.
It can occur whether the total study population or a
sample is being used.
07/09/2021 362
Sampling Methods
Two broad divisions:
A. Probability sampling methods
B. Non-probability sampling methods
A. Probability sampling methods
• Involves random selection of a sample
• Every sampling unit has a known and non-zero probability
of selection into the sample.
• Involves the selection of a sample from a population,
based on chance.
07/09/2021 363
Types of Sampling Methods
Samples
Method
Probability Samples
Non-Probability
Samples
Convenience
Multistage Random Sampling
Quota
07/09/2021 364
1. Simple random sampling
• The required number of individuals are selected at random from the
sampling frame, a list or a database of all individuals in the population .
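Drawing a simple random sample from a sampling frame can be sketched as follows. This is a minimal sketch, assuming the frame is available as a list of identifiers; the function name, the ID format and the seed are hypothetical and only serve the illustration.

```python
import random

def simple_random_sample(sampling_frame, sample_size, seed=None):
    """Select sample_size units from the frame at random, without replacement."""
    rng = random.Random(seed)   # a fixed seed makes the draw reproducible
    return rng.sample(sampling_frame, sample_size)

# Hypothetical sampling frame of 200 individuals labelled ID001 to ID200.
frame = [f"ID{k:03d}" for k in range(1, 201)]
sample = simple_random_sample(frame, 20, seed=42)
print(sample)
```

Because the draw is without replacement, no individual can appear twice, and every individual in the frame has the same chance of being selected.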
07/09/2021 369
Example
• In a school based study, we assume students of
the same school are homogeneous.
07/09/2021 370
5. Multi-stage sampling
• Similar to the cluster sampling, except that it
involves picking a sample from within each
chosen cluster, rather than including all units in
the cluster.
• This type of sampling requires at least two stages.
• The primary sampling unit (PSU) is the sampling
unit in the first sampling stage.
• The secondary sampling unit (SSU) is the
sampling unit in the second sampling stage, etc.
07/09/2021 371
Woreda PSU
Kebele SSU
Sub-Kebele TSU
HH
07/09/2021 372
B. Non-probability sampling
• In non-probability sampling, every item has an
unknown chance of being selected.
07/09/2021 374
1. Convenience or haphazard sampling
• Convenience sampling is sometimes referred to as
haphazard or accidental sampling.
380
How many people to study?
381
If too many….
• Waste of resources!
382
If too few….
• May fail to detect an important effect
383
Sample size …
• Which variables should be included in sample size
calculation?
It should relate to the study’s primary outcome variable
If the study has secondary outcome variables which
are considered important, the sample size should also
be sufficient for the analysis of these variables.
• Answer depends on:
– How different or dispersed the population is.
– Desired level of confidence.
– Desired degree of accuracy.
– Desired margin of error
384
How do we calculate a sample size?
385
1. Rules of thumb approach
Different Views:
1. The larger the population size, the smaller the percentage of
the population required to get a representative sample
2. For smaller populations (N < 100), there is little point in
sampling. Survey the entire population.
3. If the population size is around 500, 50% should be sampled.
4. If the population size is around 1500, 20% should be
sampled.
5. Statistician – máxima list – at least 500
6. To make generalizations about entire population, need a
total sample size of 200-400
386
Some Considerations
07/09/2021 387
Summary
• Large-scale descriptive studies almost always
use probability-sampling techniques.
• Intervention studies sometimes use
probability sampling but also frequently use
non-probability sampling.
• Qualitative studies almost always use
non-probability samples.
07/09/2021 388