Chapter 1 Introduction

University of Gondar

College of medicine and health science


Department of Epidemiology and Biostatistics

Basic Biostatistics

Wullo S. (MPH)

7/9/21 1
Chapter One
1.1 Introduction to Biostatistics

 Objectives of the chapter


 After completing this chapter, the student will be able to:
– Define Statistics and Biostatistics
– Identify the branches of biostatistics
– Enumerate the importance and limitations of biostatistics
– Define and Identify the different types of data and
understand why we need to classify variables

Definition and classification of Biostatistics

• Statistics is the science of
  • collecting,
  • organizing,
  • presenting,
  • analysing, and drawing conclusions (inferences) from
    data for the purpose of making decisions.

 Biostatistics: The application of statistical methods


to the fields of biological and health sciences.

Classification of Biostatistics

Descriptive biostatistics
 A statistical method that is concerned with the collection,
organization, summarization, and analysis of data from a
sample of population.
Inferential biostatistics
• A statistical method concerned with drawing conclusions (inferences)
about a particular population by selecting and measuring a random
sample from that population.
Cont…
Biostatistics
• Descriptive statistics: collecting, organizing, summarizing and presenting data
• Inferential statistics: making inferences, hypothesis testing, determining relationships, making predictions
Descriptive Biostatistics

• Some statistical summaries which are especially common in


descriptive analyses are:
 Measures of central tendency
 Measures of dispersion
 Measures of association
 Cross-tabulation, contingency table
 Histogram
 Quantile, Q-Q plot
 Scatter plot
 Box plot

Inferential Biostatistics

1.2  Stages in statistical investigation
There are five stages or steps in any statistical investigation.
1. Collection of data
 The process of obtaining measurements or counts.
2. Organization of data
 Includes editing, classifying, and tabulating the data
collected.
3. Presentation of data:
 overall view of what the data actually looks like.
 facilitate further statistical analysis.
 Can be done in the form of tables and graphs or diagrams.

Cont…
4. Analysis of data
 To dig out useful information for decision making
 It involves extracting relevant information from the data
(like mean, median, mode, range, variance…),
5. Interpretation of data
 Concerned with drawing conclusions from the data
collected and analyzed; and giving meaning to analysis
results.
 A difficult task and requires a high degree of skill and
experience.

1.3 Definition of Some Basic terms

Population: is the complete set of possible measurements for which


inferences are to be made.

Census: a complete enumeration of the population. But in most real


problems it cannot be realized, hence we take sample.
Sample: A sample from a population is the set of measurements that are
actually collected in the course of an investigation.
Parameter: Characteristic or measure obtained from a population.

Statistic: A statistic (as opposed to the field of Statistics) is a
numerical quantity computed from sample data (e.g. the mean, the
median, the maximum).
Parameter and statistic

Cont...
Sampling: The process or method of sample selection from the
population.
Sample size: The number of elements or observations to be
included in the sample.
Variable: A characteristic or attribute that can assume different
values in different persons, places, or things.
Some examples of variables include:
 Diastolic blood pressure,
 heart rate, heights,
 The weights
Data: Refers to a collection of facts, values, observations, or
measurements that the variables can assume.

Uses of statistics:

The main function of statistics is to enlarge our knowledge of
complex phenomena. The following are some uses of statistics:
• Estimating unknown population characteristics.
• Formulating and testing hypotheses.
• Studying the relationship between two or more variables.
• Forecasting future events.
• Measuring the magnitude of variation in data.
• Furnishing a technique of comparison.
Limitations of statistics

As a science, statistics has its own limitations. The following are
some of them:
• It deals only with quantitative information.
• It deals only with aggregates of facts, not with individual data
items.
• Statistical results are only approximately, not mathematically,
correct.
• Statistics can easily be misused and should therefore be used
by experts.
1.5 Types of Variables and Measurement Scales
A variable is a characteristic or attribute that can assume
different values in different persons, places, or things.
Examples :
 age,
 diastolic blood pressure,
 heart rate,
 the height of adult males,
 the weights of preschool children,
 gender of Biostatistics students,
 marital status of instructors at University of Gondar,
 ethnic group of patients

A. Depending on the characteristic of the measurement, a variable can be:
• Qualitative (categorical) variable
  • A variable or characteristic which cannot be measured in
  quantitative form but can only be identified by name or category,
  • for instance place of birth, ethnic group, type of drug, stage of
  breast cancer (I, II, III, or IV), degree of pain (minimal, moderate,
  severe or unbearable).
  • The categories should be clear cut, not overlapping, and cover all the
  possibilities. For example, sex (male or female), vital status (alive or
  dead), disease stage (depends on disease), ever smoked (yes or no).
Quantitative(Numerical) variable:
 is one that can be measured and expressed numerically.
Example: survival time, systolic blood pressure, number of
children in a family, height, age, body mass index.
 they can be of two types
Discrete Variables
 Have a set of possible values that is either finite or
countably infinite.
 The values of a discrete variable are usually whole
numbers.
 Numerical discrete data occur when the observations are
integers that correspond with a count of some sort.

Some common examples are:

• Number of pregnancies,
• The number of bacteria colonies on a plate,
• The number of cells within a prescribed area upon microscopic
examination,
• The number of heart beats within a specified time interval,
• A mother's history of number of births (parity) and
pregnancies,
• The number of episodes of illness a patient experiences during
some time period, etc.
Continuous Variables

 A continuous variable has a set of possible values including


all values in an interval of the real line.
 No gaps between possible values.
 Each observation theoretically falls somewhere along a
continuum.
Example: body mass index, height, blood pressure, serum
cholesterol level, weight, age etc.

Cont…
 Observations are not restricted to take on certain numerical
values: Often measurements (e.g., height, weight, age).
 Continuous data are used to report a measurement of the
individual that can take on any value within an acceptable
range.

Nominal Scale

Level of measurement which classifies data into mutually exclusive, all


inclusive categories in which no order or ranking can be imposed on
the data.
 Assign subjects to groups or categories

 No order or distance relationship


 No arithmetic origin

 Only count numbers in categories


 Only present percentages of categories
 Chi-square most often used test of statistical significance
Nominal Scale

Other Examples
Sex Social status
Marital status Days of the week (months)
Geographic location Seasons
Ethnic group Types of restaurants
Brand choice Religion
Job type : executive, technical, clerical

Coded as “0”    Coded as “1”
Ordinal Scale
Level of measurement which classifies data into categories that can be
ranked. The size of the differences between ranks is not defined.

Classifies data according to some order or rank


With ordinal data, it is fair to say that one
response is greater or less than another.

E.g. if people were asked to rate the hotness of 3 chili


peppers, a scale of "hot", "hotter" and "hottest"
could be used. Values of "1" for "hot", "2" for
"hotter" and "3" for "hottest" could be assigned.

Ordinal Scales

• Arithmetic operations are not applicable but relational


operations are applicable.
• Ordering is the sole property of ordinal scale.
Examples:
Letter grades (A, B, C, D, F).
Rating scales (Excellent, Very good, Good, Fair, poor).
Military status.

Interval Scales
• Level of measurement which classifies data that can be ranked
and differences are meaningful. However, there is no meaningful
zero, so ratios are meaningless.
• All arithmetic operations except division are applicable.
• Relational operations are also possible.
Examples:
IQ
Temperature in °F.
Interval Scale
Numerically equal distances on the scale represent equal values in
the characteristic being measured. An interval scale contains all the
information of an ordinal scale, but it also allows you to compare the
differences between objects.
• It assumes that the measurements are made in equal units,
  i.e. gaps between whole numbers on the scale are equal.
  e.g. Fahrenheit and Celsius temperature scales
• An interval scale does not have a true zero.
  e.g. A temperature of "zero" does not mean that there
  is no temperature; it is just an arbitrary zero point.
• Permissible statistics: counts/frequencies, mode, median,
  mean, standard deviation
Ratio Scales

• Level of measurement which classifies data that can be ranked,


differences are meaningful, and there is a true zero. True ratios
exist between the different units of measure.
• All arithmetic and relational operations are applicable.
Examples: Weight
Height
Number of students
Age

Primary Scales of Measurement

Scale      What is recorded                         Values for three runners
Nominal    Numbers assigned to runners              4             81             9
Ordinal    Rank order of winners                    Third place   Second place   First place
Interval   Performance rating on a 0 to 10 scale    8.2           9.1            9.6
Ratio      Time to finish, in seconds               15.2          14.1           13.4
STATISTICS
SCALE      DESCRIPTIVE                      INFERENTIAL
Nominal    Percentages, Mode                Chi-square, Binomial test
Ordinal    Percentile, Median               Rank-order correlation, ANOVA
Interval   Range, Mean, SD                  Correlations, t-tests, ANOVA, Regression, Factor Analysis
Ratio      Geometric Mean, Harmonic Mean    Coefficient of Variation (CV)
Exercise
Categorize the following variables as nominal, ordinal, interval or
ratio:

• Gender
• Grade (A, B, C, D and F)
• Rating scale (poor, good, excellent)
• Eye colour
• Political affiliation
• Religious affiliation
• Ranking of tennis players
• Major field
• Nationality
• Height
• Weight
• Time
• Age
• IQ
• Temperature
• Salary

Chapter Two: Methods of Data Collection and Presentation

Wullo S. (MPH)
Reading assignment
Methods of data collection

 The most common modes of collecting data can be


summarized as:
Observation
Personal (Face-to-Face) interview
Telephone
Group administered surveys
Mail
Web based survey
Combination of Methods
Methods of Data Presentation

 Objectives of the chapter


 After completing this chapter, the student will be able to:
– Identify the different methods of medical and biological
data organization and presentation
– Identify the criterion for the selection of a method to
organize and present data
Organization and Presentation of data

• Having collected and edited the data, the next important step is to
organize it.
• The process of arranging data into classes or categories according to
similarities is called classification.
• Classification is a preliminary step that prepares the ground for proper
presentation of data.
• The presentation of data is broadly classified into the following two
categories:
  • Tabular presentation
  • Diagrammatic and graphic presentation.
Tabular presentation of data
• Frequency distribution: is the organization of raw
data in table form using classes and frequencies.
• Frequency: is the number of values in a specific class
of the distribution
• Raw data: recorded information in its original
collected form, whether it be counts or
measurements, is referred to as raw data.

Frequency distribution (F.D.)…

 Frequency distribution can be grouped or ungrouped

The Ungrouped Frequency Distribution is also classified as:


– Ungrouped FD for Categorical Variables
– Ungrouped FD for Discrete Variables

i. Categorical Frequency Distribution


– Data are classified according to non-numerical categories.
– Categories must be mutually exclusive and exhaustive.
– Used to present nominal and ordinal data.
Ungrouped FD for Categorical Variables

a) Nominal data: Here the construction is straightforward: count the
occurrences in each category and find the totals.
Example: The marital status of 60 adults classified as
single, married, divorced and widowed is presented in a
FD as below:

Marital status   Single   Married   Divorced   Widowed   Total
Frequency        25       20        8          7         60
Categorical F.D…

b) Ordinal data. The construction is identical to the
nominal case. However, the categories should be put
in an ordered manner.
Example: Satisfaction on teaching method in a class of
size 80 is presented in a FD as shown below.

Satisfaction   Very satisfied   Satisfied   Dissatisfied   Very dissatisfied   Total
Frequency      15               36          3              7                   60
Ungrouped FD for Discrete Variables

a) Ungrouped Discrete Frequency Distribution


– Count the number of times each possible value is repeated.
Example: In a survey of 30 families, the number of children per
family was recorded, giving the following data:
4 2 4 3 2 8 3 4 4 2 2 8 5 3 4 5 4 5 4 3 5 2 7 3 3 6
7 3 8 4.
Discrete/Ungrouped FD…

These individual observations can be arranged in ascending
order of magnitude to form an array: 2 2 2 2 2 3 3 3 3 3 3
3 4 4 4 4 4 4 4 4 5 5 5 5 6 7 7 8 8 8.
The distribution of children in 30 families would be:

No. of children       2   3   4   5   6   7   8   Total
No. of families (f)   5   7   8   4   1   2   3   30
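
The counting step above can be sketched in a few lines of Python. This is only an illustration: the function name `ungrouped_fd` is an assumption made for this example, and the data are the 30 family sizes from the array above.

```python
from collections import Counter

# Number of children per family for the 30 surveyed families (from the example).
children = [4, 2, 4, 3, 2, 8, 3, 4, 4, 2, 2, 8, 5, 3, 4,
            5, 4, 5, 4, 3, 5, 2, 7, 3, 3, 6, 7, 3, 8, 4]

def ungrouped_fd(values):
    """Return {value: frequency}, ordered by value (hypothetical helper)."""
    counts = Counter(values)            # count how often each value occurs
    return dict(sorted(counts.items()))

fd = ungrouped_fd(children)
for value, freq in fd.items():
    print(f"{value} children: {freq} families")
print("Total:", sum(fd.values()))
```

Sorting the counted values reproduces the order of the table, and the total of the frequencies should equal the number of families surveyed.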

Continuous/grouped F.D

b) Continuous Frequency Distribution


– Arise from continuous variables/data.
– Unlike for a discrete FD, a class can not be allocated to
each value of a continuous variable.
– Categories in to which the observations are distributed
are called classes or class intervals.
– Classes should be exhaustive and mutually exclusive.

Continuous/grouped F.D…
Example: Consider the following FD on wages of 100
workers in a factory.

Wage (CI)   40-44       45-49       50-54       55-59       60-64       65-69       70-74       75-79
Freq.       6           9           15          17          20          13          12          8
CBs         39.5-44.5   44.5-49.5   49.5-54.5   54.5-59.5   59.5-64.5   64.5-69.5   69.5-74.5   74.5-79.5

– Workers earning between 40 and 44 birr (inclusive)
are grouped into the first class, workers earning
between 45 and 49 birr (inclusive) are grouped into
the second class, and so on.
Continuous/grouped F.D…

Steps in constructing continuous frequency distribution

1. Determine the number of classes (k): the number of class
intervals into which the data are grouped.

– Decide k with the help of Sturges' rule:
k = 1 + 3.322 log n, rounded up to the nearest integer,
where n is the number of observations and
log is the common logarithm (logarithm to base 10).
Example: If n = 10, k = 4.32 ≈ 5; if n = 100, k = 7.644 ≈ 8; if n =
1000, k = 10.97 ≈ 11.
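
As a quick check of the rule, the following minimal Python sketch computes k and rounds it up. The function name `sturges_classes` is an assumption made for this illustration.

```python
import math

def sturges_classes(n):
    """Number of classes by Sturges' rule, k = 1 + 3.322 * log10(n), rounded up."""
    return math.ceil(1 + 3.322 * math.log10(n))

for n in (50, 100, 1000):
    print(n, "observations ->", sturges_classes(n), "classes")
```

Using `math.ceil` implements the "rounded up to the nearest integer" part of the rule, so any fractional value of k moves to the next whole number of classes.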

Continuous/grouped F.D…

2. Determine the class width (w): the difference between
successive lower class limits, successive upper class limits,
or the corresponding class boundaries.
We use $w = \dfrac{\text{Range}}{k}$.
Note that w is rounded up to the nearest integer.
Continuous/grouped F.D…

3. Determine the class limits
• The class limits are the lowest and highest values that can be included in a
class, chosen so that there is a gap between successive classes.
• The lower class limit of the first class should be the smallest
value of the observations.
• Add the class width to the lower class limit to obtain the
lower class limit of the next higher class.
Cont…
• To find the upper limit of the first class, subtract U from
the lower limit of the second class. Then keep adding
the class width to this upper limit to find the rest of the
upper limits, or
• Obtain the upper class limits by adding the class width minus
U to the corresponding lower class limits, i.e. UCL = LCL + (w − U),
where U is the unit of measurement. With U = 1 this is UCL = LCL + (w − 1).
Continuous/grouped F.D…
4. Determine the Class boundaries
– Let U =LCL of a class – UCL of preceding class. Add half of
this difference (U/2) to all upper class limits to get the upper
class boundaries (UCBs), and subtract (U/2) from all lower
class limits to get the lower class boundaries (LCBs).
– UCBi = UCLi +U/2
– LCBi = LCLi – U/2

Here U, the unit of measurement, is the distance between two
possible consecutive measurements. It is usually taken as 1, 0.1,
0.01, 0.001, ...
Cont…
5. Class mark (C.M) or midpoint: the average of the
lower and upper class limits, or the average of the upper and lower
class boundaries.
6. Determine the frequency of each class: determined simply
by counting the number of observations belonging to each
class.
7. Cumulative frequency: the number of observations less
than/more than or equal to a specific value.
8. Cumulative frequency above: the total frequency of
all values greater than or equal to the lower class boundary of
a given class.
9. Cumulative frequency below: the total frequency of
all values less than or equal to the upper class boundary of a
given class.
10. Relative frequency (rf): the frequency divided by the
total frequency.
11. Relative cumulative frequency (rcf): the cumulative
frequency divided by the total frequency.
Example 1
The blood glucose level for 50 patients is shown below.

Construct a frequency distribution for the following data.

44 50 79 63 66 54 56 70 56 63

60 87 60 70 59 60 62 88 71 53

56 65 74 80 51 83 69 77 69 50

58 42 43 85 43 75 55 60 58 49

72 67 55 77 48 45 61 47 44 61
Solution:

Step 1: Find the highest and the lowest value H=88, L=42

Step 2: Find the range; R=H-L=88-42=46.

Step 3: Select the number of classes desired using

Sturges formula;

k=1+3.322log (50) =6.64=7(rounding up)

Step 4: Find the class width; w=R/k=46/7=6.57=7

(rounding up)
….Cont
Step 5: Select the starting observation as lowest class limit (this is
usually the lowest observation). Add the width to that observation
to get the lower limit of the next class. Keep adding until there are
7 classes.
42, 49, 56, 63, 70, 77, 84 are the lower class limits.
Step 6: Find the upper class limits; e.g. the first upper class limit = 49 −
U = 49 − 1 = 48.
48, 55, 62, 69, 76, 83, 90 are the upper class limits.
So combining step 5 and step 6, one can construct the following classes.

Class limits
42-48
49-55
56-62
63-69
70-76
77-83
84-90
Step 7: Find the class boundaries by subtracting 0.5 from each lower class limit and
adding 0.5 to each upper class limit, as shown:
$LCB_i = LCL_i - U/2$ and $UCB_i = UCL_i + U/2$
Example: For class 1, $LCB_1 = 42 - 0.5 = 41.5$ and $UCB_1 = 48 + 0.5 = 48.5$.
Then continue adding w to both boundaries to obtain the remaining boundaries.
Doing so gives the following classes.
Class boundary
41.5 – 48.5
48.5 – 55.5
55.5 – 62.5
62.5 – 69.5
69.5 – 76.5
76.5 – 83.5
83.5 – 90.5
Step 8: Tally the data.
Step 9: Write the numeric values for the tallies in the
frequency column.
Step 10: Find cumulative frequency.
Step 11: Find relative frequency and /or relative
cumulative frequency.
The complete frequency distribution follows

Class limits   Class boundary   Class mark   Freq.   <CF   >CF   RF     <RCF   >RCF
42-48          41.5 – 48.5      45           8       8     50    0.16   0.16   1
49-55          48.5 – 55.5      52           8       16    42    0.16   0.32   0.84
56-62          55.5 – 62.5      59           13      29    34    0.26   0.58   0.68
63-69          62.5 – 69.5      66           7       36    21    0.14   0.72   0.42
70-76          69.5 – 76.5      73           6       42    14    0.12   0.84   0.28
77-83          76.5 – 83.5      80           5       47    8     0.10   0.94   0.16
84-90          83.5 – 90.5      87           3       50    3     0.06   1      0.06
Total                                        50                  1
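
The steps of this example can be traced with a short Python sketch. It is an illustration under stated assumptions, not a general-purpose routine: the function name `grouped_fd` and its `unit` parameter (the unit of measurement U, taken as 1) are introduced here for the example, and the class width is rounded up to a whole number as in Step 4.

```python
import math

# Blood glucose levels of the 50 patients in Example 1.
glucose = [44, 50, 79, 63, 66, 54, 56, 70, 56, 63,
           60, 87, 60, 70, 59, 60, 62, 88, 71, 53,
           56, 65, 74, 80, 51, 83, 69, 77, 69, 50,
           58, 42, 43, 85, 43, 75, 55, 60, 58, 49,
           72, 67, 55, 77, 48, 45, 61, 47, 44, 61]

def grouped_fd(data, unit=1):
    """Build class limits, boundaries, marks and frequencies (hypothetical helper)."""
    k = math.ceil(1 + 3.322 * math.log10(len(data)))   # Sturges' rule, rounded up
    low, high = min(data), max(data)
    width = math.ceil((high - low) / k)                # class width, rounded up
    rows = []
    lcl = low                                          # first lower class limit
    while lcl <= high:                                 # keep adding classes until the data are covered
        ucl = lcl + width - unit                       # upper class limit
        rows.append({
            "limits": (lcl, ucl),
            "boundaries": (lcl - unit / 2, ucl + unit / 2),
            "mark": (lcl + ucl) / 2,
            "freq": sum(lcl <= x <= ucl for x in data),
        })
        lcl += width
    return rows

rows = grouped_fd(glucose)
for r in rows:
    print(r["limits"], r["boundaries"], r["mark"], r["freq"])
```

The loop adds classes until the largest observation is covered, so in rare cases it may produce one more class than Sturges' rule suggests; for these data it yields the seven classes of the table.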
Continuous/grouped F.D…

Example 2: Construct a continuous FD for the following
raw data of ages of patients admitted at Felege Hiwot
Hospital in a given week.

57, 53, 65, 55, 50, 45, 64, 52, 16, 46,
42, 63, 33, 64, 53, 25, 54, 35, 48, 55,
70, 47, 39, 58, 52, 36, 65, 75, 26, 20,
55, 60, 83, 61, 45, 63, 49, 42, 35, 18,
51, 45, 42, 65, 39, 59, 45, 41, 30, 40.
Continuous/grouped F.D…
Solution:
i. Using Sturges' rule, the number of classes is:
k = 1 + 3.322 log 50 = 6.64 ≈ 7.
ii. Range = highest value – lowest value
= 83 – 16 = 67.
iii. Class width: $w = \dfrac{\text{Range}}{k} = \dfrac{67}{7} = 9.57 \approx 10$
iv. Since the smallest value is 16, LCL1 is 16 and
UCL1 is 25; the frequency distribution would look like:
Continuous/grouped F.D…
Here is the FD:
Ages Freq.
16-25 4
26-35 5
36-45 12
46-55 14
56-65 12
66-75 2
76-85 1
Total 50
Continuous/grouped F.D…
Example: The class marks and class boundaries of the
above Example are:
CL Freq. CM CB
16-25 4 20.5 15.5-25.5
26-35 5 30.5 25.5-35.5
36-45 12 40.5 35.5-45.5
46-55 14 50.5 45.5-55.5
56-65 12 60.5 55.5-65.5
66-75 2 70.5 65.5-75.5
76-85 1 80.5 75.5-85.5
Total 50
Continuous/grouped F.D…
Cumulative frequency distributions
 Tells us how often the values fall below or above
that class. There are two types of CFD:

The “less than” cumulative F.D.


 Obtained by adding the frequency of all the
preceding classes including the frequency of that
class.
The “more than” cumulative F.D.
 Obtained by adding the frequency of the succeeding
classes including the frequency of that class.

Continuous/grouped F.D…
Example: For the data in the above Example, both
cumulative frequency distributions are given below:

Less than cum. freq.      More than cum. freq.
Age       Cum. freq.      Age       Cum. freq.
<25.5     4               >15.5     50
<35.5     9               >25.5     46
<45.5     21              >35.5     41
<55.5     35              >45.5     29
<65.5     47              >55.5     15
<75.5     49              >65.5     3
<85.5     50              >75.5     1
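
Both cumulative distributions can be obtained from the class frequencies by running sums. The sketch below uses the frequencies of the age example; the variable names are assumptions made for this illustration.

```python
from itertools import accumulate

# Class frequencies of the age example (classes 16-25, 26-35, ..., 76-85).
freqs = [4, 5, 12, 14, 12, 2, 1]

# "Less than" type: running total from the first class downwards.
less_than = list(accumulate(freqs))

# "More than" type: running total from the last class upwards.
more_than = list(accumulate(reversed(freqs)))[::-1]

total = sum(freqs)
relative_less_than = [cf / total for cf in less_than]

print(less_than)
print(more_than)
print(relative_less_than)
```

The last "less than" value and the first "more than" value both equal the total frequency, which is a convenient check on the arithmetic.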

Diagrammatic and Graphical Methods of Data
Presentation
A F.D can be presented graphically or diagrammatically.
Advantages
• To understand the information easily.
• To make the data attractive.
• To make comparisons of items easy.
• To draw attention of the observer.
 The purpose of graphs and diagrams is not to provide exact and
detailed information, but simple comparisons. Any further
information shall rather be obtained from the original data.
2.2.2 Diagrammatic Presentation of Data

Diagrams are appropriate for presenting discrete as well as


qualitative data.
The three most commonly used diagrammatic presentation of
data are:
 Pie charts
Bar charts
 Pictograms
Pie Chart

• A pie chart can be used to compare the relation between
the whole and its components.
• A pie chart is suitable for depicting discrete variables
with relatively few categories.
• A pie chart is a circular diagram in which the area of each
sector represents a component.
Cont…
Steps in constructing a pie-chart
1. Construct a frequency table

2. Change the frequency into percentage (P) or fraction (F)


3. Change the percentages into degrees, where:
   $\text{Angle of sector} = \dfrac{\text{Component part}}{\text{Total}} \times 360^\circ$

4. Draw a circle and divide it accordingly
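
The conversion of frequencies into sector angles can be sketched as follows. The data are the marital status frequencies from the earlier tabular example, reused here purely for illustration; the function name `sector_angles` is an assumption.

```python
# Marital status of 60 adults (from the categorical frequency distribution example).
marital_status = {"Single": 25, "Married": 20, "Divorced": 8, "Widowed": 7}

def sector_angles(freqs):
    """Angle of each sector in degrees: component part / total x 360."""
    total = sum(freqs.values())
    return {category: f * 360 / total for category, f in freqs.items()}

angles = sector_angles(marital_status)
for category, angle in angles.items():
    print(f"{category}: {angle} degrees")
```

The angles should always add up to 360 degrees, which gives a simple check before drawing the circle.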
Example

Example2.4: The following table gives the details of monthly


budget of a family. Represent these figures by a suitable
diagram.
[Figure: pie chart of the family's monthly budget, with sectors for food, clothing, house rent, fuel and light, and miscellaneous; the sector values shown are 300, 600, 100, 400 and 100]

Bar Chart

The bar chart (simple, multiple or stacked) is
used to represent and compare the frequency distributions
of discrete variables and attributes or categorical
series.
• Vertical or horizontal bars represent the
frequencies of a distribution. When drawing a bar chart,
the following points should be considered.
Cont…
Tips for constructing bar chart
1. Whenever possible it is better to construct a bar diagram on
a graph paper
2. All bars drawn in any single study should be of the same
width

3. The different bars should be separated by equal distances


4. All the bars should rest on the same line called the base
5. Whenever possible, it is advisable to draw bars in order of
magnitude
Simple Bar Chart

• is used to represent data involving only one
variable classified on a spatial, quantitative or
temporal basis.
Example Draw simple bar diagram to represent the
profits of a bank for 5 years.
Multiple Bar Chart

• is used when two or more sets of inter-related data are represented
(a multiple bar diagram facilitates comparison between more than
one phenomenon).

Example: Draw a multiple bar chart to represent the imports and
exports of Canada (values in $) for the years 1991 to 1995.
Component Bar Chart
• is used to represent data in which the total magnitude is
divided into different parts or components.
Example 2.7: The table below shows the quantity, in hundreds of kg,
of wheat, barley and oats produced on a certain farm during
the years 1991 to 1994. Draw a component (stacked) bar chart.
2.2.3 Graphical Presentation of data
The histogram, frequency polygon and cumulative frequency graph
(ogive) are most commonly applied graphical representation for
continuous data.
Procedures for constructing statistical graphs
• Draw and label the X and Y axes.
• Choose a suitable scale for the frequencies or cumulative
frequencies and label it on the Y axes.
• Represent the class boundaries for the histogram or ogive and the
mid points for the frequency polygon on the X axes.
• Plot the points.
Box plots

• A box (box-and-whisker) plot is a visual display that conveys
a fair amount of information about the distribution of
a set of data.
• It is used as an exploratory data analysis tool.
• The box shows the distance between the first and the third
quartiles.
• The median is marked as a line within the box.
• The end lines show the minimum and maximum values
respectively.
Box plots cont…
A box plot displays the five-number summary:

The minimum entry


Q1
Q2 (median)
Q3
The maximum entry
The quartiles are values which divide the ordered
distribution into four parts such that there is an equal
number of observations in each part.
• Q1 = the [(n+1)/4]th ordered observation
• Q2 = the [2(n+1)/4]th ordered observation
• Q3 = the [3(n+1)/4]th ordered observation
Box plots cont…

Example: Use the following age data of 15 patients to draw a box-


and-whisker plot.

35 35 36 37 37 38 42 43 43 44 45 48 48 51 55

With n = 15: Min = 35, Q1 = 4th value = 37, Q2 = 8th value = 43, Q3 = 12th value = 48, Max = 55.
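
The five-number summary for these ages can be checked with a small Python sketch based on the position formulas [(n+1)/4]th, [2(n+1)/4]th and [3(n+1)/4]th. The function name `five_number_summary` is an assumption, and the linear interpolation used when a position is not a whole number is one common convention that the slides do not specify. The sketch also assumes at least three observations.

```python
ages = [35, 35, 36, 37, 37, 38, 42, 43, 43, 44, 45, 48, 48, 51, 55]

def five_number_summary(data):
    """Return (min, Q1, Q2, Q3, max) using the (n+1)/4 position rule (n >= 3 assumed)."""
    x = sorted(data)
    n = len(x)

    def at(position):
        # position is 1-based and may be fractional
        whole = int(position)
        frac = position - whole
        if frac == 0:
            return x[whole - 1]
        # assumed convention: interpolate between the two neighbouring values
        return x[whole - 1] + frac * (x[whole] - x[whole - 1])

    return x[0], at((n + 1) / 4), at(2 * (n + 1) / 4), at(3 * (n + 1) / 4), x[-1]

summary = five_number_summary(ages)
print(summary)
```

For n = 15 the three positions are the 4th, 8th and 12th ordered values, so no interpolation is needed in this example.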
[Figure: box plot illustrating the ages of the 15 patients. Notice the distribution of the data in each quarter (the distance between quartiles).]
A box-plot indicating the distribution of blood lead level of
individuals by sex

Histogram

• A graph which places the class boundaries on the


horizontal axis and the frequencies on a vertical axis
• Class marks and class limits are sometimes used as the
quantity on the X axis.
• Example: Construct a histogram using the
following data.
Example*:
The blood glucose level for 50 patients is shown below.
Construct a frequency distribution for the following data.

44 50 79 63 66 54 56 70 56 63

60 87 60 70 59 60 62 88 71 53

56 65 74 80 51 83 69 77 69 50

58 42 43 85 43 75 55 60 58 49

72 67 55 77 48 45 61 47 44 61
Example*

Class limits   Class boundary   Class mark   Freq.   <CF   >CF   RF     <RCF   >RCF
42-48          41.5 – 48.5      45           8       8     50    0.16   0.16   1
49-55          48.5 – 55.5      52           8       16    42    0.16   0.32   0.84
56-62          55.5 – 62.5      59           13      29    34    0.26   0.58   0.68
63-69          62.5 – 69.5      66           7       36    21    0.14   0.72   0.42
70-76          69.5 – 76.5      73           6       42    14    0.12   0.84   0.28
77-83          76.5 – 83.5      80           5       47    8     0.10   0.94   0.16
84-90          83.5 – 90.5      87           3       50    3     0.06   1      0.06
Total                                        50                  1
Histogram
[Figure: histogram of the data in Example*. X axis: Blood Glucose Level (class boundaries 41.5 – 48.5, 48.5 – 55.5, 55.5 – 62.5, 62.5 – 69.5, 69.5 – 76.5, 76.5 – 83.5, 83.5 – 90.5); Y axis: Number of Patients (0 to 14)]
Frequency Polygon
A line graph of class frequencies plotted against class marks.
To draw a frequency polygon, connect the midpoints of the tops of
the histogram bars with straight lines.
[Figure: frequency polygon. X axis: Class Marks (Blood Glucose Level), 38 to 94 in steps of 7; Y axis: Frequency (Number of Patients), 0 to 14]
Ogive (cumulative frequency polygon)
• A graph showing the cumulative frequency (less than or more than
type) plotted against upper or lower class boundaries respectively.
• That is class boundaries are plotted along the horizontal axis and
the corresponding cumulative frequencies are plotted along the
vertical axis.
• The points are joined by a free hand curve.
• Example: Draw an ogive curve(less than type) for the above data.
(Example *)
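
The points of a less-than ogive for Example* can be listed before plotting. In this sketch the curve is started at the lowest class boundary with a cumulative frequency of zero, which is a common convention and an assumption here; the variable names are also introduced only for the illustration.

```python
from itertools import accumulate

# Class boundaries and frequencies from the Example* table.
boundaries = [41.5, 48.5, 55.5, 62.5, 69.5, 76.5, 83.5, 90.5]
freqs = [8, 8, 13, 7, 6, 5, 3]

# Pair each boundary with the number of observations below it.
cumulative = [0] + list(accumulate(freqs))
ogive_points = list(zip(boundaries, cumulative))

for boundary, cf in ogive_points:
    print(f"below {boundary}: {cf}")
```

Joining these points in order gives the less-than ogive; the final point should sit at the total frequency of 50.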
[Figure: ogive graph (cumulative less-than type). X axis: class boundaries from 41.5 to 90.5; Y axis: cumulative frequency, 0 to 60]
University of Gondar
College of Medicine and Health Sciences
Institute of Public Health
Department of Epidemiology and Biostatistics

Chapter three: Descriptive statistics


Prepared BY: Department of Epidemiology and Biostatistics

University of
Gondar, Ethiopia
Learning outcomes
• After completing this chapter a student will be able to:
• List and calculate measures of central tendency
• List and calculate measures of dispersion
• List and calculate measures of shape
• Use SPSS software for summary measures
A good average should possess the following properties:

– It should be rigidly defined.
– It should be based on all observations under
investigation.
– It should be affected as little as possible by extreme
observations.
– It should be capable of further algebraic
treatment.
– It should be affected as little as possible by fluctuations
of sampling.
– It should be easy to calculate and simple to
understand.
Measures of Central Tendency/ Measures of Location

• Measures of central tendency: the various
methods of determining the actual value at
which the data tend to concentrate.
• Hence, a measure of central tendency is a value
which sums up or describes the mass of
the data in a single value.
• Measures of central tendency include:
– Mean,
– Median and
– Mode.
Arithmetic Mean/Simple Mean ($\bar{x}$)
• Definition: the arithmetic mean is the sum of all observations
divided by the number of observations. It is usually denoted
by $\bar{x}$.

Population mean: $\mu = \dfrac{\sum x_i}{N}$, if the x's are population observations.

• Let $X_1, X_2, \ldots, X_n$ be the n
measurements obtained from n subjects.
Then the mean for ungrouped
measurements for n subjects is defined as:
$\bar{x} = \dfrac{\sum_{i=1}^{n} X_i}{n}$
Example
• Consider the data on birth weight (in kilograms) of 10 newborn
children at University of Gondar
Hospital:
2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88,
2.43.
Then the average birth weight can be computed as:
$\bar{x} = \dfrac{2.51 + 3.01 + \cdots + 2.43}{10} = \dfrac{25.72}{10} = 2.572$ kg
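
The same calculation can be done in a couple of lines of Python, shown here as a minimal sketch; the variable names are assumptions made for the illustration.

```python
# Birth weights (kg) of the 10 newborns in the example.
birth_weights_kg = [2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43]

# Arithmetic mean: sum of the observations divided by their number.
mean_weight = sum(birth_weights_kg) / len(birth_weights_kg)
print(round(mean_weight, 3), "kg")
```

Rounding is applied only when printing, so the stored mean keeps full precision for any further calculation such as the variance.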

Arithmetic mean cont…
When the data are arranged or given in the
form of a frequency distribution, i.e. there are k
distinct values such that a value $x_i$ has
frequency $f_i$ (i = 1, 2, ..., k), the arithmetic
mean is given by
$\bar{x} = \dfrac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$
Example

Solution:
$\bar{x} = \dfrac{\sum f_i x_i}{\sum f_i} = \dfrac{18 \times 2 + 19 \times 1 + 20 \times 4 + \dots + 29 \times 12}{2 + 1 + 4 + \dots + 12} = \dfrac{3180}{130} \approx 24.46$
Exercise
• Consider the following frequency distribution
table
Data value 10 20 30 40 50 60 70 80 90 100 110
Frequency 3 5 6 8 10 12 15 10 10 12 5

Calculate the mean of this data set.
Mean for Grouped Data?

• In calculating the mean from grouped data, we
assume that all values falling into a particular
class interval are located at the midpoint of that
interval.
Therefore, the mean for grouped data is calculated as:
$\bar{x} = \dfrac{\sum f_i x_m}{n}$
where $x_m$ is the midpoint of a class, $f_i$ is its frequency and $n = \sum f_i$.
Mean for Grouped Data
Example: calculate the mean for the grouped
distribution table given below:
Class Frequency
6-10 35
11-15 23
16-20 15
21-25 12
26-30 9
31-35 6

Example cont…
• Solution
Class    Class midpoint (Xm)   Frequency (fi)   fi × Xm
6-10            8                  35             280
11-15          13                  23             299
16-20          18                  15             270
21-25          23                  12             276
26-30          28                   9             252
31-35          33                   6             198
Total                             100           1,575

• Therefore, $\bar{x} = \frac{\sum f_i x_m}{n} = \frac{1575}{100} = 15.75$
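The grouped-mean calculation can be sketched as follows. This is a minimal illustration using the class limits and frequencies of the example; the helper name `grouped_mean` and the tuple layout are assumptions made for this sketch.

```python
# (lower limit, upper limit, frequency) for each class in the example.
classes = [(6, 10, 35), (11, 15, 23), (16, 20, 15),
           (21, 25, 12), (26, 30, 9), (31, 35, 6)]


def grouped_mean(classes):
    """Mean of grouped data, treating every value as its class midpoint."""
    total_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    n = sum(f for _, _, f in classes)
    return total_fx / n


mean_grouped = grouped_mean(classes)
print(mean_grouped)  # 15.75
```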
Properties of the arithmetic mean
– The mean can be used as a summary measure for both discrete and continuous data; in general, however, it is not appropriate for nominal or ordinal data.
– For a given set of data there is one and only one arithmetic mean.
– The algebraic sum of the deviations of the given values from their arithmetic mean is always zero.
– The mean is used in computing other statistics, such as the variance.
– The mean is affected by extremely high or low values, called outliers, and may not be the appropriate average to use in these situations.
Reading assignment
 What is Geometric mean?
What is harmonic mean?
 Combined mean
Weighted mean

07/09/2021 103
Median
• An alternative measure of central location, perhaps
second in popularity to the arithmetic mean.
• Suppose there are n observations in a sample, ordered from smallest to largest. The median is then defined as follows:
• The median is a value such that at least half of the observations are less than or equal to it and at least half of the observations are greater than or equal to it.
• The median is the midpoint of the data array.
Median
Ungrouped data
• If the number of observations is odd, the median is the [(n+1)/2]th observation.
• If the number of observations is even, the median is the average of the two middle values, the (n/2)th and [(n/2)+1]th, i.e. $\tilde{x} = \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right)$
• To find the median of a data set:
– Arrange the data in ascending order.
– Find the middle observation of this ordered data.
Example 1 (n is even): 19, 20, 20, 21, 22, 24, 27, 27, 27, 34
• Then, the median = (22 + 24)/2 = 23
Example 2
The number of children with asthma during a specific year in seven local district clinics is shown below. Find the median for this data set.
253, 125, 328, 417, 201, 70, 90
Solution:
First arrange the data in ascending order:
70, 90, 125, 201, 253, 328, 417
Since n = 7 is odd, the median is the (7+1)/2 = 4th observation, i.e. the median is 201.
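The two rules for the ungrouped median (odd and even n) can be sketched in Python. The function below is a plain illustration written for this note; Python's standard `statistics.median` gives the same results.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # the [(n+1)/2]th observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of (n/2)th and (n/2+1)th


even_example = [19, 20, 20, 21, 22, 24, 27, 27, 27, 34]
asthma_counts = [253, 125, 328, 417, 201, 70, 90]

print(median(even_example))   # 23.0
print(median(asthma_counts))  # 201
```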
Exercise
• The actual waiting time for the first job for a selected sample of nine people with different fields of specialization is given below.
Waiting time (in months): 11.6, 11.3, 10.7, 18.0, 3.3, 9.2, 8.3, 3.8, 6.8
• Calculate the median waiting time.
Median cont…
Median for grouped data.
- If data are given in the form of a continuous frequency distribution, the median is defined as:
$\tilde{x} = L_{med} + \frac{w}{f_{med}}\left(\frac{n}{2} - f_c\right)$
Where: $L_{med}$ = lower class boundary of the median class.
$f_{med}$ = the frequency of the median class.
$f_c$ = the cumulative frequency (less-than type) of the class preceding the median class.
w = the size (width) of the median class.
n = total number of observations.
Note: The median class is the class with the smallest cumulative frequency (less-than type) greater than or equal to n/2.
Median for grouped data cont…
Example; find the median for the following
distribution.
Class Frequency
40-44 7
45-49 10
50-54 22
55-59 15
60-64 12
65-69 6
70-74 4
Median for grouped data …
• Solution
Class Frequency Cumulative frequency
40-44 7 7
45-49 10 17
50-54 22 39
55-59 15 54
60-64 12 66
65-69 6 72
70-74 4 76
Total 76 76

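• n/2 = 76/2 = 38. The smallest cumulative frequency greater than or equal to 38 is 39, so the median class is 50-54.
• $L_{med}$ = 49.5, $f_{med}$ = 22, $f_c$ = 17, w = 5
$\tilde{x} = 49.5 + \frac{5}{22}(38 - 17) = 49.5 + 4.77 = 54.27$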
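A minimal sketch of the grouped-median formula, applied to the 40-44 … 70-74 distribution of the example. It assumes integer class limits with a gap of 1 between classes, so that the lower class boundary is the lower limit minus 0.5; the function name is illustrative only.

```python
# (lower limit, upper limit, frequency) for each class in the example.
classes = [(40, 44, 7), (45, 49, 10), (50, 54, 22), (55, 59, 15),
           (60, 64, 12), (65, 69, 6), (70, 74, 4)]


def grouped_median(classes):
    """Median = L_med + (w / f_med) * (n/2 - f_c)."""
    n = sum(f for _, _, f in classes)
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lo, hi, f in classes:
        if cumulative + f >= n / 2:       # first class reaching n/2 is the median class
            lower_boundary = lo - 0.5     # assumption: limits are integers, gap of 1
            width = hi - lo + 1
            return lower_boundary + width / f * (n / 2 - cumulative)
        cumulative += f


median_grouped = grouped_median(classes)
print(round(median_grouped, 2))  # 54.27
```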
Merits and Demerits of Median
Merits:
• The median is a positional average and hence is not influenced by extreme observations.
• It can be calculated in the case of open-ended intervals.
• The median can be used as a summary measure for ordinal, discrete and continuous data; in general, however, it is not appropriate for nominal data.
Demerits:
• It is not a good representative of the data if the number of items is small.
• It is not amenable to further algebraic treatment.
• It is vulnerable to sampling fluctuations.
Mode
• The mode is the value which occurs most frequently in a set of values.
• The mode may not exist, and even if it does exist, it may not be unique.
• In the case of a discrete distribution, the value having the maximum frequency is the modal value.
• If, in a set of observed values, all values occur once or an equal number of times, there is no mode.
• Examples:
1. Find the mode of 5, 3, 5, 8, and 9; Mode = 5
2. Find the mode of 8, 9, 9, 7, 8, 2, 5; Mode = 8 and 9
3. Find the mode of 4, 12, 3, 6, and 7. There is no mode.
Mode cont…

• NB: For grouped data the mode is reported as the modal class, i.e. the class with the largest frequency.
Mode for Grouped data
In grouped data, we usually refer to the modal class, the class with the highest frequency. If a single value for the mode of grouped data must be specified, it is taken as:
$\hat{x} = L + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times w$
Where L = the lower class boundary of the modal class;
$\Delta_1 = f_{mod} - f_1$, $\Delta_2 = f_{mod} - f_2$
w = the size of the modal class
f1 = frequency of the class preceding the modal class
f2 = frequency of the class succeeding the modal class
fmod = frequency of the modal class
The Mode…
Example: Calculate the modal age for the age distribution of 228 patients below.

Class interval   Number of women
15-19                  6
20-24                 19
25-29                 50
30-34                 57
35-39                 48
40-44                 27
45-49                 21
Total                228
The Mode…
Solution: By inspection (simply looking at the frequencies), the mode lies in the fourth class, where L = 29.5, fmod = 57, f1 = 50, f2 = 48, w = 5, and
$\Delta_1 = 57 - 50 = 7$, $\Delta_2 = 57 - 48 = 9$
Therefore, the modal age is
$\hat{x} = 29.5 + \frac{7}{7 + 9} \times 5 = 29.5 + 2.2 = 31.7$
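The modal-age calculation can be reproduced with a short sketch. As before, the lower class boundary is assumed to be the lower limit minus 0.5, and the function name `grouped_mode` is purely illustrative.

```python
# (lower limit, upper limit, frequency) for each age class in the example.
age_classes = [(15, 19, 6), (20, 24, 19), (25, 29, 50), (30, 34, 57),
               (35, 39, 48), (40, 44, 27), (45, 49, 21)]


def grouped_mode(classes):
    """Mode = L + d1 / (d1 + d2) * w for the class with the highest frequency."""
    freqs = [f for _, _, f in classes]
    i = freqs.index(max(freqs))                        # position of the modal class
    lo, hi, f_mod = classes[i]
    f1 = freqs[i - 1] if i > 0 else 0                  # preceding class
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0     # succeeding class
    d1, d2 = f_mod - f1, f_mod - f2
    return (lo - 0.5) + d1 / (d1 + d2) * (hi - lo + 1)


modal_age = grouped_mode(age_classes)
print(round(modal_age, 1))  # 31.7
```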
Properties of Mode
• The mode can be used as a summary measure for nominal, ordinal, discrete and continuous data; in general, however, it is more appropriate for nominal and ordinal data.
• It is not affected by extreme values.
• It can be calculated for distributions with open-ended classes.
• Sometimes its value is not unique.
• The main drawback of the mode is that it may not exist.
Merits and Demerits of Mode

Merits:
• It is not affected by extreme observations.
• It is easy to calculate and simple to understand.
• It can be calculated for distributions with open-ended classes.

Demerits:
• It is not rigidly defined.
• It is not based on all observations.
• It is not suitable for further mathematical treatment.
• It is not a stable average, i.e. it is affected by fluctuations of sampling to some extent.
• Often its value is not unique.
Quartiles

- Quartiles are measures that divide the frequency distribution into four equal parts.
- The values of the variable corresponding to these divisions are denoted Q1, Q2, and Q3, often called the first, second and third quartile respectively.
- Q1 is a value such that 25% of the items are less than or equal to it.
- Similarly, Q2 has 50% of the items with values less than or equal to it, and
- Q3 has 75% of the items with values less than or equal to it.
Quartile
•  Steps to calculate quartiles for ungrouped data;
Arrange the data in increasing order
If the number of observation is
A. odd: item,

B. Even:

07/09/2021 121
Quartiles

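• For grouped data:
$Q_i = L_{Q_i} + \frac{W}{f_{Q_i}}\left(\frac{iN}{4} - C\right), \quad i = 1, 2, 3$
where $L_{Q_i}$ is the lower class boundary of the ith quartile class, $f_{Q_i}$ its frequency, W its width, N the total number of observations and C the cumulative frequency of the class preceding it.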
Measure of variation/dispersion
Definition:
• The scatter or spread of the items of a distribution is known as dispersion or variation.
• In other words, the degree to which numerical data tend to spread about an average value is called the dispersion or variation of the data.
• Measures of dispersion are statistical measures which provide ways of measuring the extent to which data are dispersed or spread out.
Measure of variation cont…
A good measure of variation possesses the following properties:
• It should be easy to compute and understand.
• It should be based on all observations.
• It should be uniquely defined.
• It should be capable of further algebraic treatment.
• It should be affected as little as possible by extreme values.
Measure of variation Cont…
Absolute and relative measures
Measures of dispersion may be either absolute or relative.
1. Absolute measures of dispersion (AMD): An absolute measure is expressed in the same unit in which the original data are given, such as kilograms, tonnes, etc.
• These measures are suitable for comparing the variability in two distributions having variables expressed in the same units and of the same average size.
• These measures are not suitable for comparing the variability in two distributions having variables expressed in different units.
Measure of variation cont…

2. Relative measures of dispersion (RMD): used to compare the dispersion in two sets of data when the variables are measured in different units.
For example, we may wish to know, for a certain population, whether serum cholesterol levels, measured in milligrams per 100 ml, are more variable than body weight, measured in kilograms.
Furthermore, even when the same unit of measurement is used, the two measures of central tendency (means) may be quite different.
Types of measure of dispersion

Various measures of dispersion are in use. The most commonly used measures of dispersion are:

Absolute measures of dispersion
• Range
• Variance
• Quartile deviation
• Mean deviation
• Standard deviation

Relative measures of dispersion
• Relative range
• Coefficient of quartile deviation
• Coefficient of mean deviation
• Coefficient of variation
• Standard score
RANGE:
• It is the difference between the largest and the smallest observation in the data.
• Example: Consider the data on the weight (in kg) of 10 newborn children at the University of Gondar hospital within a month:
2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43
Solution:
• The range for the data set can be computed by first arranging all observations in ascending order:
1.98, 2.02, 2.33, 2.33, 2.43, 2.51, 2.88, 2.98, 3.01, 3.25
• Range = Maximum − Minimum = 3.25 − 1.98 = 1.27
Range cont…
• Because it is based on only the two extreme cases in the entire distribution, the range may change considerably if either of the extreme cases happens to drop out, while the removal of any other case would not affect it at all.
• It wastes information, since it takes no account of the rest of the data.
Quartiles and Inter-quartile Range, Percentiles

• The inter-quartile range (IQR) is the difference between the third and the first quartiles.
• Example: Consider the age data of 15 patients to find the IQR:
35 35 36 37 37 38 42 43 43 44 45 48 48 51 55
Q1 = 37 (4th value), Q2 = 43 (8th value), Q3 = 48 (12th value)
• IQR = Q3 − Q1 = 48 − 37 = 11
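The quartiles of the 15 ages can be checked with Python's standard library. This sketch relies on `statistics.quantiles`, whose default "exclusive" method places the ith quartile at position i(n+1)/4 of the ordered data; that this matches the rule intended in the slides is an assumption based on the values shown in the example.

```python
import statistics

# Ages of the 15 patients from the example.
ages = [35, 35, 36, 37, 37, 38, 42, 43, 43, 44, 45, 48, 48, 51, 55]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1

print(q1, q2, q3)  # 37.0 43.0 48.0
print(iqr)         # 11.0
```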
Quartile deviation (QD)
• The range expresses the extreme variability of the observations of a variable.
• The quartile deviation is half of the inter-quartile range.
• Inter-quartile range = Q3 − Q1
$QD = \frac{\text{Inter-quartile range}}{2} = \frac{Q_3 - Q_1}{2}$
Coefficient of quartile deviation (CQD)

It gives the average amount by which the two quartiles differ from the median, expressed relative to their sum:
$CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1}$
Variance and Standard Deviation
• The variance measures how far, on average, scores deviate or differ from the mean.
• The variance is the average of the squared distances of each value from the mean.
Variance
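I.e. the sample variance, denoted by $s^2$, of a set of n observed values having a mean $\bar{x}$ is the sum of the squared deviations divided by n − 1:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
For the case of a frequency distribution it is expressed as:
$s^2 = \frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i - 1}$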
Standard deviation
• There is a problem with the variance: because the deviations are squared, its units are also squared. To get back to the original unit of measurement, we take the square root:
$s = \sqrt{s^2} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Standard cont…
• Consider the following three datasets:
– Dataset 1: 7, 7, 7, 7, 7, 7; mean = 7, s.d. = 0
– Dataset 2: 6, 7, 7, 7, 7, 8; mean = 7, s.d. = 0.63
– Dataset 3: 3, 2, 7, 8, 9, 13; mean = 7, s.d. = 4.05
– The three datasets have the same mean but different variation.
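The three datasets can be compared with a short sketch that uses the sample standard deviation (divisor n − 1) from Python's `statistics` module. The variable names are only illustrative.

```python
import statistics

datasets = {
    "Dataset 1": [7, 7, 7, 7, 7, 7],
    "Dataset 2": [6, 7, 7, 7, 7, 8],
    "Dataset 3": [3, 2, 7, 8, 9, 13],
}

# statistics.stdev divides the sum of squared deviations by n - 1.
summary = {name: (statistics.mean(values), round(statistics.stdev(values), 2))
           for name, values in datasets.items()}

for name, (mean, sd) in summary.items():
    print(name, "mean =", mean, "s.d. =", sd)
```

All three means equal 7, while the rounded standard deviations come out as 0.0, 0.63 and 4.05.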
Example 1
Find the variance and standard deviation of the data set given below.
35, 45, 30, 35, 40, 25
Solution
First, find the mean: $\bar{x} = \frac{35 + 45 + 30 + 35 + 40 + 25}{6} = \frac{210}{6} = 35$
Next, subtract the mean from each value and square it:
X     X − x̄    (X − x̄)²
35      0         0
45     10       100
30     −5        25
35      0         0
40      5        25
25    −10       100
Cont…
• Sum up all the squared values: $\sum (X - \bar{x})^2 = 250$
Then divide the sum by (n − 1) to get the variance: $s^2 = \frac{250}{5} = 50$
Take the square root of the variance to get the standard deviation: $s = \sqrt{50} \approx 7.07$
Exercise
• The areas of sprayable surfaces with DDT from a sample of 15 houses are measured as follows (in m²):
101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145

Find the variance and standard deviation of the given data set.
Example 2
• Find the variance and the standard deviation for the frequency distribution given below.
Class         Frequency   Midpoint
5.5 – 10.5        1           8
10.5 – 15.5       2          13
15.5 – 20.5       3          18
20.5 – 25.5       5          23
25.5 – 30.5       4          28
30.5 – 35.5       3          33
35.5 – 40.5       2          38
Cont…
• Solution: the mean is $\bar{x} = \frac{\sum f_i x_m}{n} = \frac{490}{20} = 24.5$

Class       Frequency (fi)   Midpoint (xm)   fi·xm   fi·(xm − x̄)²
5.5-10.5         1                 8            8     1×(8−24.5)² = 272.25
10.5-15.5        2                13           26     2×(13−24.5)² = 264.5
15.5-20.5        3                18           54     3×(18−24.5)² = 126.75
20.5-25.5        5                23          115     5×(23−24.5)² = 11.25
25.5-30.5        4                28          112     4×(28−24.5)² = 49
30.5-35.5        3                33           99     3×(33−24.5)² = 216.75
35.5-40.5        2                38           76     2×(38−24.5)² = 364.5
Total          n = 20                          490     1,305
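Cont…
• Therefore, the variance is calculated based on the formula:
$s^2 = \frac{\sum f_i (x_m - \bar{x})^2}{n - 1} = \frac{1305}{19} \approx 68.68$
• The standard deviation is the square root of the variance:
$s = \sqrt{68.68} \approx 8.29$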
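The grouped variance can be sketched in Python using the class boundaries and frequencies of Example 2. The divisor n − 1 follows the sample-variance definition given earlier in the chapter; the function name is an assumption of this sketch.

```python
import math

# (lower boundary, upper boundary, frequency) for each class in Example 2.
classes = [(5.5, 10.5, 1), (10.5, 15.5, 2), (15.5, 20.5, 3), (20.5, 25.5, 5),
           (25.5, 30.5, 4), (30.5, 35.5, 3), (35.5, 40.5, 2)]


def grouped_variance(classes):
    """Sample variance of grouped data, using class midpoints and divisor n - 1."""
    n = sum(f for _, _, f in classes)
    mean = sum(f * (lo + hi) / 2 for lo, hi, f in classes) / n
    squared = sum(f * ((lo + hi) / 2 - mean) ** 2 for lo, hi, f in classes)
    return squared / (n - 1)


variance = grouped_variance(classes)
std_dev = math.sqrt(variance)
print(round(variance, 2), round(std_dev, 2))  # 68.68 8.29
```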
Special properties of standard deviation /variance

1. If the standard deviation of X1, X2, …, Xn is S, then the standard deviation of
a) $x_1 + k, x_2 + k, x_3 + k, \dots, x_n + k$ will also be S
b) $kx_1, kx_2, kx_3, \dots, kx_n$ would be $|k|S$
c) $a + kx_1, a + kx_2, a + kx_3, \dots, a + kx_n$ would be $|k|S$
Special properties of standard deviation

2. If a sample of n1 observations has a variance of $S_1^2$ and a sample of n2 observations has a variance of $S_2^2$, then the combined variance, called the pooled variance ($S_p^2$), is given by:
$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$
Coefficient of variation
• The standard deviation is an absolute measure of the deviation of observations around their mean and is expressed in the same unit as the data.
• Because of this, the standard deviation cannot be used directly to compare variability between data sets.
• The coefficient of variation is often used for this purpose.
• The coefficient of variation (CV) is defined by:
$CV = \frac{S}{\bar{x}} \times 100\%$
• The coefficient of variation is most useful in comparing the variability of several different samples, each with a different mean.
Examples:
1. An analysis of the monthly wages paid (in Birr) to
workers in two firms A and B belonging to the same
pharmaceutical industry gives the following results

Coefficient of variation cont…
Exercise
2. A meteorologist interested in the consistency of temperatures in three cities during a given week collected the following data. The temperatures for five days of the week in the three cities were:
City 1: 25 24 23 26 17
City 2: 22 21 24 22 20
City 3: 32 27 35 24 28
Which city has the most consistent temperature, based on these data?
When to use the coefficient of variation
• When the comparison groups have very different means (CV is suitable as it expresses the standard deviation relative to its corresponding mean)
• When different units of measurement are involved, e.g. group 1 is measured in mm and group 2 in gm (CV is suitable for comparison as it is unit-free)
• In such cases, the standard deviation should not be used for comparison
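A small sketch of how the coefficient of variation allows a unit-free comparison. The heights and weights below are made-up illustrative numbers, not data from the course, and the function name `coefficient_of_variation` is likewise an assumption.

```python
import statistics


def coefficient_of_variation(values):
    """CV = (sample standard deviation / mean) * 100, expressed in percent."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# Hypothetical measurements in different units.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)
print(round(cv_height, 1), round(cv_weight, 1))
```

Both hypothetical samples have the same standard deviation, yet the weights have the larger CV because their mean is smaller, which is exactly the situation in which the standard deviation alone would be misleading.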
Exercise
1. Based on the data set given below:
a. Calculate the mean, median and mode
b. Calculate the variance, standard deviation and coefficient of variation
15, 7, 13, 9, 10, 11
2. Calculate the variance and standard deviation for the data set given below:
5, 17, 12, 10, 8
Standard Score
If X is a measurement from a distribution with mean $\bar{X}$ and standard deviation S, then its value in standard units is
$Z = \frac{X - \bar{X}}{S}$
• Z gives the deviation from the mean in units of standard deviation.
• Z gives the number of standard deviations a particular observation lies above or below the mean.
• It is used to compare two observations coming from different groups.
Standard score cont..
Examples:
1. Two sections were given Introduction to Biostatistics examinations. The following information was given.
Value   Section 1   Section 2
Mean       78          90
SD          6           5
Student A from section 1 scored 90 and student B from section 2 scored 95. Relatively speaking, who performed better?
Cont…
Solution:
Calculate the standard score of both students.
$Z_A = \frac{90 - 78}{6} = 2$, $Z_B = \frac{95 - 90}{5} = 1$
• Student A performed better relative to his section, because the score of student A is two standard deviations above the mean score of his section, while the score of student B is only one standard deviation above the mean score of his section.
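The comparison of the two students can be sketched as a tiny helper. The function name `standard_score` is illustrative only.

```python
def standard_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


z_a = standard_score(90, mean=78, sd=6)  # student A, section 1
z_b = standard_score(95, mean=90, sd=5)  # student B, section 2
print(z_a, z_b)  # 2.0 1.0
```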
Standard score cont…
Exercise: Two groups of people were trained to perform a certain task and tested to find out which group is faster at learning the task. For the two groups the following information was given:
Value        Group one   Group two
Mean         10.4 min    11.9 min
Stand. dev.   1.2 min     1.3 min
Relatively speaking:
a) Which group is more consistent in its performance?
b) Suppose person A from group one takes 9.2 minutes while person B from group two takes 9.3 minutes. Who was faster in performing the task? Why?
Moments

• The rth moment about the mean (the rth central moment) is defined as:
$m_r = \frac{\sum (X_i - \bar{X})^r}{n}, \quad r = 0, 1, 2, \dots$
• For continuous grouped data it is given by:
$m_r = \frac{\sum f_i (X_i - \bar{X})^r}{n}$
where the Xi's are class marks.
Example:

Find the first three central moments of the numbers 2, 3 and 7.
Solution: mean = (2 + 3 + 7)/3 = 4
$m_1 = \frac{\sum (X_i - \bar{X})^1}{n} = \frac{(2 - 4) + (3 - 4) + (7 - 4)}{3} = 0$
$m_2 = \frac{\sum (X_i - \bar{X})^2}{n} = \frac{(2 - 4)^2 + (3 - 4)^2 + (7 - 4)^2}{3} = 4.67$
$m_3 = \frac{\sum (X_i - \bar{X})^3}{n} = \frac{(2 - 4)^3 + (3 - 4)^3 + (7 - 4)^3}{3} = 6$
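The central moments of 2, 3 and 7 can be verified with a short sketch; `central_moment` is a name chosen for this illustration.

```python
def central_moment(values, r):
    """rth moment about the mean: sum((x - mean)^r) / n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n


numbers = [2, 3, 7]
m1, m2, m3 = (central_moment(numbers, r) for r in (1, 2, 3))
print(m1, round(m2, 2), m3)  # 0.0 4.67 6.0
```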
Measures of shape
a. Skewness
• Skewness is the degree of asymmetry, or departure from symmetry, of a distribution.
• A skewed frequency distribution is one that is not symmetrical.
• Skewness is concerned with the shape of the curve, not its size.
• If the frequency curve (smoothed frequency polygon) of a distribution has a longer tail to the right of the central maximum than to the left, the distribution is said to be skewed to the right, or to have positive skewness.
Concept of skewness
• The skewness of a distribution is defined as
the lack of symmetry.
• In a symmetrical distribution, mean, median,
and mode are equal to each other.

07/09/2021 159
Skewness
• If the distribution has a longer tail to the left of the central maximum than to the right, it is said to be skewed to the left, or to have negative skewness.
• For a moderately skewed distribution, the following relation holds among the three commonly used measures of central tendency:
Mean − Mode = 3 × (Mean − Median)
Skewness
Measures of Skewness
Karl Pearson's coefficient of skewness (SK):
$S_k = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$ or, for moderately skewed distributions, $S_k = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}}$
• If SK = 0, then the distribution is symmetrical.
• If SK > 0, then the distribution is positively skewed.
• If SK < 0, then the distribution is negatively skewed.
Remarks Related with Skewness
• In a positively skewed distribution, smaller observations are more frequent than larger observations, i.e. the majority of the observations have a value below the average, and the distribution has a long tail in the positive direction.
• In a negatively skewed distribution, smaller observations are less frequent than larger observations, i.e. the majority of the observations have a value above the average.
Kurtosis
• Kurtosis is the degree of peakedness of a distribution, usually taken relative to a normal distribution.
• The peakedness of a distribution can be classified into three types:
• Leptokurtic:
- A distribution having a relatively high peak
- A large number of observations have the same values
• Mesokurtic:
- Normal peak
- The curve is properly peaked
• Platykurtic:
- Flat-topped
- A large number of observations have low frequency and are spread over the middle interval
Measures of kurtosis

• The moment coefficient of kurtosis:
$\beta_2 = \frac{m_4}{(m_2)^2}$
• If $\beta_2 = 3$, then the distribution is mesokurtic.
• If $\beta_2 > 3$, then the distribution is leptokurtic.
• If $\beta_2 < 3$, then the distribution is platykurtic.
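As a rough illustration of the moment coefficient of kurtosis, the sketch below reuses the numbers 2, 3 and 7 from the moments example. Three values are far too few to judge the shape of a real distribution; the point is only to show the arithmetic. The function names are assumptions of this sketch.

```python
def central_moment(values, r):
    """rth moment about the mean: sum((x - mean)^r) / n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n


def moment_kurtosis(values):
    """Moment coefficient of kurtosis: m4 / m2^2."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


def classify(coefficient):
    """Label the shape by comparing the coefficient with 3."""
    if coefficient > 3:
        return "leptokurtic"
    if coefficient < 3:
        return "platykurtic"
    return "mesokurtic"


kurtosis = moment_kurtosis([2, 3, 7])
print(round(kurtosis, 2), classify(kurtosis))  # 1.5 platykurtic
```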
University of Gondar
College of Medicine and Health Sciences
Institute of Public Health
Department of Epidemiology and Biostatistics

Chapter Four: Probability and probability distributions

Wullo S. (MPH)

7/9/21
Learning outcomes
After studying this chapter, the student will be able to:
4.1 Define basic terms in probability
4.2 Describe set theory and probability
4.3 Identify types of probability
4.4 Identify types of random variable and probability distribution
4.5 List common probability distributions and their
properties

07/09/2021 169
PROBABILITY CONCEPTS

Introduction
• Probability lays the foundation for statistical inference.
• This chapter provides a brief overview of the probability concepts necessary for understanding topics covered in the chapters that follow.
• It also provides a context for understanding the probability distributions used in statistical inference.
Basic Terms of Probability
• Probability can be defined as the chance of an event occurring.
• Probability experiment: a process that leads to well-defined results, i.e. an action through which specific results/outcomes (counts, measurements or responses) are obtained, although the result cannot be predicted with certainty in advance.
Example:
• Tossing a coin and observing the face showing up is a probability experiment.
• Outcome: the result of a single trial in a probability experiment. It is also called a simple event.
Example: the sex of a newborn delivered by a mother in the delivery room is either male or female.
Basic concepts con'td….

– Sample space: Each conceivable outcome of an experiment is called a sample point. The totality of all sample points is called the sample space and is denoted by S.
– Event: An outcome or a combination of outcomes of a random experiment is called an event. It is a subset of the sample space of a random experiment.
– Equally likely approach: If an experiment must result in n equally likely outcomes, then each possible outcome must have probability 1/n of occurring.
– Mutually exclusive events: events for which the occurrence of any one excludes the occurrence of the other. Mutually exclusive events cannot occur simultaneously.


Basic terms…

• Certain event: An event which is sure to occur.
• Impossible event: An event which cannot occur.
• Complement of an event: The complement of event A (read "not A") consists of all the sample points in the sample space that are not in A.
Exercise
Find the sample space for the gender of the children if a family has three children. Use B for boy and G for girl.
Also find:
a. The probability of obtaining at least two girls in a family
b. The probability of getting at most two boys in a family
c. The probability of getting one boy and two girls in a family
Types of probability
1. Classical (or theoretical) probability
• It is used when each outcome in a sample space is equally likely to occur.
• That is, if an experiment has n equally likely outcomes, then each possible outcome has probability 1/n of occurring. Equivalently, the probability of an event E is:
$P(E) = \frac{\text{number of outcomes in } E}{\text{total number of outcomes in the sample space}} = \frac{n(E)}{n(S)}$
Example: The probability of getting at least one female birth from two pregnant mothers is 3/4 = 0.75, since three of the four equally likely outcomes (MM, MF, FM, FF) include a female.
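The classical probability in the two-births example can be checked by enumerating the sample space. This sketch assumes single births with male and female equally likely, as the example implicitly does.

```python
from fractions import Fraction
from itertools import product

# Sample space for the sex of two newborns: MM, MF, FM, FF.
sample_space = list(product("MF", repeat=2))

# Event: at least one female birth.
event = [outcome for outcome in sample_space if "F" in outcome]

p_at_least_one_female = Fraction(len(event), len(sample_space))
print(p_at_least_one_female, float(p_at_least_one_female))  # 3/4 0.75
```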
Types of probability cont…
2. Empirical (or statistical) probability: is based on observations obtained from experiments, a large number of trials, or historical data.

Example:
• A medical doctor found that out of 100,000 patients who visited the hospital, 50 were cancer cases. What is the probability that a patient to be examined will be positive for cancer?
P(+ve for cancer) = 50/100,000 = 0.0005
Example 2
In a sample of 50 people, 21 had type O blood, 22 had type A
blood, 5 had type B blood, and 2 had type AB blood. Set up
a frequency distribution and find the following probabilities
a. A person has type O blood
b. A person has type A or type B blood
c. A person has neither type A nor type O blood
d. A person does not have type AB blood

07/09/2021 177
Solution
Blood type   Frequency
A               22
B                5
AB               2
O               21
Total           50

• P(O) = 21/50 = 0.42
• P(A) = 22/50 = 0.44
• P(A or B) = P(A) + P(B) = 22/50 + 5/50 = 27/50 = 0.54
• The remaining parts can be done in the same way.
• Union of events: The union of two events A and B, denoted by (A ∪ B), consists of all outcomes that are in A or in B or in both A and B.
• If A and B are two events, then
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
• If A and B are mutually exclusive, then P(A ∩ B) = 0 and
P(A ∪ B) = P(A) + P(B)
Example
In a hospital unit there are 8 nurses and 5 physicians; 7 nurses and 3 physicians are females. If a staff member is selected at random, find the probability that the subject is:
a. a nurse or a male
b. a physician or a female
• Solution
Staff        Male   Female   Total
Physician      2       3        5
Nurse          1       7        8
Total          3      10       13

a. The probability is
P(N ∪ M) = P(N) + P(M) − P(N ∩ M)
= 8/13 + 3/13 − 1/13
= 10/13
≈ 0.769
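The addition rule for the hospital-staff example can be sketched with exact fractions. The dictionary layout and the helper `prob` are assumptions made for this illustration; the counts come from the staff table.

```python
from fractions import Fraction

# (profession, gender) -> number of staff, from the table in the example.
staff = {("physician", "male"): 2, ("physician", "female"): 3,
         ("nurse", "male"): 1, ("nurse", "female"): 7}
total = sum(staff.values())


def prob(predicate):
    """Probability that a randomly selected staff member satisfies the predicate."""
    return Fraction(sum(c for key, c in staff.items() if predicate(*key)), total)


# a. nurse or male: P(N) + P(M) - P(N and M)
p_nurse_or_male = (prob(lambda job, sex: job == "nurse")
                   + prob(lambda job, sex: sex == "male")
                   - prob(lambda job, sex: job == "nurse" and sex == "male"))

# b. physician or female: P(P) + P(F) - P(P and F)
p_physician_or_female = (prob(lambda job, sex: job == "physician")
                         + prob(lambda job, sex: sex == "female")
                         - prob(lambda job, sex: job == "physician" and sex == "female"))

print(p_nurse_or_male, p_physician_or_female)  # 10/13 12/13
```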
3. Subjective probability
• In this view, probability is treated as a quantifiable level of belief ranging from 0 (complete disbelief) to 1 (complete belief).
• For instance, an experienced physician may say "this patient has a 50% chance of recovery."
• The various types of probability are not mutually exclusive. Fortunately, all obey the same mathematical laws, and their methods of calculation are similar.
• All probabilities are a type of relative frequency: the number of times something can occur divided by the total number of possibilities or occurrences.
Rules of Probability
• Any probability assigned must be a nonnegative number.
• The probability of the sample space (the collection of all possible outcomes) is equal to 1.
• The probability of A or B involves addition:
P(A or B) = P(A) + P(B) if the two are mutually exclusive.
• The probability of A and B involves multiplication:
P(A and B) = P(A) P(B) if the two are independent.
• P(not A) = 1 − P(A)
• P(at least one) = 1 − P(none)
• P(none) = [P(each event not happening)]^(number of events), when the events are independent and each has the same probability
• Intersection of events: The intersection of events A and B, denoted by (A ∩ B), consists of all outcomes that are in both A and B.
P(A ∩ B) = P(A) × P(B|A) if the two events are dependent
P(A ∩ B) = P(A) × P(B) if the two events are independent
Conditional probability
The probability that the second event B occurs given that the first event A has occurred can be found by dividing the probability that both events occurred by the probability that the first event has occurred. The formula is
P(B|A) = P(A ∩ B)/P(A), where P(A) ≠ 0
Similarly, P(A|B) = P(A ∩ B)/P(B), where P(B) ≠ 0
Special case: When events A and B are independent, then:
P(A|B) = P(A)
P(A ∩ B) = P(A)P(B)
Example
• In a certain high school class consisting of 60 girls and 40 boys, it is observed that 24 girls and 16 boys wear eyeglasses. If a student is picked at random from this class, the probability that the student wears eyeglasses, P(E), is 40/100, or 0.4.
a. What is the probability that a student picked at random wears eyeglasses, given that the student is a boy?
Solution: P(E|B) = P(E ∩ B)/P(B) = (16/100)/(40/100) = 0.4
b. What is the probability that a student picked at random wears eyeglasses, given that the student is a girl?
Solution: P(E|G) = P(E ∩ G)/P(G) = (24/100)/(60/100) = 0.4
Example
• The following data show the association between aspirin use and heart attack.
• Table 4.1: Data for treatment versus myocardial infarction
Treatment   Myocardial infarction
            Yes     No     Total
Placebo     100     500     600
Aspirin      60     900     960
Total       160    1400    1560
Let A be the event "positive for myocardial infarction" and B the event "aspirin used".
Example cont…
• Find:
A. P(A|B)
B. P(B|A)
C. Are the characteristics A and B independent?
Solution:
A. P(A|B) = P(A ∩ B)/P(B) = (60/1560) ÷ (960/1560) = 0.0625
B. P(B|A) = P(B ∩ A)/P(A) = (60/1560) ÷ (160/1560) = 0.375
C. To test independence, check whether P(A|B) = P(A), or equivalently whether P(A ∩ B) = P(A)P(B).
P(A|B) = 0.0625 whereas P(A) = 160/1560 = 0.103.
Since P(A|B) ≠ P(A) (0.0625 ≠ 0.103), the characteristics A and B are not independent, i.e. they are dependent.
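The conditional probabilities in the aspirin example can be recomputed from the counts in Table 4.1. This is only a sketch; the variable names are chosen for illustration.

```python
from fractions import Fraction

# Counts from Table 4.1.
n_total = 1560
n_mi = 160               # A: positive for myocardial infarction
n_aspirin = 960          # B: aspirin used
n_mi_and_aspirin = 60    # A and B

p_a = Fraction(n_mi, n_total)
p_b = Fraction(n_aspirin, n_total)
p_a_and_b = Fraction(n_mi_and_aspirin, n_total)

p_a_given_b = p_a_and_b / p_b   # P(A|B)
p_b_given_a = p_a_and_b / p_a   # P(B|A)

# A and B are independent only if P(A and B) equals P(A) * P(B).
independent = p_a_and_b == p_a * p_b

print(float(p_a_given_b), float(p_b_given_a), independent)  # 0.0625 0.375 False
```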
Counting rules of probability

• We have three different counting rules.


– Basic multiplication rule
– Permutations
– Combinations

07/09/2021 188
Factorial
For any positive integer n, n factorial denoted as
n! is defined as:
n! = n×(n-1) ×(n-2) ×…. ×3x2x1
e.g. 3!=3x2x1=6
5!= 5x4x3x2x1=120

07/09/2021 189
Permutation rules
• Permutation: the number of possible permutations is the number of different orders in which particular events can occur. The number of possible permutations of r objects taken from n is
$P(n, r) = \frac{n!}{(n - r)!}$
Example: $_6P_2 = \frac{6!}{(6 - 2)!} = \frac{6!}{4!} = 30$
Combination
The number of ways r objects can be chosen from a set of n objects without considering the order of selection is called the number of combinations of n objects taken r at a time, denoted by
$C(n, r) = \binom{n}{r} = \frac{n!}{r!(n - r)!}$
$C(8, 6) = \frac{8!}{6!\,2!} = 28$
$C(8, 0) = \frac{8!}{0!\,8!} = 1$
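The counting rules can be checked with Python's standard `math` module (`math.perm` and `math.comb` are available from Python 3.8).

```python
import math

# Factorials from the definition n! = n x (n-1) x ... x 2 x 1.
print(math.factorial(3), math.factorial(5))  # 6 120

# Permutations: ordered selections of r objects out of n, n! / (n - r)!.
p_6_2 = math.perm(6, 2)

# Combinations: unordered selections, n! / (r! (n - r)!).
c_8_6 = math.comb(8, 6)
c_8_0 = math.comb(8, 0)

print(p_6_2, c_8_6, c_8_0)  # 30 28 1
```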
Probability Distribution

Definition of random variables and probability distribution
• A variable is a characteristic or attribute that can assume different values.
• A random variable is a numerical-valued function defined on the sample space, i.e. a variable whose values are determined by chance.
• Generally, random variables are denoted by capital letters X, Y, Z, … and the values of the random variables are denoted by small letters x, y, z.
• Discrete random variables have a finite number of possible values or an infinite number of values that can be counted.
– The word counted means that they can be enumerated using the numbers 1, 2, 3, etc.
• Continuous random variables can assume all values in the interval between any two given values.
– They can take an infinite number of values, including decimal and fractional values.
Examples of discrete random variables:
• The number of heads when a coin is tossed "n" times.
• Number of car accidents per week.
• Number of defective items in a given company.
• Number of bacteria per two cubic centimeters of water.
Examples of continuous random variables:
• Height of students at a certain college.
• Mark of a student.
• Lifetime of a certain disease.
• Length of time required to complete a given training.
The probability distribution of a discrete random variable is a table, graph, formula, or other device used to specify all possible values of the random variable along with their respective probabilities.
Example:
Consider the experiment of tossing a coin three times. Let X be the number of heads. Construct the probability distribution of X.

X      0     1     2     3
P(x)  1/8   3/8   3/8   1/8
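The distribution of the number of heads in three tosses can be built by listing all eight equally likely outcomes. This sketch assumes a fair coin; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2^3 = 8 equally likely outcomes of tossing a fair coin three times.
outcomes = list(product("HT", repeat=3))

# Count how many outcomes give each number of heads.
head_counts = Counter(outcome.count("H") for outcome in outcomes)

distribution = {x: Fraction(count, len(outcomes))
                for x, count in sorted(head_counts.items())}

for x, p in distribution.items():
    print(x, p)  # 0 1/8, 1 3/8, 2 3/8, 3 1/8
```

The probabilities sum to 1, which is one of the two requirements a probability distribution must satisfy.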
Example 2:
Construct a probability distribution for rolling a single die.
Solution
Since the sample space is 1, 2, 3, 4, 5, 6 and each outcome has a probability of 1/6, the distribution is:

X      1     2     3     4     5     6
P(x)  1/6   1/6   1/6   1/6   1/6   1/6
Two requirements for probability distribution
• The sum of the probabilities of all events in the sample space must be equal to 1; i.e. $\sum P(X = x) = 1$
• The probability of each event in the sample space must be between or equal to 0 and 1; i.e. $0 \le P(X = x) \le 1$
Properties of continuous probability distribution
1.

1. The total area under the curve is one i.e.  f ( x)  1




2. P(a  X  b)  the area under the curve between the


point a and b.
3. P X   0
4. P X  a   0
5. P a  X  b  P a  X  b  P(a  X  b)  P(a  X  b)
b
P( a  x  b)    f(x) dx
a

Introduction to expectation
Definition: the expected value (also known as the
mean) of a random variable is a measure of the
central location of the random variable.
1. Discrete R.V
$E(X) = X_1 P(X_1) + X_2 P(X_2) + \dots + X_n P(X_n) = \sum_{i=1}^{n} X_i P(X_i)$
2. Continuous R.V
$E(X) = \int_{a}^{b} x f(x)\,dx$

Variance of a probability distribution
• The expected value of X is its mean:
Mean of X = E(X)
• The variance of X is given by:
$Var(X) = E(X^2) - [E(X)]^2$
where
$E(X^2) = \sum_{i=1}^{n} X_i^2 P(X_i)$ if X is discrete
$E(X^2) = \int_{x} x^2 f(x)\,dx$ if X is continuous
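
As a rough illustration of the discrete formulas, the sketch below applies E(X) and Var(X) = E(X²) − [E(X)]² to the coin-toss and die distributions tabulated earlier. The helper name `mean_and_variance` is an assumption made for this example.

```python
from fractions import Fraction


def mean_and_variance(dist):
    """dist maps each value x to P(X = x); returns (E(X), Var(X))."""
    mean = sum(x * p for x, p in dist.items())
    second_moment = sum(x**2 * p for x, p in dist.items())
    return mean, second_moment - mean**2


# Number of heads in three tosses of a fair coin.
coin = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
# Face shown by a single fair die.
die = {face: Fraction(1, 6) for face in range(1, 7)}

coin_mean, coin_var = mean_and_variance(coin)
die_mean, die_var = mean_and_variance(die)
print(coin_mean, coin_var)
print(die_mean, die_var)
```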
Example
Let X be a continuous R.V with distribution
$f(x) = \begin{cases} \frac{1}{2}x, & 0 < x < 2 \\ 0, & \text{otherwise} \end{cases}$

Then find
a) P (1<x<1.5)
b) E(x)
c) Var(x)
d) E (3x 2  2 x)
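
Parts (a) to (c) can be checked numerically. The sketch below assumes the density is f(x) = x/2 on 0 < x < 2 and uses a simple midpoint rule in place of exact integration; the helper names `pdf` and `integrate` are illustrative only.

```python
def pdf(x):
    """Density f(x) = x/2 on (0, 2) and 0 elsewhere."""
    return x / 2 if 0 < x < 2 else 0.0


def integrate(g, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    width = (b - a) / steps
    return sum(g(a + (i + 0.5) * width) for i in range(steps)) * width


prob = integrate(pdf, 1, 1.5)                        # (a) P(1 < X < 1.5)
mean = integrate(lambda x: x * pdf(x), 0, 2)         # (b) E(X)
second = integrate(lambda x: x**2 * pdf(x), 0, 2)    # E(X^2)
variance = second - mean**2                          # (c) Var(X)
print(round(prob, 4), round(mean, 4), round(variance, 4))
```

The exact values from integrating by hand are 5/16, 4/3 and 2/9, which the approximation should match to several decimal places.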

Discrete Probability Distribution
1. Binomial Distribution

A binomial experiment is a probability experiment that satisfies


the following four requirements called assumptions of a
binomial distribution.
– The experiment consists of n identical trials.
– Each trial has only one of the two possible mutually
exclusive outcomes, success or a failure.
– The probability of success does not change from trial to
trial, and
– The trials are independent, thus we must sample with
replacement

Binomial distribution Cont..
Definition: The outcomes of the binomial experiment and the
corresponding probabilities of these outcomes are called the
Binomial Distribution.
If the probability of success on an individual trial is p,
then the binomial probability is defined by:
$P(X = x) = \binom{n}{x} p^{x} (1-p)^{n-x}, \quad x = 0, 1, \dots, n$
Where:
– x = the number of successes
– p = probability of success
– n = the number of trials
– 1-p = probability of failure
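
A minimal sketch of the binomial formula, assuming the standard form with the binomial coefficient C(n, x). The function name `binomial_pmf` is chosen for this example; with n = 3 and p = 0.5 it should reproduce the three-coin-toss table shown earlier.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 3, 0.5  # three tosses of a fair coin
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
mean = sum(x * q for x, q in enumerate(pmf))
print(pmf)
print(mean)
```

The computed mean agrees with the well-known binomial result E(X) = np.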
• When using the binomial formula to solve problems, we have to identify three things:
– The number of trials (n)
– The probability of a success on any one trial (p)
– The number of successes desired (x)
• We call the distribution of the random variable X the Binomial
Distribution and often write X ~ Binom(n, p).

Example: Suppose that an examination consists of six true and
false questions, and assume that a student has no
knowledge of the subject matter. The probability that the
student will guess the correct answer to the first question is
30%. Likewise, the probability of guessing each of the
remaining questions correctly is also 30%.
a) What is the probability of getting exactly three correct
answers?
b) What is the probability of getting exactly two correct
answers?
c) What is the probability of getting at most two correct
answers?
d) What is the probability of getting less than five correct
answers?
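e) Find the expected value and standard deviation.

Solution:
Here n = 6 and p = 0.3, so $P(X = x) = \binom{6}{x}(0.3)^{x}(0.7)^{6-x}$.
a) $P(X = 3) = \binom{6}{3}(0.3)^3(0.7)^3 = 0.185220$
b) $P(X = 2) = \binom{6}{2}(0.3)^2(0.7)^4 = 0.324135$
c) $P(X \le 2) = P(X=0) + P(X=1) + P(X=2) = 0.117649 + 0.302526 + 0.324135 = 0.744310$
d) $P(X < 5) = 1 - P(X=5) - P(X=6) = 1 - 0.010206 - 0.000729 = 0.989065$
e) $E(X) = np = 6(0.3) = 1.8$ and $SD(X) = \sqrt{np(1-p)} = \sqrt{1.26} \approx 1.12$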
Exercise
1. Suppose 14 percent of mothers admitted to smoking one or
more cigarettes per day during pregnancy. If a random
sample of size 10 is selected from this population, what is
the probability that it will contain exactly four mothers who
admitted to smoking during pregnancy?
2. Suppose that 80% of adults with allergies report
symptomatic relief with a specific medication. If the
medication is given to 10 new patients with allergies, what is
the probability that it is effective in exactly seven? Assume
that the replications are independent.
2. Poisson distribution
The probability distribution of a Poisson random variable X
representing the number of successes occurring in a given time
interval or a specified region of space is given by the formula:
$P(X = k) = \frac{e^{-\lambda}\lambda^{k}}{k!}, \quad k = 0, 1, 2, \dots$
Where
• k = Number of successes per unit time
• e = The base of the natural logarithm
• λ = The expected number of successes per unit time

• If λ is the average number of successes occurring in a given time
interval or region in the Poisson distribution, then the mean and
the variance of the Poisson distribution are both equal to λ.
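
The statement that the mean and variance both equal λ can be checked numerically. This is only a sketch: the rate λ = 4 is a hypothetical value chosen for illustration, the function name `poisson_pmf` is made up, and the infinite sum is truncated at k = 59, where the remaining tail is negligible.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return lam**k * exp(-lam) / factorial(k)


lam = 4.0  # hypothetical rate, not taken from the slides
probs = [poisson_pmf(k, lam) for k in range(60)]  # truncated support

total = sum(probs)
mean = sum(k * q for k, q in enumerate(probs))
variance = sum(k**2 * q for k, q in enumerate(probs)) - mean**2
print(round(total, 6), round(mean, 6), round(variance, 6))
```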
The following statements describe what is known as the
Poisson process:
1. The occurrences of the events are independent. The occurrence of an event in an interval of space or time has no effect on the probability of a second occurrence of the event in the same, or any other, interval.
2. Theoretically, an infinite number of occurrences of the event must be possible in the interval.
3. The probability of the single occurrence of the event in a given interval is proportional to the length of the interval.

Example:
In a study of drug-induced anaphylaxis among patients taking
rocuronium bromide as part of their anesthesia, the
occurrence of anaphylaxis followed a Poisson distribution
with λ =12 incidents per year in Norway. Find the probability
that in the next year, among patients receiving rocuronium,
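a. Exactly three will experience anaphylaxis.
b. At least two will experience anaphylaxis.
c. At most two will experience anaphylaxis.

Solution:
With λ = 12, $P(X = k) = \frac{e^{-12}\,12^{k}}{k!}$.
a. $P(X = 3) = \frac{e^{-12}\,12^{3}}{3!} \approx 0.00177$
b. $P(X \ge 2) = 1 - P(X = 0) - P(X = 1) = 1 - 13e^{-12} \approx 0.99992$
c. $P(X \le 2) = e^{-12}(1 + 12 + 72) = 85e^{-12} \approx 0.00052$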
Exercise
In a certain population an average of 13 new cases of
esophageal cancer are diagnosed each year. If the annual
incidence of esophageal cancer follows a Poisson
distribution, find the probability that in a given year the
number of newly diagnosed cases of esophageal cancer will
be:
A. Exactly 10 cases
B. At least three cases
C. No more than 3
D. Between nine and 12, inclusive
E. Fewer than two

CONTINUOUS PROBABILITY DISTRIBUTIONS

• If a random variable is a continuous variable, its probability distribution is called a continuous probability distribution.
• A continuous probability distribution differs from a discrete probability distribution in several ways.
• Under different circumstances, the outcome of a random variable may not be limited to categories or counts.

Normal distribution

The normal distribution refers to a family of continuous
probability distributions described by the normal equation:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$
where
• X is the normal random variable,
• μ is the mean,
• σ is the standard deviation,
• π is approximately 3.14159, and e is approximately 2.71828.

Characteristics of Normal Distribution
• It links frequency distribution to probability distribution
• Has a Bell Shape Curve and is Symmetric
• It is Symmetric around the mean: Two halves of the
curve are the same (mirror images)
• Hence Mean = Median=mode
• The total area under the curve is 1 (or 100%)
• Normal Distribution has the same shape as Standard
Normal Distribution.

Normal Curve
• The graph of the normal distribution depends on two factors:
 the mean and the standard deviation.
• The mean of the distribution determines the location of the center of the
graph, and the standard deviation determines the height and width of the
graph.
• When the standard deviation is large, the curve is short and wide; when
the standard deviation is small, the curve is tall and narrow.
• All normal distributions look like a symmetric, bell-shaped curve.

Standard Normal Distribution
• It makes life a lot easier for us if we standardize our normal
curve, with a mean of zero and a standard deviation of 1
unit.
• We can transform all the observations of any normal random
variable X with mean μ and variance σ² to a new set of
observations of another normal random variable Z with mean
0 and variance 1 using the following transformation:
$Z = \frac{X - \mu}{\sigma}$
• About 95% of the area under the curve falls within 2
standard deviations of the mean
• About 99.7% of the area under the curve falls within 3
standard deviations of the mean
• A graph of this standardized (mean 0 and variance 1) normal
curve is given in Graph:
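
The 95% and 99.7% figures can be reproduced from the standard normal cumulative distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `area_within` is an assumption made for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within_two = area_within(2)
within_three = area_within(3)
print(round(within_two, 4), round(within_three, 4))
```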

Probability and Normal Distributions

• We know that the area under any normal curve is 1 unit


 Therefore, we can link these areas with probability
i.e. if a random variable, x, is normally distributed, the probability that x
will fall in a given interval is the area under the normal curve for that
interval.
 Or P(a < x < b) = area under the curve between a and b.

• There is no probability attached to any single value of x.


• That is, P(x = a) = 0.
Probability and Normal Distribution
• For the solution of problems using the standard normal
distribution, a two-step process is recommended
Step 1: Draw the normal distribution curve and shade the
area.
Step 2: Find the appropriate figure in the Procedure
Table and follow the directions given

Table of normal distribution
• Example 1: Suppose we want to compute the area under the standard normal curve to the left of z = 1.45.
• This area is the probability that Z is less than 1.45.
• The probability can be read from the normal table by combining the row for 1.4 in the first column with the column for 0.05 in the first row.
• The shaded area in the diagram is the area to the left of the point 1.45 standard deviations above the mean.
• The area of this shaded portion is 0.9265 (or 92.65% of the total area under the curve).

Example:
Find the area to the left of z = 2.06
Solution
Step 1: Draw the figure

Step2: We are looking for the area under the standard normal
distribution to the left of z = 2.06, It is 0.9803. Hence, 98.03%
of the area is less than z = 2.06.

Find the area between z = 1.68 and z =-1.37.
Solution
Step 1: Draw the figure as shown.

Step 2: Since the area desired is between two given z values, look up the areas corresponding to the two z values and subtract the smaller area from the larger area. (Do not subtract the z values.) The area for z = 1.68 is 0.9535, and the area for z = -1.37 is 0.0853. The area between the two z values is 0.9535 - 0.0853 = 0.8682 or 86.82%.

Example:
For subject A, a 27-year-old female, the ammonia concentration
in parts per billion (ppb) followed a normal distribution over 30
days with mean 491 and standard deviation 119.What is the
probability that on a random day, the subject’s ammonia
concentration is between 292 and 649 ppb?
Solution:
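We find the z value corresponding to an x of 292 by
$z = \frac{292 - 491}{119} = -1.67$
Similarly, for x = 649 we have
$z = \frac{649 - 491}{119} = 1.33$
From the table, the area to the left of z = -1.67 is 0.0475 and the area to the left of z = 1.33 is 0.9082.
The area desired is the difference between these, 0.9082 - 0.0475 = 0.8607.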
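
The ammonia example can be reproduced in a few lines. This is a sketch using `statistics.NormalDist` from the standard library; it computes the area both with z rounded to two decimals (as when reading a printed table) and without rounding, so the two answers differ slightly in the fourth decimal place.

```python
from statistics import NormalDist

mean, sd = 491, 119     # ammonia concentration (ppb) for subject A
low, high = 292, 649    # interval of interest

# Standardize both endpoints: z = (x - mean) / sd
z_low = (low - mean) / sd
z_high = (high - mean) / sd

Z = NormalDist()  # standard normal distribution
table_style = Z.cdf(round(z_high, 2)) - Z.cdf(round(z_low, 2))
unrounded = Z.cdf(z_high) - Z.cdf(z_low)
print(round(z_low, 2), round(z_high, 2))
print(round(table_style, 4), round(unrounded, 4))
```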
Exercise:
1. For another subject (a 29-year-old male), the acetone levels
were normally distributed with a mean of 870 and a standard
deviation of 211 ppb. Find the probability that on a given day
the subject’s acetone level is:
a. Between 600 and 1000 ppb
b. Over 900 ppb
c. Under 500 ppb
d. Between 900 and 1100 ppb

2. If the total cholesterol values for a certain population are
approximately normally distributed with a mean of 200
mg/100 ml and a standard deviation of 20 mg/100 ml, find the
probability that an individual picked at random from this
population will have a cholesterol value:
a. Between 180 and 200 mg/100 ml
b. Greater than 225 mg/100 ml
c. Less than 150 mg/100 ml
d. Between 190 and 210 mg/100 ml

Student t-distribution
• It is often the case that one wants to calculate the size
of sample needed to obtain a certain level of confidence
in survey results.
• Unfortunately, this calculation requires prior knowledge
of the population standard deviation σ.
• Realistically, σ is unknown
• Often a preliminary sample will be taken so that a
reasonable estimate of this critical population parameter
can be made.
• If such a preliminary sample is not taken, but
confidence intervals for the population mean are to be
constructed using an unknown σ, then the distribution
known as the Student t distribution can be used.
Student’s t-distribution cont…
• Suppose we have a simple random sample of size n
drawn from a Normal population with mean μ and
standard deviation σ. Let us denote the sample mean
by x̄ and the sample standard deviation by s; then the
quantity
$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$
has a t distribution with n-1 degrees of freedom.
• The degrees of freedom are the number of values that are
free to vary after a sample statistic has been computed.

Some properties of the t-distribution:
The t distribution shares some characteristics of the normal
distribution and differs from it in others. The t distribution is
similar to the standard normal distribution in these ways:
1. It is bell-shaped.
2. It is symmetric about the mean.
3. The mean, median, and mode are equal to 0 and are located
at the center of the distribution.
4. It converges to the normal distribution as the sample size gets
large.
5. The curve never touches the x axis.

The t distribution differs from the standard normal distribution in
the following ways:
 The variance is greater than 1.
 The t distribution is actually a family of curves based on
the concept of degrees of freedom, which is related to
sample size.
 As the sample size increases, the t distribution
approaches the standard normal distribution

07/09/2021 234
o u v e ry
a nk y
Th ! !!
M uc h

07/09/2021 235
Estimation and Hypothesis Testing

Prepared By: Department of Epidemiology and Biostatistics, University of Gondar, Ethiopia

Objectives
• After completing this session, you will be able to carry out:
– Parameter estimation
– Point estimation
– Confidence interval estimation
– Hypothesis testing
– Z-test
– T-test
– Analysis of variance
Introduction #1
• Statistical inference is the process of generalizing or
drawing conclusions about the target
population on the basis of results obtained
from a sample.

Introduction #2
• Before beginning statistical analyses, it is essential to examine the distribution of the variable for
– skewness (tails),
– kurtosis (peaked or flat distribution),
– spread (range of the values), and
– outliers (data values separated from the rest of the data).
• Information about each of these characteristics determines the choice of statistical analyses and helps ensure that the results can be accurately explained and interpreted.

Sampling Distribution

• The frequency distribution of a statistic computed from all possible samples of a given size forms the sampling distribution of the sample statistic.

Sampling distribution .......

• Three characteristics of the sampling distribution of a statistic:
– its mean
– its variance
– its shape
• Due to random variation, different samples from the same population will have different sample means.
• If we repeatedly take samples of the same size n from a population, the means of the samples form a sampling distribution of means whose mean is equal to the population mean.
• In practice we do not take repeated samples from a population, i.e. we do not encounter sampling distributions empirically, but it is necessary to know their properties in order to draw statistical inferences.
The Central Limit Theorem

• Regardless of the shape of the frequency distribution of a characteristic in the parent population,
– the means of a large number of samples (independent observations) from the population will follow a normal distribution (the mean of the means approaches the population mean μ, and the standard deviation is σ/√n).
• Inferential statistical techniques have various assumptions that must be met before valid conclusions can be obtained:
– Samples must be randomly selected.
– The sample size must be large (n ≥ 30), or
– the population must be normally or approximately normally distributed if the sample size is less than 30.
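
A small simulation can make the theorem concrete. The sketch below is illustrative only: it assumes a strongly skewed exponential population with mean 1 and standard deviation 1, draws 5000 samples of size n = 30 with a fixed random seed, and compares the spread of the sample means with σ/√n.

```python
import random
from math import sqrt
from statistics import fmean, stdev

random.seed(0)  # fixed seed so the sketch is reproducible

n = 30             # size of each sample
n_samples = 5000   # number of repeated samples

# Exponential population with rate 1: right-skewed, mean 1 and sd 1.
sample_means = [
    fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(n_samples)
]

mean_of_means = fmean(sample_means)   # should be close to the population mean 1
sd_of_means = stdev(sample_means)     # should be close to sigma / sqrt(n)
expected_se = 1 / sqrt(n)
print(round(mean_of_means, 3), round(sd_of_means, 3), round(expected_se, 3))
```

A histogram of `sample_means` would look roughly bell-shaped even though the population itself is far from normal.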
Standard deviation and Standard error

• Standard deviation is a measure of variability between individual observations (a descriptive index of spread around the mean).
• Standard error refers to the variability of summary statistics (e.g. the variability of the sample mean or a sample proportion).
• Standard error is a measure of uncertainty in a sample statistic, i.e. the precision of the estimator.
Parameter Estimations

• In parameter estimation, we generally assume that the underlying (unknown) distribution of the variable of interest is adequately described by one or more (unknown) parameters, referred to as population parameters.
• As it is usually not possible to make measurements on every individual in a population, parameters cannot usually be determined exactly.
• Instead we estimate parameters by calculating the corresponding characteristics from a random sample; these are called estimates.
• Estimation is the process of estimating the value of a parameter from information obtained from a sample.
Estimation

• Estimation is a procedure in which we use the information included in a sample to make inferences about the true parameter of interest.
• An estimator is a sample statistic that is used to estimate the population parameter, while an estimate is one of the possible values that a given estimator can assume.
Properties of a good estimator

Sample statistic                   Corresponding population parameter
x̄ (sample mean)                    μ (population mean)
S² (sample variance)               σ² (population variance)
S (sample standard deviation)      σ (population standard deviation)
p̂ (sample proportion)              P (population proportion)

The desirable properties of a good estimator are the following:
• It should be unbiased: the expected value of the estimator must be equal to the parameter to be estimated.
• It should be consistent: as the sample size increases, the value of the estimator should approach the value of the parameter estimated.
• It should be efficient: the variance of the estimator is the smallest.
• It should be sufficient: the sample from which the estimator is calculated must contain the maximum possible information about the population.
Types of Estimation

There are two types of estimation:
1. Point estimation: It uses the information in the sample to arrive at a single number (that is called an estimate) that is intended to be close to the true value of the parameter.
2. Interval estimation: It uses the information in the sample to end up at an interval (i.e. construct 2 endpoints) that is intended to enclose the true value of the parameter.

Point Estimation

$p = \frac{\sum x}{n}$
Example

•  
Some BLUE estimators

Interval Estimation
• However, the value of the sample statistic will vary from sample to sample; therefore, simply obtaining an estimate of the single value of the parameter is not generally acceptable.
– We also need a measure of how precise our estimate is likely to be.
– We need to take into account the sample to sample variation of the statistic.
• A confidence interval defines an interval within which the true population parameter is likely to fall (interval estimate).
Confidence Intervals…

• A confidence interval therefore takes into account the sample to sample variation of the statistic and gives the measure of precision.
• The general formula used to calculate a confidence interval is Estimate ± K × Standard Error, where K is called the reliability coefficient.
• Confidence intervals express the inherent uncertainty in any medical study by expressing upper and lower bounds for the anticipated true underlying population parameter.
• The confidence level is the probability that the interval estimate will contain the parameter, assuming that a large number of samples are selected and that the estimation process on the same parameter is repeated.
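
The general recipe Estimate ± K × Standard Error can be sketched for a mean as follows. The numbers (sample mean 120, standard deviation 15, n = 100) are hypothetical values chosen for illustration, the function name is made up, and K is taken from the standard normal distribution, which assumes a large sample or a known standard deviation.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sd, n, confidence=0.95):
    """Estimate +/- K * standard error, with K taken from the standard normal."""
    alpha = 1 - confidence
    k = NormalDist().inv_cdf(1 - alpha / 2)  # reliability coefficient
    standard_error = sd / sqrt(n)
    return sample_mean - k * standard_error, sample_mean + k * standard_error


# Hypothetical summary statistics, not taken from the slides.
low, high = mean_confidence_interval(120, 15, 100)
low90, high90 = mean_confidence_interval(120, 15, 100, confidence=0.90)
low99, high99 = mean_confidence_interval(120, 15, 100, confidence=0.99)
print(round(low, 2), round(high, 2))
```

For the same data, the 90% interval comes out narrower and the 99% interval wider than the 95% interval.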
Confidence intervals…

• Most commonly the 95% confidence intervals are calculated; however, 90% and 99% confidence intervals are sometimes used.
• The probability that the interval contains the true population parameter is (1-α)100%.
• If we were to select 100 random samples from the population and calculate a 95% confidence interval for each, approximately 95 of them would include the true population mean μ (and 5 would not).

Confidence interval ……

A (1-α)100% confidence interval for the unknown population mean and population proportion is given as follows:
• Mean: $\left[\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right]$
• Proportion: $\left[p - z_{\alpha/2}\sqrt{p(1-p)/n},\ p + z_{\alpha/2}\sqrt{p(1-p)/n}\right]$

Confidence intervals…

• The 95% confidence interval is calculated in such a way that, under the conditions assumed for the underlying distribution, the interval will contain the true population parameter 95% of the time.
• Loosely speaking, you might interpret a 95% confidence interval as one which you are 95% confident contains the true parameter.
• A 90% CI is narrower than a 95% CI since we are only 90% certain that the interval includes the population parameter.
• On the other hand, a 99% CI will be wider than a 95% CI; the extra width means that we can be more certain that the interval will contain the population parameter. But to obtain a higher confidence from the same sample, we must be willing to accept a larger margin of error (a wider interval).
Confidence intervals…

• For a given confidence level (i.e. 90%, 95%, 99%) the width of the confidence interval depends on the standard error of the estimate, which in turn depends on the:
– 1. Sample size: The larger the sample size, the narrower the confidence interval (that is, the sample statistic will approach the population parameter) and the more precise our estimate. Lack of precision means that in repeated sampling the values of the sample statistic are spread out or scattered. The result of sampling is not repeatable.
Confidence interval for a single mean

$CI = \bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$

Most commonly we compute the 95% confidence interval; however, it is also possible to compute 90% and 99% confidence intervals.
Table 1: Normal distribution
Area between 0 and z

  0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
Confidence interval ……

• If the population standard deviation is unknown and the sample size is small (<30), the formula for the confidence interval for the mean is:
$\bar{x} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$
– x̄ is the sample mean
– s is the sample standard deviation
– n is the sample size
– t is the value from the t-distribution with (n-1) degrees of freedom
The t Distribution
df     t0.100   t0.050   t0.025   t0.010   t0.005
---    ------   ------   ------   ------   ------
1      3.078    6.314    12.706   31.821   63.657
2      1.886    2.920    4.303    6.965    9.925
3      1.638    2.353    3.182    4.541    5.841
4      1.533    2.132    2.776    3.747    4.604
5      1.476    2.015    2.571    3.365    4.032
6      1.440    1.943    2.447    3.143    3.707
7      1.415    1.895    2.365    2.998    3.499
8      1.397    1.860    2.306    2.896    3.355
9      1.383    1.833    2.262    2.821    3.250
10     1.372    1.812    2.228    2.764    3.169
11     1.363    1.796    2.201    2.718    3.106
12     1.356    1.782    2.179    2.681    3.055
13     1.350    1.771    2.160    2.650    3.012
14     1.345    1.761    2.145    2.624    2.977
15     1.341    1.753    2.131    2.602    2.947
16     1.337    1.746    2.120    2.583    2.921
17     1.333    1.740    2.110    2.567    2.898
18     1.330    1.734    2.101    2.552    2.878
19     1.328    1.729    2.093    2.539    2.861
20     1.325    1.725    2.086    2.528    2.845
21     1.323    1.721    2.080    2.518    2.831
22     1.321    1.717    2.074    2.508    2.819
23     1.319    1.714    2.069    2.500    2.807
24     1.318    1.711    2.064    2.492    2.797
25     1.316    1.708    2.060    2.485    2.787
26     1.315    1.706    2.056    2.479    2.779
27     1.314    1.703    2.052    2.473    2.771
28     1.313    1.701    2.048    2.467    2.763
29     1.311    1.699    2.045    2.462    2.756
30     1.310    1.697    2.042    2.457    2.750
40     1.303    1.684    2.021    2.423    2.704
60     1.296    1.671    2.000    2.390    2.660
120    1.289    1.658    1.980    2.358    2.617
∞      1.282    1.645    1.960    2.326    2.576

[Figure: t distribution with df = 10. The area beyond ±1.372 is 0.10 in each tail; the area beyond ±2.228 is 0.025 in each tail.]

Whenever σ is not known (and the population is assumed normal), the correct distribution to use is the t distribution with n-1 degrees of freedom. Note, however, that for large degrees of freedom, the t distribution is approximated well by the Z distribution.
Point and Interval Estimation of the Population Proportion (p)

We will now consider the method for estimating the binomial proportion p of successes, that is, the proportion of elements in a population that have a certain characteristic.
A logical candidate for a point estimate of the population proportion p is the sample proportion $\hat{p} = \frac{x}{n}$, where x is the number of observations in a sample of size n that have the characteristic of interest. As we have seen in the sampling distribution of proportions, the sample proportion is the best point estimate of the population proportion.
Proportion…
• The shape of the sampling distribution is approximately normal provided n is sufficiently large; in this case, nP > 5 and nQ > 5 are the requirements for sufficiently large n (central limit theorem for proportions).
• The point estimate for the population proportion π is given by p̂.
• A (1-α)100% confidence interval estimate for the unknown population proportion π is given by:
$CI = \left[\hat{p} - Z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n},\ \hat{p} + Z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\right]$
• If the sample size is small, i.e. np < 5 and nq < 5, and the population standard deviation for the proportion is not given, then the confidence interval estimation will use the t-distribution instead of z.
Example 1:

• A SRS of 16 apparently healthy subjects yielded the following values of urine excreted (milligram per day):
0.007, 0.03, 0.025, 0.008, 0.03, 0.038, 0.007, 0.005, 0.032, 0.04, 0.009, 0.014,
0.011, 0.022, 0.009, 0.008
• Compute the point estimate of the population mean.
If $x_1, x_2, \dots, x_n$ are n observed values, then
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{0.295}{16} = 0.01844$
• Construct 90%, 95% and 98% confidence intervals for the mean (standard deviation 0.0123, so the standard error is 0.0123/√16 = 0.0123/4):
90%: (0.01844-1.65x0.0123/4, 0.01844+1.65x0.0123/4)=(0.0134, 0.0235)
95%: (0.01844-1.96x0.0123/4, 0.01844+1.96x0.0123/4)=(0.0124, 0.0245)
98%: (0.01844-2.33x0.0123/4, 0.01844+2.33x0.0123/4)=(0.0113, 0.0256)
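
The figures in Example 1 can be recomputed from the raw data. The sketch below follows the example in using z critical values even though n = 16; it takes them from `statistics.NormalDist` rather than a rounded table, so the last digit may differ slightly from the hand calculation. The helper name `z_interval` is an assumption made for this illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

values = [0.007, 0.03, 0.025, 0.008, 0.03, 0.038, 0.007, 0.005,
          0.032, 0.04, 0.009, 0.014, 0.011, 0.022, 0.009, 0.008]

x_bar = mean(values)                  # point estimate of the population mean
s = stdev(values)                     # sample standard deviation (n - 1 divisor)
standard_error = s / sqrt(len(values))


def z_interval(confidence):
    """Confidence interval for the mean using a standard normal critical value."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return x_bar - z * standard_error, x_bar + z * standard_error


intervals = {c: z_interval(c) for c in (0.90, 0.95, 0.98)}
for c, (low, high) in intervals.items():
    print(c, round(low, 4), round(high, 4))
```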
Example 2

The mean diastolic blood pressure for 225 randomly selected individuals is 75 mmHg with a standard deviation of 12.0 mmHg. Construct a 95% confidence interval for the mean.
Solution
n = 225
Mean = 75 mmHg
Standard deviation = 12 mmHg
Confidence level = 95%
The 95% confidence interval for the unknown population mean is given by
95% CI = 75 ± 1.96 × 12/√225 = 75 ± 1.96 × 12/15 = (73.432, 76.568)
Example 2:
A stock market analyst wants to estimate the average return on a certain stock. A random sample of 15 days yields an average (annualized) return of x̄ = 10.37 and a standard deviation of s = 3.5. Assuming a normal population of returns, give a 95% confidence interval for the average return on this stock.

df     t0.100   t0.050   t0.025   t0.010   t0.005
---    ------   ------   ------   ------   ------
13     1.350    1.771    2.160    2.650    3.012
14     1.345    1.761    2.145    2.624    2.977
15     1.341    1.753    2.131    2.602    2.947

The critical value of t for df = (n - 1) = (15 - 1) = 14 and a right-tail area of 0.025 is:
$t_{0.025} = 2.145$
The corresponding confidence interval or interval estimate is:
$\bar{x} \pm t_{0.025}\frac{s}{\sqrt{n}} = 10.37 \pm 2.145\,\frac{3.5}{\sqrt{15}} = 10.37 \pm 1.94 = [8.43,\ 12.31]$
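
The arithmetic of the stock-return interval can be verified directly. Python's standard library has no t quantile function, so this sketch simply copies the critical value 2.145 from the t table (df = 14, right-tail area 0.025).

```python
from math import sqrt

x_bar, s, n = 10.37, 3.5, 15   # sample mean, sample standard deviation, sample size
t_critical = 2.145             # copied from the t table: df = 14, right-tail area 0.025

margin = t_critical * s / sqrt(n)
low, high = x_bar - margin, x_bar + margin
print(round(margin, 2), round(low, 2), round(high, 2))
```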
Example 3:

• In a survey of 300 automobile drivers in one city, 123 reported that they wear seat belts regularly. Estimate the seat belt use rate of the city and a 95% confidence interval for the true population proportion.
• Answer: p = 123/300 = 0.41 = 41%, n = 300.
The 95% confidence interval for the seat belt use rate of the city is
CI = p ± z × √(p(1-p)/n) = 0.41 ± 1.96 × √(0.41 × 0.59/300) = (0.35, 0.47)
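
The seat belt interval can be reproduced with a short function. This is a sketch of the normal-approximation interval used in the example; the function name `proportion_confidence_interval` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation interval: p +/- z * sqrt(p(1 - p) / n)."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


low, high = proportion_confidence_interval(123, 300)
print(round(low, 2), round(high, 2))
```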

Example 4:

In a sample of 400 people who were questioned regarding their participation in sports, 160 said that they did participate. Construct a 98% confidence interval for P, the proportion of people in the population who participate in sports.
Solution:
Let X be the number of people in the sample who participate in sports.
X = 160, n = 400, α = 0.02; hence $Z_{\alpha/2} = Z_{0.01} = 2.33$
$\hat{P} = \frac{X}{n} = \frac{160}{400} = 0.4$ and $\sqrt{\frac{\hat{P}(1-\hat{P})}{n}} = \sqrt{\frac{0.4(0.6)}{400}} = 0.0245$
As a result, an approximate 98% confidence interval for P is given by:
$\left(\hat{P} - Z_{\alpha/2}\sqrt{\frac{\hat{P}(1-\hat{P})}{n}},\ \hat{P} + Z_{\alpha/2}\sqrt{\frac{\hat{P}(1-\hat{P})}{n}}\right) = (0.4 - 2.33 \times 0.0245,\ 0.4 + 2.33 \times 0.0245) = (0.343,\ 0.457)$
Hence, we can be about 98% confident that the true proportion of people in the population who participate in sports is between 34.3% and 45.7%.
HYPOTHESIS TESTING

Introduction
– Researchers are interested in answering many types of questions. For example, a physician might want to know whether a new medication will lower a person’s blood pressure.
– These types of questions can be addressed through statistical hypothesis testing, which is a decision-making process for evaluating claims about a population.
Hypothesis Testing

• The formal process of hypothesis testing provides us with a means of answering research questions.
• A hypothesis is a testable statement that describes the nature of the proposed relationship between two or more variables of interest.
• In hypothesis testing, the researcher must define the population under study, state the particular hypotheses that will be investigated, give the significance level, select a sample from the population, collect the data, perform the calculations required for the statistical test, and reach a conclusion.

Types of Hypotheses

• Null hypothesis (represented by H0) is the statement about the value of the population parameter. That is, the null hypothesis postulates that ‘there is no difference between factor and outcome’ or ‘there is no intervention effect’.
• Alternative hypothesis (represented by HA) states the ‘opposing’ view that ‘there is a difference between factor and outcome’ or ‘there is an intervention effect’.

Methods of hypothesis testing

• Hypotheses are statements about parameters which may or may not be true.
• Examples
– The mean GPA of this class is 3.5.
– The mean height of the Gondar College of Medical Sciences (GCMS) students is 1.63 m.
– There is no difference between the distribution of Pf and Pv malaria in Ethiopia (they are distributed in equal proportions).
Steps in hypothesis testing
1. Identify the null hypothesis H0 and the alternate hypothesis HA.
2. Choose α. The value should be small, usually less than 10%. It is important to consider the consequences of both types of errors.
3. Select the test statistic and determine its value from the sample data. This value is called the observed value of the test statistic. Remember that a t statistic is usually appropriate for a small number of samples; for a larger number of samples, a z statistic can work well if data are normally distributed.
4. Compare the observed value of the statistic to the critical value obtained for the chosen α.
5. Make a decision.
6. State the conclusion.
Test Statistics

• Because of random variation, even an unbiased sample may not accurately represent the population as a whole.
• As a result, it is possible that any observed differences or associations may have occurred by chance.
• A test statistic is a value that we can compare with the known distribution of what we expect when the null hypothesis is true.
• The general formula of the test statistic is:

$$\text{Test statistic} = \frac{\text{Observed value} - \text{Hypothesized value}}{\text{Standard error}}$$

• The known distributions include the normal distribution, Student’s t distribution, the chi-square distribution, etc.
Critical value
• The critical value separates the critical region from the noncritical region
for a given level of significance

Decision making
• Reject or do not reject the null hypothesis
• There are 2 types of errors

Decision            H0 true                     H0 false
Reject H0           Type I error (α)            Correct decision (1 − β)
Do not reject H0    Correct decision (1 − α)    Type II error (β)

• A Type I error is usually regarded as the more serious error; its probability, α, is the level of significance.
• Power is the probability of rejecting a false null hypothesis and is given by 1 − β.
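Types of tests

• One tailed test (left tail): H0: μ = μ0 vs H1: μ < μ0. The rejection region, of area α, lies below the critical value.
• One tailed test (right tail): H0: μ = μ0 vs H1: μ > μ0. The rejection region, of area α, lies above the critical value.
• Two tailed test: H0: μ = μ0 vs H1: μ ≠ μ0. The rejection regions, each of area α/2, lie in both tails beyond the critical values.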
Hypothesis testing about a Population mean (μ)

Two Tailed Test:

The large sample (n ≥ 30) test of hypothesis about a population mean μ is as follows:

1. $H_0: \mu = \mu_0$
   $H_A: \mu \neq \mu_0$

   $z_{cal} = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$

   $z_{tabulated} = z_{\alpha/2}$ for a two tailed test

   Decision: if $|z_{cal}| > z_{tab}$, reject $H_0$; if $|z_{cal}| \le z_{tab}$, do not reject $H_0$.
Steps in hypothesis testing…..

• If the test statistic falls in the critical region: reject H0 in favour of HA.
• If the test statistic does not fall in the critical region: conclude that there is not enough evidence to reject H0.

One tailed tests

2. $H_0: \mu = \mu_0$ (or $\mu \ge \mu_0$)
   $H_A: \mu < \mu_0$

   $z_{cal} = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$, $z_{tabulated} = z_{\alpha}$ for a one tailed test

   Decision: if $z_{cal} < -z_{tab}$, reject $H_0$; if $z_{cal} \ge -z_{tab}$, do not reject $H_0$.

3. $H_0: \mu = \mu_0$ (or $\mu \le \mu_0$)
   $H_A: \mu > \mu_0$

   Decision: if $z_{cal} > z_{tab}$, reject $H_0$; if $z_{cal} \le z_{tab}$, do not reject $H_0$.
The P-Value

• In most applications, the outcome of performing a hypothesis test is a p-value.
• The p-value is a probability calculated under the assumption that the null hypothesis is true; it is not the probability that the null hypothesis itself is true.
• A small p-value implies that the probability of the observed value occurring just by chance is low when the null hypothesis is true.
• That is, a small p-value suggests that there might be sufficient evidence for rejecting the null hypothesis.
• The p-value is defined as the probability of observing the computed test statistic or a more extreme one if H0 is true. For example, P[Z ≥ Zcal | H0 true].

P-value……
• A p-value is the probability of getting the observed difference, or one more extreme, in the sample purely by chance from a population where the true difference is zero.

• If the p-value is greater than 0.05 then, by convention, we conclude that the observed difference could have occurred by chance and there is no statistically significant evidence (at the 5% level) for a difference between the groups in the population.
How to calculate P-value
o Use statistical software such as SPSS or SAS
o Hand calculation
– obtain the test statistic (z-calculated or t-calculated)
– find the area between 0 and the test statistic from the standard normal table
– subtract this area from 0.5
– the result is the one tailed P-value
Note: if the test is two tailed, multiply the result by 2.
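
The hand calculation above can be sketched in code. This is a minimal illustration, not part of the original slides: the function names `normal_cdf` and `p_value_from_z` are hypothetical, and the standard normal table is replaced by the error function from Python's standard library.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_value_from_z(z_cal: float, two_tailed: bool = True) -> float:
    """Tail area beyond |z_cal|; doubled when the test is two tailed."""
    tail = 1.0 - normal_cdf(abs(z_cal))  # same as 0.5 minus the table area from 0 to |z|
    return 2.0 * tail if two_tailed else tail

# The familiar critical values should give p-values close to 0.05
print(p_value_from_z(1.96))                     # two tailed
print(p_value_from_z(1.645, two_tailed=False))  # one tailed
```

The one tailed branch returns the area beyond |z|, so it assumes the observed statistic lies on the side named by the alternative hypothesis.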
P-value and confidence interval
• Confidence intervals and p-values are based upon the same theory and mathematics and will lead to the same conclusion about whether a population difference exists.

• Confidence intervals are preferable because they give information about the size of any difference in the population, and they also (very usefully) indicate the amount of uncertainty remaining about the size of the difference.

• When the null hypothesis is rejected in a hypothesis-testing situation, the confidence interval for the mean using the same level of significance will not contain the hypothesized mean.
The P-Value …..

• But for what values of the p-value should we reject the null hypothesis?
– By convention, a p-value of 0.05 or smaller is considered sufficient evidence for rejecting the null hypothesis.
– By using a cut-off of 0.05, we are allowing a 5% chance of wrongly rejecting the null hypothesis when it is in fact true.

• When the p-value is less than 0.05, we often say that the result is statistically significant.

Hypothesis testing for single population mean

EXAMPLE 5: A researcher finds that the mean IQ of a sample of 16 students is 110, while the expected value for the whole population is 100 with a standard deviation of 10. Test the hypothesis.
• Solution
1. H0: µ = 100 vs HA: µ ≠ 100
2. Assume α = 0.05
3. Test statistic: z = (110 − 100)/(10/√16) = (110 − 100) × 4/10 = 4
4. The critical value is z(α/2) = z(0.025) = 1.96.
5. Decision: reject the null hypothesis since 4 ≥ 1.96
6. Conclusion: the mean IQ of the population is different from 100 at the 5% level of significance.
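
As a check on Example 5, the same steps can be written as a short function. This is only a sketch: the name `one_sample_z` is hypothetical, and the default critical value 1.96 assumes a two tailed test at α = 0.05.

```python
import math

def one_sample_z(x_bar, mu0, sigma, n, z_tab=1.96):
    """Return (z_cal, reject_H0) for a two tailed z test of a single mean."""
    z_cal = (x_bar - mu0) / (sigma / math.sqrt(n))
    return z_cal, abs(z_cal) > z_tab

# Figures from Example 5: sample mean 110, hypothesized mean 100, sigma 10, n 16
z_cal, reject = one_sample_z(x_bar=110, mu0=100, sigma=10, n=16)
print(z_cal, reject)
```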
Example 6:

Suppose that the hypothesized population mean is 3.1 and that a sample of n = 20 people gives $\bar{x} = 4.5$ and $s = 5.5$.
1. $H_0: \mu = 3.1$
   $H_A: \mu \neq 3.1$
2. α = 0.05 (95% CI)
3. $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}} = \dfrac{4.5 - 3.1}{5.5/\sqrt{20}} = 1.14$, and the tabulated value is $t_{0.05,19} = 2.09$ (two tailed 5% point, 19 degrees of freedom)
4. The observed value of the test statistic falls within the range of the critical values (−2.09, 2.09).
5. We do not reject H0 and conclude that there is not enough evidence to reject the null hypothesis.
Cont….

A 95% confidence interval for the mean is

$\bar{x} \pm t_{0.05,19}\, s/\sqrt{n} = 4.5 \pm 2.09\,(5.5/\sqrt{20}) = (1.93, 7.07)$

Note that this interval includes the hypothesized value of 3.1.

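
The test and the interval in Example 6 use the same standard error, which the following sketch makes explicit. The critical value 2.09 is simply copied from the slide's t table (it is not computed here), and the variable names are illustrative.

```python
import math

x_bar, mu0, s, n = 4.5, 3.1, 5.5, 20
t_tab = 2.09  # tabulated two tailed 5% point for 19 df, taken from the slide

se = s / math.sqrt(n)                           # standard error of the mean
t_cal = (x_bar - mu0) / se                      # observed t statistic
ci = (x_bar - t_tab * se, x_bar + t_tab * se)   # 95% confidence interval
reject = abs(t_cal) > t_tab

print(round(t_cal, 2), tuple(round(v, 2) for v in ci), reject)
```

The interval contains 3.1 exactly when the test fails to reject H0, which is the agreement between confidence intervals and tests described earlier.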
Hypothesis testing for single proportions

Example 7: In a study of childhood abuse in psychiatric patients, Brown found that 166 in a sample of 947 patients reported histories of physical or sexual abuse.
a) Construct a 95% confidence interval for the population proportion.
b) Test the hypothesis that the true population proportion is 30%.
• Solution (a)
– The sample proportion is p = 166/947 = 0.175, and the 95% CI for P is given by

$p \pm z_{\alpha/2}\sqrt{\dfrac{p(1-p)}{n}} = 0.175 \pm 1.96\sqrt{\dfrac{0.175 \times 0.825}{947}} = 0.175 \pm 1.96 \times 0.0124 \approx [0.151;\ 0.20]$
Example……

• To test the hypothesis, we follow these steps:

Step 1: State the hypotheses
H0: P = P0 = 0.3
HA: P ≠ P0 (P ≠ 0.3)
Step 2: Fix the level of significance (α = 0.05)
Step 3: Compute the calculated and tabulated values of the test statistic

$z_{cal} = \dfrac{p - P_0}{\sqrt{P_0(1-P_0)/n}} = \dfrac{0.175 - 0.3}{\sqrt{0.3(0.7)/947}} = \dfrac{-0.125}{0.0149} = -8.39$

$z_{tab} = 1.96$
Example……
• Step 4: Compare the calculated and tabulated values of the test statistic
• Step 5: Decision. Since the absolute value of the calculated statistic (8.39) is larger than the tabulated value (1.96), we reject the null hypothesis.
• Step 6: Conclusion
• Hence we conclude that the proportion of childhood abuse in psychiatric patients is different from 0.3.

• If the sample size is small (np < 5 or n(1 − p) < 5), the normal approximation is unreliable and an exact (binomial) method should be used instead.

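
Both parts of Example 7 can be reproduced with a few lines of code. This is a sketch with illustrative variable names; note that the confidence interval uses the sample proportion in the standard error, while the test uses the hypothesized proportion P0. Because the code keeps p = 166/947 unrounded, the statistic comes out slightly different from the slide's −8.39.

```python
import math

x, n = 166, 947      # patients reporting abuse, sample size
p0 = 0.30            # hypothesized proportion
z_tab = 1.96         # two tailed critical value for alpha = 0.05

p_hat = x / n

# (a) 95% confidence interval, standard error based on the sample proportion
se_ci = math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - z_tab * se_ci, p_hat + z_tab * se_ci)

# (b) test of H0: P = 0.3, standard error based on P0
se_h0 = math.sqrt(p0 * (1 - p0) / n)
z_cal = (p_hat - p0) / se_h0
reject = abs(z_cal) > z_tab

print(round(p_hat, 3), ci, z_cal, reject)
```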
Statistical Inference Based on
Two Samples

Statistical Inferences Based on Two Samples

Comparing Two Population Means;


 Independent Samples: Variances Known
 Independent Samples: Variances Unknown

Paired Difference Experiments


 Paired/matched/repeated sampling

Comparing Two Population Proportions


 Large, Independent Samples case
Comparing Two Population Means;
Case-1: Independent Samples, Variances Known

•  
Comparing Two Population Means;
Independent Samples, Vars Known cont’d…

$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$
Comparing Two Population Means;
Ind’t Samples, Vars Known cont’d…

• A (1 – α)100% confidence interval for the difference in population means µ1 – µ2 is:

$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$

• In testing a hypothesis, the z value can then be calculated as:

$z = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$, where $D_0 = (\mu_1 - \mu_2)_0$
Hypothesis testing for two sample means

• The steps to test a hypothesis about the difference of two means are the same as for a single mean.
Step 1: State the hypotheses
H0: µ1 − µ2 = 0
VS
HA: µ1 − µ2 ≠ 0, or HA: µ1 − µ2 < 0, or HA: µ1 − µ2 > 0
Step 2: Significance level (α)
Step 3: Test statistic

$z_{cal} = \dfrac{(\bar{x} - \bar{y}) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$
Hypothesis …

$z_{tabulated} = z_{\alpha/2}$ for a two tailed test
$z_{tabulated} = z_{\alpha}$ for a one tailed test

For $H_A: \mu_1 - \mu_2 \neq 0$: if $|z_{cal}| > z_{tab}$, reject $H_0$; otherwise do not reject $H_0$.
For $H_A: \mu_1 - \mu_2 < 0$: if $z_{cal} < -z_{tab}$, reject $H_0$; otherwise do not reject $H_0$.
For $H_A: \mu_1 - \mu_2 > 0$: if $z_{cal} > z_{tab}$, reject $H_0$; otherwise do not reject $H_0$.

Example 8:

• Researchers wish to know if the data they have collected provide sufficient evidence to indicate a difference in mean serum uric acid levels between normal individuals and individuals with Down’s syndrome. The data consist of serum uric acid readings on 12 individuals with Down’s syndrome and 15 normal individuals. The means are 4.5 mg/100ml and 3.4 mg/100ml, with standard deviations of 2.9 and 3.5 mg/100ml respectively.

$H_0: \mu_1 - \mu_2 = 0$
$H_A: \mu_1 - \mu_2 \neq 0$

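SOLUTION

$z_{cal} = \dfrac{(\bar{x} - \bar{y}) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \dfrac{(4.5 - 3.4) - 0}{\sqrt{\dfrac{2.9^2}{12} + \dfrac{3.5^2}{15}}} = \dfrac{1.1}{\sqrt{1.5175}} = \dfrac{1.1}{1.23} = 0.89$

$z_{\alpha/2} = z_{0.025} = 1.96$

Since $|z_{cal}| = 0.89 < 1.96$, we do not reject $H_0$: the data do not provide sufficient evidence of a difference in mean serum uric acid levels between the two groups.
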
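
The arithmetic of Example 8 can be verified with a short script. This is a sketch that treats the quoted standard deviations as known population values, as the z formula requires; the variable names are illustrative.

```python
import math

# Down's syndrome group, then normal group (mean, SD, sample size)
x1, s1, n1 = 4.5, 2.9, 12
x2, s2, n2 = 3.4, 3.5, 15
z_tab = 1.96  # two tailed, alpha = 0.05

se = math.sqrt(s1**2 / n1 + s2**2 / n2)   # standard error of the difference
z_cal = ((x1 - x2) - 0) / se              # hypothesized difference is 0
reject = abs(z_cal) > z_tab

print(round(se, 2), round(z_cal, 2), reject)
```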
Comparing Two Population Means;
case-2: Independent Samples, Variances Unknown

•  
Comparing Two Population Means;
Ind’t Samples, Vars Unknown cont’d…

A. Assume that the unknown variances are equal: σ1² = σ2² = σ²

• The pooled estimate of σ² is the weighted average of the two sample variances, s1² and s2².
• The pooled estimate of σ² is denoted by sp²:

$s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

• The estimate of the standard deviation of the sampling distribution is:

$\hat{\sigma}_{\bar{x}_1 - \bar{x}_2} = \sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$
Comparing Two Population Means;
Ind’t Samples, Vars Unknown cont’d…

• The sampling distribution of $\bar{x}_1 - \bar{x}_2$ is in this case approximately normal when both n1 and n2 are large (> 30), irrespective of the distribution of the population. So, in this case, the CI and z-cal are given as follows.
• A (1 – α)100% CI for µ1 – µ2 is:

$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$

• The calculated value of z will be:

$z = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$, where $D_0 = (\mu_1 - \mu_2)_0$

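Comparing Two Population Means;
Ind’t Samples, Vars Unknown cont’d…

• When the samples are small, the t distribution with n1 + n2 – 2 degrees of freedom is used:

$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\,(n_1 + n_2 - 2)}\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$

$t = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$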
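Comparing Two Population Means;
Ind’t Samples, Vars Unknown cont’d…

B. Assume that the unknown variances are not equal: σ1² ≠ σ2²

• For large samples:

$z = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$

• For small samples:

$t = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$, with $df = \dfrac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$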
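
The degrees-of-freedom formula for unequal variances is awkward to evaluate by hand, so a small helper can be useful. This is a sketch: the function name `unequal_variance_df` is hypothetical, and the figures passed in are the Example 8 summary statistics, reused here purely for illustration.

```python
def unequal_variance_df(s1, n1, s2, n2):
    """Approximate degrees of freedom for the two sample t test with unequal variances."""
    v1 = s1**2 / n1   # estimated variance of the first sample mean
    v2 = s2**2 / n2   # estimated variance of the second sample mean
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Summary statistics reused from Example 8 (SDs 2.9 and 3.5, n = 12 and 15)
df = unequal_variance_df(2.9, 12, 3.5, 15)
print(df)
```

The result is usually not a whole number; a common convention is to round it down before looking up the t table.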
Comparing Two Population Means;
Case-3: Paired/matched/repeated sampling

• Arises from two different processes on the same study units (e.g. “before” and “after” treatments) or two different processes on paired/matched study units (e.g. pair-matched case-control studies).
• Use of the same/matched individuals eliminates any differences in the individuals themselves (confounding factors).
• Inference concerning the difference between two population means is similar to inference for one population mean, except that we work with the paired differences di here.
Paired sampling cont’d…

•  
Paired sampling cont’d…

• If the population of differences is normally distributed with mean µd,
• a (1 – α)100% confidence interval for µd = µ1 – µ2 is:

$\bar{d} \pm t_{\alpha/2}\,\dfrac{s_d}{\sqrt{n}}$

• where, for a sample of size n, $t_{\alpha/2}$ is based on n – 1 degrees of freedom.

• A z-test can be used instead if the sample size is large (n1 = n2 = n > 30).
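
A minimal sketch of the paired calculation follows. The before/after readings are hypothetical values invented for illustration (they do not come from the slides), and the critical value 2.365 is the usual two tailed 5% t value for 7 degrees of freedom.

```python
import math
from statistics import mean, stdev

# Hypothetical before/after measurements on the same 8 subjects
before = [140, 152, 138, 147, 160, 155, 149, 151]
after = [135, 150, 136, 140, 152, 151, 147, 146]

d = [b - a for b, a in zip(before, after)]   # paired differences
n = len(d)
d_bar = mean(d)                              # mean difference
s_d = stdev(d)                               # sample SD of the differences
se = s_d / math.sqrt(n)

t_tab = 2.365                                # t(alpha/2 = 0.025, df = n - 1 = 7)
t_cal = d_bar / se                           # test of H0: mu_d = 0
ci = (d_bar - t_tab * se, d_bar + t_tab * se)

print(d_bar, round(t_cal, 2), ci)
```

Only the differences enter the calculation, which is why the paired analysis reduces to a one sample problem.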
Hypothesis testing for two proportions

• Suppose that n1 and n2 are large enough so that:
– n1·p1 ≥ 5, n1·(1 – p1) ≥ 5, n2·p2 ≥ 5, and n2·(1 – p2) ≥ 5

• Then the population of all possible values of p̂1 – p̂2:
– Has approximately a normal distribution
– Has mean $\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2$
– Has standard deviation $\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\dfrac{p_1(1 - p_1)}{n_1} + \dfrac{p_2(1 - p_2)}{n_2}}$
Hypothesis testing for two proportions

• A (1 – α)100% confidence interval for p1 – p2 is:

$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$

• The test statistic is:

$z = \dfrac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sigma_{\hat{p}_1 - \hat{p}_2}}$, where $D_0 = (p_1 - p_2)_0$
Hypothesis testing for two proportions

• To test the hypothesis
H0: π1 − π2 = 0
VS
HA: π1 − π2 ≠ 0
the test statistic is given by

$z_{cal} = \dfrac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\dfrac{p_1(1 - p_1)}{n_1} + \dfrac{p_2(1 - p_2)}{n_2}}}$
Small sample size

•  

Comparing Two Population Proportions cont’d…

Example 10: A study was conducted to look at the effects of oral contraceptives (OC) on heart disease in women 40–44 years of age. It is found that among n1 = 500 current OC users, 13 develop a myocardial infarction (MI) over a three-year period, while among n2 = 1000 non-OC users, seven develop an MI over a three-year period. Then:

A. Construct a 95% confidence interval for the difference of MI rates between OC users and non-users.
B. Can you conclude that the rate of MI is significantly greater among OC users? (Report the P-value for your test)
Comparing Two Population Proportions cont’d…

• Solution: The confidence interval for the difference of population proportions is formed using the following formula (for a 95% confidence interval):

A. With $\hat{p}_1 = 13/500 = 0.026$ and $\hat{p}_2 = 7/1000 = 0.007$:

$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} = 0.019 \pm 1.96 \times 0.0076$

• The 95% CI for the difference is (0.004, 0.034).
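
Both parts of Example 10 can be worked through in code. This is a sketch, not the slides' own solution to part B: it applies the unpooled test statistic given earlier for two proportions, takes the alternative as one tailed (rate greater among OC users), and uses a hypothetical helper `normal_cdf` in place of the normal table.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

x1, n1 = 13, 500     # MI cases among current OC users
x2, n2 = 7, 1000     # MI cases among non-users

p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# A. 95% confidence interval for p1 - p2
ci = (diff - 1.96 * se, diff + 1.96 * se)

# B. one tailed test of H0: p1 - p2 = 0 against HA: p1 - p2 > 0
z_cal = (diff - 0) / se
p_value = 1.0 - normal_cdf(z_cal)

print(ci, z_cal, p_value)
```

A p-value below 0.05 here would support the conclusion that the MI rate is higher among OC users at the 5% level.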
ANOVA

Introduction
• In the two independent sample t-test, we have one continuous dependent variable (interval/ratio data) and
• one nominal or ordinal independent variable with only two categories.
• What if the independent variable has more than two categories? For example:
– Are the birth weights of children in different geographical regions the same?
– Are the responses of patients to different medications and placebo different?
– Do people in different age groups have different proportions of body fat?
– Do people of different ethnicities have the same BMI?

One way-Analysis Of Variance

• All the above research questions have one common characteristic: each of them has two variables, one categorical and one quantitative.
• Main question: Are the averages of the quantitative variable across the groups (categories) the same?
• The name one way ANOVA comes from the fact that there is only one categorical independent variable, which has two or more categories (groups).
One-way ANOVA cont…

• Also called Completely Randomized Design
• Experimental units (subjects) are assigned randomly to treatments/groups. Here subjects are assumed to be homogeneous.

Analysis of variance cont…

• One way ANOVA is a method for testing the hypothesis that there is no difference between two or more population means (usually at least three), or that there is no difference between a number of treatments.
• More formally, we can state the hypotheses as:

H0: There is no difference among the means of the treatment effects
HA: There is a difference between at least two treatment effects
or
H0: µ1 = µ2 = µ3 = … = µa (if there are ‘a’ groups)
HA: at least one group mean is different
Why Not Just Use t-tests?
• Since a t-test considers two groups at a time, it becomes tedious when many groups are present.
• Conducting multiple t-tests can also lead to severe inflation of the Type I error rate (false positives) and is not recommended.
• ANOVA, in contrast, tests for differences among several means without increasing the Type I error rate.
• ANOVA uses data from all groups at a time to estimate standard errors, which can increase the power of the analysis.
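
The inflation of the Type I error rate can be illustrated numerically. The sketch below assumes, for simplicity, that the pairwise t-tests are independent (which is only an approximation when the same groups are reused); the function name `familywise_error` is hypothetical.

```python
def familywise_error(alpha: float, k: int) -> float:
    """Probability of at least one false positive in k independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** k

for groups in (3, 4, 5):
    k = groups * (groups - 1) // 2   # number of pairwise comparisons
    print(groups, k, round(familywise_error(0.05, k), 3))
```

Under this assumption, even three groups (three pairwise tests) push the chance of at least one false positive well above the nominal 5%.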
Assumptions of One Way ANOVA

• The data are normally distributed, or the samples have come from normally distributed populations, and are independent.
• The variance is the same in each group to be compared (equal variance).
• Moderate departures from normality may be safely ignored, but the effect of unequal standard deviations may be serious.
• In the latter case, transforming the data may be useful.

Analysis of variance cont…
• We test the equality of means among groups by using the variance.
• Comparing the variation between groups with the variation within groups helps us to compare the means.
• If the two are roughly equal, the observed differences among the means are likely due to chance rather than to a real difference.
Note that:
Total Variability = Variability between + Variability within
Analysis of variance
Basic model (the linear model): data are deviations from the global mean, μ:
$X_{ij} = \mu + \varepsilon_{ij}$
• The sum of the squared vertical deviations is the total sum of squares, SSt.

One way model: data are deviations from the treatment means, Ai:
$X_{ij} = \mu + A_i + \varepsilon_{ij}$
• The sum of the squared vertical deviations is SSe.
• Note that ΣAi = ΣƐij = 0

[Figure: observations in groups G-1 and G-2 plotted around the global mean μ and around the treatment means A1 and A2]

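Decomposing the total variability

• Total SS: $SST = \sum_{i=1}^{n}\sum_{j=1}^{a}(x_{ij} - \bar{x}_{..})^2 = \sum_i\sum_j x_{ij}^2 - \dfrac{\left(\sum_i\sum_j x_{ij}\right)^2}{na}$

• Within SS: $SSW = \sum_{i=1}^{n}\sum_{j=1}^{a}(x_{ij} - \bar{x}_{.j})^2 = \sum_i\sum_j x_{ij}^2 - \dfrac{\sum_j\left(\sum_i x_{ij}\right)^2}{n}$

• Between SS: $SSB = \sum_{i=1}^{n}\sum_{j=1}^{a}(\bar{x}_{.j} - \bar{x}_{..})^2 = \dfrac{\sum_j\left(\sum_i x_{ij}\right)^2}{n} - \dfrac{\left(\sum_i\sum_j x_{ij}\right)^2}{na}$

This is assuming each of the ‘a’ groups has equal size, ‘n’.

SST = SSW + SSB
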
Data of one way ANOVA

                       Groups/variable
Participants   G-1    G-2    G-3    …..   G-a
1              X11    X12    X13    …..   X1a
2              X21    X22    X23    …..   X2a
3              X31    X32    X33    …..   X3a
.              .      .      .      .     .
.              .      .      .      .     .
n              Xn1    Xn2    Xn3    ….    Xna
Totals         T.1    T.2    T.3    ….    T.a

Computational formula
$T = \sum_i\sum_j x_{ij}^2$; Correction Factor $= CF = \dfrac{\left(\sum_i\sum_j x_{ij}\right)^2}{na} = \dfrac{T_{..}^2}{na}$
$A = \dfrac{\sum_j\left(\sum_i x_{ij}\right)^2}{n} = \dfrac{\sum_j T_{.j}^2}{n}$ if the groups’ (cells’) sizes are equal, or
$A = \sum_j \dfrac{\left(\sum_i x_{ij}\right)^2}{n_j} = \sum_j \dfrac{T_{.j}^2}{n_j}$ if the group sizes are unequal

Where Xij = ith observation in the jth group of the table,
i = 1, 2, 3,…, nj; j = 1, 2, 3,…, a; Σj nj = N
Sum of squares and ANOVA Table

Source of variation   df        SS              MS              F
Between groups        a − 1     SSB = A − CF    SSB/(a − 1)     MSB/MSW
Within groups         na − a    SSW = T − A     SSW/(na − a)
Total                 na − 1    SST = T − CF

• If there are real differences among the groups’ means, the between-groups variation will be larger than the within-groups variation.
Example on one-way ANOVA
The following table shows the red cell folate levels (μg/l) in three groups of cardiac bypass patients who were given three different levels of nitrous oxide ventilation. (Level of nitrous oxide for group I > group II’s > group III’s)

         Group I    Group II   Group III
         (n=8)      (n=9)      (n=5)
         243        206        241
         251        210        258
         275        226        270
         291        249        293
         347        255        328
         354        273
         380        285
         392        295
                    309
Total    2533       2308       1390
Mean     316.6      256.4      278.0
SD       58.7       37.1       33.8
Example Cont….
We can see the box plot just to have some
impression about it
Example cont…
H0: μ1 = μ2 = μ3
HA: Differences exist between at least two of the means

Source of variation   df    SS       Mean square   F      P
Between groups        2     15516    7758          3.71   0.044
Within groups         19    39716    2090
Total                 21    55232

Since the P-value is less than 0.05, the null hypothesis is rejected.
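
The ANOVA table above can be reproduced from the raw folate data with the computational formulas (T, CF and A). This is a minimal sketch using only the Python standard library; the P-value is not computed here because that requires the F distribution, which would come from a table or statistical software.

```python
# Red cell folate levels (μg/l) from the example table
groups = {
    "I": [243, 251, 275, 291, 347, 354, 380, 392],
    "II": [206, 210, 226, 249, 255, 273, 285, 295, 309],
    "III": [241, 258, 270, 293, 328],
}

a = len(groups)                                   # number of groups
N = sum(len(g) for g in groups.values())          # total sample size

T = sum(x * x for g in groups.values() for x in g)                # sum of squared observations
CF = sum(sum(g) for g in groups.values()) ** 2 / N                # correction factor
A = sum(sum(g) ** 2 / len(g) for g in groups.values())            # unequal group sizes

ssb, ssw, sst = A - CF, T - A, T - CF
msb, msw = ssb / (a - 1), ssw / (N - a)
F = msb / msw

print(round(ssb), round(ssw), round(sst), round(msb), round(msw), round(F, 2))
```

The residual standard deviation used later for pairwise comparisons is the square root of the within-groups mean square.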
Pair-wise comparisons of group means: post hoc tests or multiple comparisons

• The ANOVA test tells us only whether there is a statistically significant difference among the group means, but
• it doesn’t tell us which particular groups are significantly different.
• To identify them, we use either a priori (pre-planned) or post hoc tests.
Pair-wise comparisons of group means (post hoc tests) cont…

• Whether to use a priori or post hoc tests depends on whether the researcher has previously stated the hypotheses to test.
• If you have honestly stated beforehand the comparisons between individual pairs of means which you intend to make, then you are entitled to use an a priori test such as a t-test.
• In this case, only one pair of groups, or a few, will be tested.
• However, when you look at the data it may seem worth comparing all possible pairs. In this case, a post hoc test such as the Scheffé, Bonferroni (modified t-test), Tukey, or Least Significant Difference (LSD) method will be employed.
Bonferroni method or Modified t-test (Steps)

I. Find tcalc for the pairs of groups of interest (to be compared).

II. The modified t-test is based on the pooled estimate of variance from all the groups (which is the residual variance in the ANOVA table), not just from the pair being considered.

III. If we perform k paired comparisons, then we should multiply the P value obtained from each test by k; that is, we calculate P' = kP with the restriction that P' cannot exceed 1.
Where k = a(a − 1)/2, that is, the number of possible pairwise comparisons among a groups
Bonferroni method or Modified t-test
• Returning to the red cell folate data given above, the residual standard deviation is √2090 = 45.72.

(a) Comparing groups I and II
t = (316.6 − 256.4) / (45.72 × √(1/8 + 1/9))
= 60.2/22.22
= 2.71 on 19 degrees of freedom.

• The corresponding P value = 0.014 and the corrected P value is P' = 0.014 × 3 = 0.042.
Groups I and II are significantly different.
Bonferroni method or Modified t-test

(b) Comparing groups I and III
• t = (316.6 − 278.0) / (45.72 × √(1/8 + 1/5))
= 38.6/26.06
= 1.48 on 19 degrees of freedom.

• The corresponding P value = 0.1625 and
• the corrected P value is P' = 0.1625 × 3 = 0.4875.
Groups I and III are not significantly different.
Bonferroni method or Modified t-test

(c) Comparing groups II and III
• t = (278.0 − 256.4) / (45.72 × √(1/5 + 1/9))
= 21.6/25.5
= 0.85 on 19 degrees of freedom.

• The corresponding P value = 0.425 and the corrected P value is P' = 1.00 (since 3 × 0.425 exceeds 1).

Groups II and III are not significantly different.

Therefore, the main explanation for the difference between the groups that was identified in the ANOVA is the difference between groups I and II.
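
The three modified t statistics and the Bonferroni corrections can be tabulated together. In this sketch the unadjusted P values are copied from the slides rather than computed (that would need the t distribution), and the dictionary names are illustrative.

```python
import math
from itertools import combinations

means = {"I": 316.6, "II": 256.4, "III": 278.0}
sizes = {"I": 8, "II": 9, "III": 5}
s_resid = 45.72   # residual SD from the ANOVA table (sqrt of 2090)

# Unadjusted P values as quoted on the slides (not computed here)
p_raw = {("I", "II"): 0.014, ("I", "III"): 0.1625, ("II", "III"): 0.425}
k = len(p_raw)    # number of pairwise comparisons, a(a - 1)/2 = 3

results = {}
for g1, g2 in combinations(means, 2):
    se = s_resid * math.sqrt(1 / sizes[g1] + 1 / sizes[g2])
    t = abs(means[g1] - means[g2]) / se
    p_adj = min(1.0, k * p_raw[(g1, g2)])   # Bonferroni: P' = kP, capped at 1
    results[(g1, g2)] = (t, p_adj)
    print(g1, g2, round(t, 2), round(p_adj, 4))
```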
Which post hoc method shall I use?
• The post hoc tests differ from one another in how they calculate the p value for the mean difference between groups.
• Least Significant Difference (LSD) is the most liberal of the post hoc tests and has a high Type I error rate. It is simply multiple t-tests.
• The Scheffé test uses the F distribution rather than the t distribution of the LSD tests and is considered more conservative.
• It has a high Type II error rate but is considered appropriate when there are a large number of groups to be compared.
Which post hoc method shall I use? cont…

• The Bonferroni approach uses a series of t tests (that is, the LSD technique) but corrects the significance level for multiple testing by dividing the significance level by the number of tests being performed.
• Since this test corrects for the number of comparisons being performed, it is generally used when the number of groups to be compared is small.
Which post hoc method shall I use? Cont..

• Tukey’s Honestly Significant Difference (Tukey’s HSD) test also corrects for multiple comparisons, but it considers the power of the study to detect differences between groups rather than just the number of tests being carried out.
• That is, it takes into account sample size as well as the number of tests being performed.
• This makes it preferable when there are a large number of groups being compared, since it reduces the chances of a Type I error occurring.
ANOVA - a recapitulation.

• ANOVA is a parametric test, examining whether the means differ between 2 or more populations.
• It generates a test statistic F, which can be thought of as a signal-to-noise ratio. Thus large values of F indicate a high degree of pattern within the data and imply rejection of H0.
• It is thus similar to the t test; in fact ANOVA on 2 groups is equivalent to a t test [F = t²].

One way ANOVA’s limitations

• This technique is only applicable when there is one treatment (factor) used.
• Note that this single treatment can have 3, 4, … or many levels.
• Thus a nutrition trial on children’s weight gain with 4 different feeding styles could be analyzed this way, but a trial of BOTH nutrition and mothers’ health status could not.
Thank You!

Sampling Techniques & sample size determination
Prepared By: Department of Epidemiology and Biostatistics
University of Gondar, Ethiopia
December, 2018

Sampling
It is not easy to collect information about an entire population, and it is often not possible to study the characteristics of the entire population (finite or infinite) because of time, cost and other constraints.
Thus we need a sample.
A sample is a finite subset of the individuals in a population, and the number of individuals in a sample is called the sample size.
[Figure: Population, Sample, Information]
Common terms used in sampling
• Population: the collection of all items of interest.
• Sampling: the method by which we select a sample from the population.
• Reference population (or target population): the population of interest to whom the researchers would like to make generalizations.



Advantages of sampling:

• Feasibility: Sampling may be the only feasible method of collecting information.
• Reduced cost: Sampling reduces demands on resources such as finance, personnel, and material.
• Greater accuracy: Sampling may lead to more accurate data collection.
• Sampling error: Precise allowance can be made for sampling error.
• Greater speed: Data can be collected and summarized more quickly.
Disadvantages
• There is always a sampling error.
• Sampling may create a feeling of discrimination within the population.
• Sampling may be inadvisable where every unit in the population is legally required to have a record.
Errors in sampling
1. Sampling error/ Random error
A sample is expected to mirror the population from which it comes; however, there is no guarantee that any sample will be precisely representative of the population.
The uncertainty associated with an estimate that is based on data gathered from a sample of the population rather than the full population is known as sampling error.
Sampling errors are the random variations in the sample estimates around the true population parameters.
Errors in sampling cont’d…
No sample is the exact mirror image of the population.
Sampling error (chance) cannot be avoided or totally eliminated.
Sampling error decreases as the size of the sample increases, and it is of smaller magnitude in a homogeneous population.
When n = N ⇒ sampling error = 0

Errors in sampling cont’d…
2. Non Sampling Error (Measurement Error)
It is a type of systematic error in the design or conduct of a sampling procedure which results in distortion of the sample, so that it is no longer representative of the reference population.
We can eliminate or reduce the non-sampling error (bias) by careful design of the sampling procedure, not by increasing the sample size.
It can occur whether the total study population or a sample is being used.

Sampling Methods
Two broad divisions:
A. Probability sampling methods
B. Non-probability sampling methods
A. Probability sampling methods
• Involves random selection of a sample
• Every sampling unit has a known and non-zero probability of selection into the sample.
• Involves the selection of a sample from a population, based on chance.
Types of Sampling Methods

• Probability samples: Simple random, Stratified, Systematic, Cluster, Multistage random sampling
• Non-probability samples: Snowball, Purposive, Judgemental, Convenience, Quota
1. Simple random sampling
• The required number of individuals are selected at random from the sampling frame, a list or a database of all individuals in the population.
• Each member of a population has an equal chance of being included in the sample.
• To use a SRS method:
– Make a numbered list of all the units in the population, i.e. the sampling frame
– Each unit should be numbered from 1 to N (where N is the size of the population)
– Select the required number.
• The randomness of the sample is ensured by:
– Use of “lottery” methods
– Table of random numbers
– Computer programs
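
As an illustration of the “computer programs” option, a simple random sample can be drawn from a numbered sampling frame as follows. This is a sketch: the function name `simple_random_sample` and the frame size of 500 are assumptions made for the example.

```python
import random

def simple_random_sample(N: int, n: int, seed=None):
    """Draw n distinct unit numbers from a frame numbered 1..N, each equally likely."""
    rng = random.Random(seed)              # seed only makes the draw reproducible
    return sorted(rng.sample(range(1, N + 1), n))

# Hypothetical frame of 500 units, sample of 60
sample = simple_random_sample(N=500, n=60, seed=2018)
print(sample[:10])
```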
2. Systematic random sampling
• Sometimes called interval sampling
• Selection of individuals from the sampling frame systematically rather than randomly
• Individuals are taken at regular intervals down the list
• The starting point is chosen at random
• Important if the reference population is arranged in some order:
– Order of registration of patients
– Numerical order of house numbers
– Students’ registration books
• Taking individuals at fixed intervals (every kth) based on the sampling fraction, e.g. if the sample includes 20%, then every fifth.
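
The “every kth” rule can be sketched as below. The function name `systematic_sample` is hypothetical, and the sketch assumes the population size is a multiple of the sample size so that the interval k = N/n is a whole number.

```python
import random

def systematic_sample(N: int, n: int, seed=None):
    """Pick a random start in 1..k, then take every kth unit (k = N // n)."""
    k = N // n                           # sampling interval
    rng = random.Random(seed)
    start = rng.randint(1, k)            # random starting point within the first interval
    return [start + i * k for i in range(n)]

# A 20% sample from a list of 100 units means every fifth unit
units = systematic_sample(N=100, n=20, seed=1)
print(units)
```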
3. Stratified random sampling

• It is used when the population is known to be heterogeneous with regard to some factors, and those factors are used for stratification.
• Using stratified sampling, the population is divided into homogeneous, mutually exclusive groups called strata.
• A population can be stratified by any variable that is available for all units prior to sampling (e.g., age, sex, province of residence, income, etc.).
• A separate sample is taken independently from each stratum.
• Any of the sampling methods mentioned in this section (and others that exist) can be used to sample within each stratum.
• Equal allocation:
– Allocate equal sample size to each stratum
• Proportionate allocation: $n_j = n \cdot \dfrac{N_j}{N}$
– nj is the sample size of the jth stratum
– Nj is the population size of the jth stratum
– n = n1 + n2 + ...+ nk is the total sample size
– N = N1 + N2 + ...+ Nk is the total population size
• Example: proportionate allocation

Village   A     B     C     D     Total
HHs       100   150   120   130   500
S. size   ?     ?     ?     ?     60
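
The missing sample sizes in the example follow directly from the proportionate allocation formula. This sketch uses illustrative variable names and rounds each allocation to the nearest whole household; with other figures the rounded values may need adjusting so that they still add up to n.

```python
households = {"A": 100, "B": 150, "C": 120, "D": 130}   # N_j for each village
n_total = 60                                            # total sample size n
N_total = sum(households.values())                      # N = 500

exact = {v: n_total * N_j / N_total for v, N_j in households.items()}
allocation = {v: round(x) for v, x in exact.items()}

print(exact)
print(allocation, sum(allocation.values()))
```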
4. Cluster sampling
• Sometimes it is too expensive to carry out SRS:
– The population may be large and scattered.
– A complete list of the study population may be unavailable.
– Travel costs can become expensive if interviewers have to survey people from one end of the country to the other.
• Cluster sampling is the most widely used method for reducing the cost.
• The clusters should be similar to one another, unlike stratified sampling, where the strata differ from one another.

Example
• In a school based study, we assume students of the same school are homogeneous.
• We can randomly select sections and include all students of the selected sections only.

5. Multi-stage sampling
• Similar to cluster sampling, except that it involves picking a sample from within each chosen cluster, rather than including all units in the cluster.
• This type of sampling requires at least two stages.
• The primary sampling unit (PSU) is the sampling unit in the first sampling stage.
• The secondary sampling unit (SSU) is the sampling unit in the second sampling stage, etc.
Woreda (PSU) → Kebele (SSU) → Sub-Kebele (TSU) → HH

B. Non-probability sampling
• In non-probability sampling, every item has an unknown chance of being selected.
• In non-probability sampling, there is an assumption that there is an even distribution of a characteristic of interest within the population.
• This is what makes the researcher believe that any sample would be representative and that, because of this, the results will be accurate.
• In probability sampling, by contrast, randomization is a feature of the selection process.
The most common types of non-probability sampling

1. Convenience or haphazard sampling
2. Volunteer sampling
3. Judgment sampling
4. Quota sampling
5. Snowball sampling technique
1. Convenience or haphazard sampling
• Convenience sampling is sometimes referred to as haphazard or accidental sampling.
• It is not normally representative of the target population because sample units are only selected if they can be accessed easily and conveniently.
• The obvious advantage is that the method is easy to use, but that advantage is greatly offset by the presence of bias.
• Although useful applications of the technique are limited, it can deliver accurate results when the population is homogeneous.
2. Volunteer sampling
• As the term implies, this type of sampling occurs when people volunteer to be involved in the study.
• In psychological experiments or pharmaceutical trials (drug testing), for example, it would be difficult and unethical to enlist random participants from the general public.
• In these instances, the sample is taken from a group of volunteers.
• Sometimes, the researcher offers payment to attract respondents.
3. Judgment sampling
• This approach is used when a sample is taken based on certain judgments about the overall population.
• The underlying assumption is that the investigator will select units that are characteristic of the population.
• The critical issue here is objectivity: how much can judgment be relied upon to arrive at a typical sample?
• Judgment sampling is subject to the researcher's biases.
• One advantage of judgment sampling is the reduced cost and time involved in acquiring the sample.
4. Quota sampling
• It is the non-probability equivalent of stratified sampling.
• As in stratified sampling, the researcher first identifies the strata and their proportions as they are represented in the population.
• Then convenience or judgment sampling is used to select the required number of subjects from each stratum.
• This differs from stratified sampling, where the strata are filled by random sampling.
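A minimal sketch of how quotas are filled by convenience: people are taken in the order they happen to be encountered until each stratum's quota is met. The strata, quota sizes and arrival list are hypothetical.

```python
def quota_sample(arrivals, quotas):
    """Accept people in arrival order until every stratum's quota is filled."""
    remaining = dict(quotas)
    sample = []
    for person, stratum in arrivals:
        if remaining.get(stratum, 0) > 0:
            sample.append(person)
            remaining[stratum] -= 1
        if not any(remaining.values()):
            break  # all quotas filled, stop recruiting
    return sample


# Hypothetical quotas reflecting a 60% female / 40% male population, n = 5
quotas = {"female": 3, "male": 2}
arrivals = [
    ("p1", "female"), ("p2", "female"), ("p3", "male"), ("p4", "female"),
    ("p5", "female"), ("p6", "male"), ("p7", "male"),
]
sample = quota_sample(arrivals, quotas)
print(sample)  # ['p1', 'p2', 'p3', 'p4', 'p6']
```

Nothing in the selection is random: p5 is skipped only because the female quota was already full, and p7 is never reached.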
5. Snowball sampling
• A technique for selecting a research sample where
existing study subjects recruit future subjects from
among their friends.
• Thus the sample group appears to grow like a rolling
snowball.
• This sampling technique is often used in hidden
populations which are difficult for researchers to access;
example populations would be drug users or commercial
sex workers.
• Because sample members are not selected from a sampling frame, snowball samples are subject to numerous biases. For example, people who have many friends are more likely to be recruited into the sample.
Sample Size Determination

• How big is big enough?
• Generally the larger the better, but that takes more time and money.
How many people to study?

If too many….
• Waste of resources!

If too few….
• May fail to detect an important effect
• Estimates of effect may be too imprecise (wide CIs)
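To see why small samples give wide confidence intervals, the sketch below computes the half-width of an approximate 95% CI for a proportion at a few sample sizes. It assumes the normal-approximation formula z × sqrt(p(1 − p)/n), which is not given on these slides; p = 0.5 and the sample sizes are illustrative.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Half-width of the normal-approximation CI for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


for n in (50, 200, 800):
    moe = margin_of_error(0.5, n)
    print(f"n = {n:4d}   95% CI half-width = +/- {moe:.3f}")
```

Under this formula the half-width shrinks with the square root of n, so the sample size must be quadrupled to halve the width of the interval.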
Sample size …
• Which variables should be included in the sample size calculation?
– It should relate to the study's primary outcome variable.
– If the study has secondary outcome variables that are considered important, the sample size should also be sufficient for the analysis of these variables.
• The required sample size depends on:
– How different or dispersed the population is
– Desired level of confidence
– Desired degree of accuracy
– Desired margin of error
How do we calculate a sample size?

Formulae and software commands are in the notes, or ask a statistician.

Approaches:
– Rules of thumb approach
– Confidence interval approach
– Hypothesis testing approach
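As an illustration of the confidence interval approach, the sketch below uses the commonly taught single-proportion formula n = z² × p × (1 − p) / d². The formula is not stated on these slides, so treat it as an assumption; p is the expected proportion, d the desired margin of error, and z the critical value for the chosen confidence level.

```python
import math


def sample_size_single_proportion(p, d, z=1.96):
    """n = z^2 * p * (1 - p) / d^2, rounded up to a whole person."""
    return math.ceil(z**2 * p * (1 - p) / d**2)


# Illustrative inputs: no prior estimate (p = 0.5), 5% margin, 95% confidence
n = sample_size_single_proportion(p=0.5, d=0.05)
print(n)  # 385
```

Using p = 0.5 gives the largest possible n for a given d, which is why it is a common choice when no prior estimate of the proportion is available. Adjustments such as finite population correction, design effect or non-response are not included here.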
1. Rules of thumb approach
Different views:
1. The larger the population size, the smaller the percentage of the population required to get a representative sample.
2. For small populations (N < 100), there is little point in sampling. Survey the entire population.
3. If the population size is around 500, 50% should be sampled.
4. If the population size is around 1500, 20% should be sampled.
5. Statisticians (maximalist view): at least 500.
6. To make generalizations about the entire population, a total sample size of 200-400 is needed.
Some Considerations

Summary
• Large-scale descriptive studies almost always use probability-sampling techniques.
• Intervention studies sometimes use probability sampling but also frequently use non-probability sampling.
• Qualitative studies almost always use non-probability samples.
