Bstat Handouts - Descriptive Only Handouts 2
LA SALLE
Yu An Log College of Business and Accountancy
HANDOUTS 2
Defn: Sampling – the process of selecting the subjects of the population to be included in the sample
Why Sample?
Why should we not use the population as the focus of study? There are at least four major reasons to
sample.
The second reason to sample is that it may be impossible to test the entire population. For example, let
us say that we wanted to test the 5-HIAA (a serotonergic metabolite) levels in the cerebrospinal fluid
(CSF) of depressed individuals. There are far too many individuals who do not make it into the mental
health system to even be identified as depressed, let alone to test their CSF.
The third reason to sample is that testing the entire population often produces error. Thus, sampling
may be more accurate. Perhaps an example will help clarify this point. Say researchers wanted to
examine the effectiveness of a new drug on Alzheimer's disease. One dependent variable that could be
used is an Activities of Daily Living Checklist. In other words, it is a measure of day-to-day
functioning. In this experiment, it would make sense to have as few people rating the patients as
possible. If one individual rates the entire sample, there will be some measure of consistency from one
patient to the next. If many raters are used, this introduces a source of error. These raters may all use
slightly different criteria for judging Activities of Daily Living. Thus, as in this example, it would be
problematic to study an entire population.
The final reason to sample is that testing may be destructive. It makes no sense to lesion the lateral
hypothalamus of all rats to determine if it has an effect on food intake. We can get that information
from operating on a small sample of rats. Also, you probably would not want to buy a car that had its
door slammed five hundred thousand times or had been crash tested. Rather, you probably would want
to purchase a car that did not make it into either of those samples.
It is extremely important to choose a sample that is truly representative of the population so that the
inferences derived from the sample can be generalized back to the population of interest. Improper and
biased sampling is the primary reason for often divergent and erroneous inferences reported in opinion
polls and exit polls conducted by different polling groups.
LEONARES, S. R. 1
Types of Sampling:
A. Probability sampling
• each element of the population is given a chance (non-zero probability) of being included in
the sample
• minimizes, if not eliminates, selection bias
• ideal if generalizability of results is important for your study
• inferential statistical procedures can be used for arriving at generalizations/conclusions about
the population based on sample data
1. Simple Random
• Each element of the population is given an equal chance of being included in the
sample
• Most basic probability sampling procedure
• Foundation of all probability sampling procedures
• When to use:
– The population is homogeneous
– A sampling frame is available (sampling frame – complete and updated list
of the population)
• Procedure:
– Lottery
– Use of random number generators
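The random-number-generator procedure can be sketched in Python as follows. This is only a minimal illustration: the sampling frame of 250 numbered elements, the sample size of 15, and the fixed seed are all assumptions chosen for the example.

```python
import random

# Hypothetical sampling frame: a complete list of population elements,
# numbered 1 to 250 (an assumption for illustration).
sampling_frame = list(range(1, 251))
sample_size = 15

# A seeded generator makes the draw repeatable (a convenience, not a requirement).
rng = random.Random(2024)

# random.sample draws without replacement, so every element has an equal
# chance of being selected and no element is selected twice.
simple_random_sample = sorted(rng.sample(sampling_frame, sample_size))
print(simple_random_sample)
```

Drawing without replacement mirrors the lottery method, where a name that has been drawn is not returned to the box.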
2. Systematic Random
• Selecting every kth element of the population
• When to use:
– When the population is homogeneous and there is no suspicion of a trend
or pattern in the frame or geographical layout
– A sampling frame is available
• Procedure:
i. Determine the sampling interval, k = N/n (rounded to the nearest whole number)
ii. Identify the random start, rs: 1 ≤ rs ≤ k (randomly drawing a value between
1 and k)
iii. Determine the positions of the elements to be included in the sample: rs, rs
+ k, rs + 2k, …
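The three steps above can be sketched in Python. The function name, the population size N = 200, the sample size n = 20, and the seed are hypothetical values chosen so that N/n is a whole number.

```python
import random

def systematic_sample(population_size, sample_size, rng):
    """Return (k, rs, positions) for a systematic random sample."""
    # Step i: sampling interval k = N/n, rounded to a whole number.
    k = max(1, round(population_size / sample_size))
    # Step ii: random start rs, drawn so that 1 <= rs <= k.
    rs = rng.randint(1, k)
    # Step iii: positions rs, rs + k, rs + 2k, ... that fall inside the frame.
    positions = list(range(rs, population_size + 1, k))[:sample_size]
    return k, rs, positions

# Hypothetical frame of N = 200 elements and a sample of n = 20.
k, rs, positions = systematic_sample(200, 20, random.Random(1))
print(f"k = {k}, random start = {rs}")
print(positions)
```

When N/n is not a whole number, rounding k can leave the list of positions slightly shorter than n, so the interval may need adjusting in practice.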
3. Stratified Random
• selecting random samples from mutually exclusive subpopulations, or strata,
of the population.
• When to use:
– When the population is heterogeneous but can be subdivided into
homogeneous subgroups or strata
– A sampling frame is available for each stratum
• Procedure:
i. Determine the proportion of each stratum relative to the population
ii. Identify the stratum sample sizes using proportional allocation
iii. Select the samples from each stratum using either simple or systematic
random sampling
Example: Among the 250 employees of the local office of an international insurance
company, 182 are Filipinos, 51 are Chinese, and 17 are Americans. If we use
proportional allocation to select a stratified random grievance committee of 15
employees, how many employees must we take from each race?
Solution:
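Race (i)      Ni      %        ni
Filipino      182     72.8     15 × 0.728 = 10.92 ≈ 11
Chinese       51      20.4     15 × 0.204 = 3.06 ≈ 3
American      17      6.8      15 × 0.068 = 1.02 ≈ 1
Total         250     100.0    15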
Therefore, 11 Filipinos, 3 Chinese, and 1 American will compose the grievance committee.
These will have to be randomly selected from among each of the subgroups.
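The proportional allocation in this example can be checked with a short Python sketch. The stratum sizes and the committee size come from the example; the variable names are illustrative only.

```python
# Stratum sizes from the grievance committee example.
stratum_sizes = {"Filipino": 182, "Chinese": 51, "American": 17}
total_sample_size = 15

population_size = sum(stratum_sizes.values())  # N = 250

allocation = {}
for stratum, size in stratum_sizes.items():
    proportion = size / population_size      # stratum share of the population
    exact = proportion * total_sample_size   # unrounded stratum sample size
    allocation[stratum] = round(exact)       # rounded to a whole person
    print(f"{stratum}: {proportion:.1%} of N, n_i = {exact:.2f} -> {allocation[stratum]}")

print("Total selected:", sum(allocation.values()))
```

Here the rounded stratum sizes add up to 15. With other data, rounding can make the total differ from n by one, in which case one stratum size has to be adjusted.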
4. Cluster Random
• Selecting clusters of elements rather than individual elements
• When to use:
– when "natural" groupings are evident in a statistical population
– a sampling frame is not available
• Procedure:
i. Divide the population into clusters (M = total number of clusters)
ii. Randomly select m clusters
iii. Include all elements within the selected clusters to form the resulting
sample
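The cluster procedure above can be sketched in Python. The cluster names, their members, the choice of m = 2, and the seed are all hypothetical.

```python
import random

# Step i: hypothetical population already divided into M = 5 clusters.
clusters = {
    "Section A": ["A1", "A2", "A3"],
    "Section B": ["B1", "B2"],
    "Section C": ["C1", "C2", "C3", "C4"],
    "Section D": ["D1", "D2", "D3"],
    "Section E": ["E1", "E2"],
}
m = 2  # number of clusters to select (assumption)

# Step ii: randomly select m clusters (only a list of clusters is needed,
# not a list of individual elements).
rng = random.Random(7)
selected_clusters = rng.sample(sorted(clusters), m)

# Step iii: include every element of each selected cluster.
cluster_sample = [element for name in selected_clusters for element in clusters[name]]

print(selected_clusters)
print(cluster_sample)
```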
B. Non-probability sampling
• not all elements of the population are given a chance of being included in the sample
• prone to selection bias
• however, there may be unique circumstances where non-probability sampling can be
justified (e.g., medical research), although the generalizability of the conclusions is not
assured
• inferential statistical procedures cannot be used for arriving at generalizations/conclusions
about the population based on sample data
2. Judgmental/Purposive/Expert
• The researcher selects the sample based on his or her judgment as to who best fits the
established criteria
3. Quota
• Selecting sample elements nonrandomly according to some fixed quota
4. Snowball
• Especially useful when you are trying to reach populations that are
inaccessible or hard to find
Problems in Sampling:
There are several potential sampling problems. When designing a study, a sampling procedure is also
developed including the potential sampling frame. Several problems may exist within the sampling
frame.
1. Missing elements - individuals who should be on your list but for some reason are not. For
example, if my population consists of all individuals living in a particular city and I use the phone
directory as my sampling frame or list, I will miss individuals who have unlisted numbers or who cannot
afford a phone.
2. Foreign elements - Elements which should not be included in my population and sample appear on
my sampling list. Thus, if I were to use property records to create my list of individuals living within a
particular city, landlords who live elsewhere would be foreign elements. In this case, renters would be
missing elements.
3. Duplicates - these are elements that appear more than once on the sampling frame. For example, if I
am a researcher studying patient satisfaction with emergency room care, I may potentially include the
same patient more than once in my study. If the patients are completing a patient satisfaction
questionnaire, I need to make sure that patients are aware that if they have completed the
questionnaire previously, they should not complete it again. If they complete it more than once, their
second set of data represents a duplicate.
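The three frame problems can be illustrated with a small Python sketch. It assumes, purely for illustration, that the target population is fully known, which is rarely true in practice; all identifiers are hypothetical.

```python
from collections import Counter

# Hypothetical target population and sampling frame (illustrative IDs only).
target_population = {"P01", "P02", "P03", "P04", "P05"}
sampling_frame = ["P01", "P02", "P02", "P04", "X99"]

frame_counts = Counter(sampling_frame)
frame_members = set(sampling_frame)

# Missing elements: in the population but not on the frame.
missing = sorted(target_population - frame_members)
# Foreign elements: on the frame but not in the population.
foreign = sorted(frame_members - target_population)
# Duplicates: listed on the frame more than once.
duplicates = sorted(item for item, count in frame_counts.items() if count > 1)

print("Missing:", missing)
print("Foreign:", foreign)
print("Duplicates:", duplicates)
```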
Read:
https://scholarcommons.usf.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=1002&context=oa_textbooks (Chapter 8 – Sampling)
DATA COLLECTION PROCEDURES
1. Interview
• There is interaction between interviewer and respondent
• Most important method of data collection
• Some advantages:
o Clarifications about ambiguous questions/answers can be made
o More in-depth information can be generated
• Some disadvantages:
o Time-consuming
o Costly
o Responses may be influenced by the interviewer
2. Questionnaire
• No interaction between facilitator and respondent about the subject matter
• Respondent personally answers the questions on survey forms
• Some advantages:
o Less costly
o Less time-consuming
o Responses are not influenced by the interviewer
o Respondents answer the questions with relative anonymity; may answer more
truthfully
• Some disadvantages:
o Not effective if the respondent is illiterate
o Clarifications about vague questions cannot be made
o Respondents may misinterpret the questions
o Intended respondents may not personally answer the forms; may request other
people to respond
o Low rate of returns
3. Experimentation
• a controlled study in which the researcher attempts to understand cause-and-effect
relationships
• The study is "controlled" in the sense that the researcher controls
(1) how subjects are assigned to groups and
(2) which treatments each group receives.
4. Observation
• Like experiments, observational studies attempt to understand cause-and-effect relationships
• Unlike experiments, the researcher is not able to control (1) how subjects are assigned to
groups and/or (2) which treatments each group receives.
• Also used for behavioral, attitudinal studies
Read:
https://scholarcommons.usf.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=1002&context=oa_textbooks (Chapters 9 & 10)
ORGANIZATION AND PRESENTATION OF DATA
Frequency Distribution - A tabular summary of data showing the number (frequency) of items in each of
several non-overlapping classes.
Example: The following data were obtained from a sample of 50 soft drink purchases. Construct a
frequency distribution to summarize the data.
Graphical presentations of qualitative data:
1. Bar graph – A graphical device for depicting qualitative data that have been summarized in a frequency
or percent distribution
[Figure: bar graph of No. of bottles bought (vertical axis, 0 to 16) by Soft drink brand (horizontal axis: Coke, Coke Zero, Pepsi, Pepsi Max, Sprite, Mountain Dew)]
2. Pie chart – A graphical device for presenting data summaries based on subdivision of a circle into sectors
that correspond to the percentage frequency for each category
[Figure: pie chart of percentage frequency by soft drink brand (Coke, Coke Zero, Pepsi, Pepsi Max, Sprite, Mountain Dew)]
USING EXCEL: Watch Excel Statistics 15: Category Frequency Distribution w Pivot Table & Pie Chart by
ExcelIsFun at http://www.youtube.com/watch?v=-ERARVSfeuw
3. Rod Graph – a form of bar graph where the bars have zero width. It is especially used when the data
are discrete
Patient   1  2  3  4  5  6  7  8  9  10  11  12
Score     4  3  5  1  4  4  2  5  4  3   4   5
Array: 1, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5
Distinct score values: 1, 2, 3, 4, 5 (Ordinal data)
Table 1.3. Frequency distribution of anxiety scores
Score Frequency Percentage
1 1 8.3
2 1 8.3
3 2 16.7
4 5 41.7
5 3 25.0
Total 12 100.0
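The frequencies and percentages in Table 1.3 can be reproduced with a short Python sketch. The scores are the 12 patient scores listed earlier; the variable names are illustrative.

```python
from collections import Counter

# Anxiety scores of the 12 patients, in the order listed in the handout.
scores = [4, 3, 5, 1, 4, 4, 2, 5, 4, 3, 4, 5]

frequency = Counter(scores)  # counts how often each distinct score occurs
n = len(scores)

table = []
for value in sorted(frequency):
    percentage = round(100 * frequency[value] / n, 1)
    table.append((value, frequency[value], percentage))
    print(f"{value:>5} {frequency[value]:>9} {percentage:>10.1f}")

print(f"Total {n:>9} {100.0:>10.1f}")
```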
Rod Graph:
[Figure: rod graph of Frequency (vertical axis, 0 to 5) against Score (horizontal axis, 0 to 5)]
SHAPES OF DISTRIBUTIONS:
The rod graph (and later the histogram or frequency polygon) provides information about the
shape of a distribution – how the collected data are distributed over the possible values of the
variable. There are three major types:
1. Symmetric – the shape of the left side of the distribution is a mirror image of the right side
2. Skewed – the two sides of the distribution are not mirror images of each other
a. Positively skewed (skewed to the right, right-skewed) – scores tend to cluster toward the lower
end of the scale (i.e., the smaller numbers) with increasingly fewer scores at the upper end of
the scale (the larger numbers)
b. Negatively skewed (skewed to the left, left-skewed) – most of the scores tend to occur toward
the upper end of the scale while increasingly fewer scores occur toward the lower end
Note: the height of the graph represents the corresponding frequency of the point on the
horizontal axis
Example:
[Figure: rod graph of Frequency (vertical axis, 0 to 6) against Score (horizontal axis, 0 to 6)]
The distribution is negatively skewed: it has a longer left tail than right tail, with more scores to the
right of the center (score = 3) than to the left.
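The comparison of scores on either side of the center can be sketched in Python. This assumes the example plots the 12 anxiety scores from Table 1.3 (the figure itself does not say so), and it applies only the informal counting rule used in this handout, not a formal skewness statistic.

```python
# Anxiety scores from Table 1.3 (assumed to be the data behind the example).
scores = [4, 3, 5, 1, 4, 4, 2, 5, 4, 3, 4, 5]
center = 3  # middle of the 1-to-5 scale, as stated in the example

left_count = sum(1 for s in scores if s < center)    # scores below the center
right_count = sum(1 for s in scores if s > center)   # scores above the center

# Informal rule from the handout: scores piled up on the right leave a
# longer left tail (negative skew), and vice versa.
if right_count > left_count:
    shape = "negatively skewed"
elif left_count > right_count:
    shape = "positively skewed"
else:
    shape = "roughly symmetric"

print(left_count, right_count, shape)
```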
NOTE: more of the shape will be discussed in relation to measures of central tendency (later topic)
READ: https://www.mathbootcamps.com/common-shapes-of-distributions/
> These procedures can be used for either continuous or discrete data
Characteristics:
1. Non-overlapping class intervals (also called classes or intervals).
• use between 5 and 20 classes.
• use enough classes to show the variation in the data, but not so many that some contain only a few
items.
2. Each class has a lower limit (the lowest possible value that can belong to it) and an upper limit (the highest
possible value that can belong to it)
Example: the class interval 11 – 15 contains the values from 11 to 15 (11, 12, 13, 14, and 15)
3. Uniform class width for all classes (also called interval size).
This may be identified by the difference between two successive lower limits or two successive
upper limits
Can be generated by applications like Excel or statistical software
Example: The following data correspond to the age of the eldest child of parents in a given class:
12 14 19 18 16 30
15 15 18 17 21 31
20 27 22 23 15 25
22 21 33 28 14 22
14 18 16 13 27 18
Table 4. Frequency Distribution of Ages
Age (years)    Frequency
12 – 15        8
16 – 19        8
20 – 23        7
24 – 27        3
28 – 31        3
32 – 35        1
Total          30
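The tally behind Table 4 can be reproduced with a short Python sketch. The ages are the 30 values listed in the example; the lowest class limit (12), the class width (4), and the number of classes (6) are read off the table.

```python
# Ages of the eldest child, as listed in the example (30 values).
ages = [
    12, 14, 19, 18, 16, 30,
    15, 15, 18, 17, 21, 31,
    20, 27, 22, 23, 15, 25,
    22, 21, 33, 28, 14, 22,
    14, 18, 16, 13, 27, 18,
]

lowest_limit = 12       # lower limit of the first class
class_width = 4         # difference between successive lower limits
number_of_classes = 6

# Build (lower limit, upper limit) pairs: 12-15, 16-19, ..., 32-35.
classes = [
    (lowest_limit + i * class_width, lowest_limit + (i + 1) * class_width - 1)
    for i in range(number_of_classes)
]

# Count how many ages fall inside each class (limits are inclusive).
frequencies = [sum(1 for age in ages if lower <= age <= upper) for lower, upper in classes]

for (lower, upper), frequency in zip(classes, frequencies):
    print(f"{lower} - {upper}: {frequency}")
print("Total:", sum(frequencies))
```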
Comments:
1. there are 6 class intervals.
2. the class width is 4 (difference between 2 successive lower limits, e.g., 16 – 12, 32 – 28; or
difference between 2 successive upper limits, e.g., 31 – 27, 23 – 19)
> For purposes of presenting the data using a graph, additional columns are needed:
1. Class boundaries remove the gaps between intervals (there is a gap of 1 between 12 – 15 and 16 – 19,
etc.) – this is especially necessary if your data are continuous. Each boundary lies halfway across the gap,
so there is no longer a gap between the first and second intervals: 11.5 – 15.5 and 15.5 – 19.5, etc.
2. Class marks are the midpoints of the class intervals (add the lower limit and upper limit, then divide by
2)
Example: for the first interval: (12 + 15)/2 = 13.5 (do not round off)
Example: Using the age data (Table 4), the table is expanded below:
Age (years)    Class Boundaries    Class Marks    Frequency    Percentage
12 – 15        11.5 – 15.5         13.5           8            26.7
16 – 19        15.5 – 19.5         17.5           8            26.7
20 – 23        19.5 – 23.5         21.5           7            23.3
24 – 27        23.5 – 27.5         25.5           3            10.0
28 – 31        27.5 – 31.5         29.5           3            10.0
32 – 35        31.5 – 35.5         33.5           1            3.3
Total                                             30           100.0
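The additional columns can be computed from the class limits and frequencies of Table 4, as in this Python sketch. The variable names are illustrative, and the half-gap adjustment follows the description of class boundaries given earlier.

```python
# Class limits and frequencies from Table 4.
class_limits = [(12, 15), (16, 19), (20, 23), (24, 27), (28, 31), (32, 35)]
frequencies = [8, 8, 7, 3, 3, 1]
total = sum(frequencies)  # 30

# Gap between one upper limit and the next lower limit (16 - 15 = 1).
gap = class_limits[1][0] - class_limits[0][1]

rows = []
for (lower, upper), frequency in zip(class_limits, frequencies):
    boundaries = (lower - gap / 2, upper + gap / 2)  # e.g. 11.5 - 15.5
    class_mark = (lower + upper) / 2                 # midpoint, not rounded
    percentage = round(100 * frequency / total, 1)
    rows.append((boundaries, class_mark, percentage))
    print(f"{lower} - {upper}  {boundaries[0]} - {boundaries[1]}  "
          f"{class_mark}  {frequency}  {percentage}")
```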
1. Histogram – A graph consisting of a series of vertical columns or rectangles with no gaps between
bars
• each bar is drawn with a base spanning the class boundaries and a height corresponding to the class
frequency
• a suitable graph for representing data obtained from continuous variables.
[Figure: histogram of Frequency (vertical axis, 0 to 9) against Age (years) (horizontal axis, class boundaries 11.5 – 15.5, 15.5 – 19.5, 19.5 – 23.5, 23.5 – 27.5, 27.5 – 31.5, 31.5 – 35.5)]
Comment: Consider the boundary line between the bars of the third and fourth intervals to be the
middle value (dividing line between the left and right sides). The shape of the
distribution of ages is positively skewed.
2. Frequency Polygon – Constructed by plotting class marks (X) against class frequencies (Y) and
connecting the consecutive points by straight lines
• to close the frequency polygon, additional class marks (9.5 and 37.5) are added to both ends of
the distribution, each with zero frequency
[Figure: frequency polygon of Frequency (vertical axis) against Age (years) (horizontal axis, class marks 9.5, 13.5, 17.5, 21.5, 25.5, 29.5, 33.5, 37.5)]
Comment: The shape is positively skewed – more values concentrated to the left of the blue line than to
the right.
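The points of the frequency polygon can be listed with a short Python sketch. The class marks and frequencies come from the expanded age table; the closing marks are obtained by stepping one class width (4) beyond each end, which reproduces the 9.5 and 37.5 given above.

```python
# Class marks and frequencies from the expanded age table.
class_marks = [13.5, 17.5, 21.5, 25.5, 29.5, 33.5]
frequencies = [8, 8, 7, 3, 3, 1]
class_width = 4

# Close the polygon: add one extra class mark at each end with zero frequency.
x_values = [class_marks[0] - class_width] + class_marks + [class_marks[-1] + class_width]
y_values = [0] + frequencies + [0]

# Consecutive points are joined by straight lines when the polygon is drawn.
polygon_points = list(zip(x_values, y_values))
for x, y in polygon_points:
    print(x, y)
```

These points can be passed to any plotting tool that draws line charts, such as Excel.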
https://www.youtube.com/watch?v=g530cnFfk8Y
3. How to Create a Dashboard Using Pivot Tables and Charts in Excel (Part 3)
https://www.youtube.com/watch?v=FyggutiBKvU
B. by DannyRocksExcels:
1. Two Ways to Create a Frequency Distribution Report in Excel,
http://www.youtube.com/watch?v=nh5ObAKfj1o&feature=fvsr (preference: use of pivot functions)
1. Mari’s Steakhouse uses a questionnaire to ask customers how they rate the server, food quality,
cocktails, prices, and atmosphere at the restaurant. Each characteristic is rated on a scale of
outstanding (O), very good (V), good (G), average (A), and poor (P). Construct a frequency
distribution, bar graph, and pie chart to summarize the following data collected on food quality.
What is your feeling about the food quality ratings at the restaurant?
G O V G A O V O V G O V A
V O P V O G A O O O G O V
V A G O V P V O O G O O V
O G A O V O O G V A G
2. The following are the final examination test scores of 50 statistics students.
68 45 38 52 54 43 69 44 52 64
55 56 50 54 38 40 54 55 51 55
65 59 37 57 46 29 64 58 53 37
42 56 42 49 49 43 41 55 49 47
64 42 53 63 33 60 63 41 48 50
a. Construct a frequency distribution using 7 classes.
b. Develop a histogram and a frequency polygon.
c. Determine the shape of the distribution.
3. The following data are the scores of 50 individuals who answered a 150-item aptitude test as a
requirement for a job application.
112 107 97 69 72 100 115 106
73 73 86 76 92 119 98
126 124 127 118 128 106 84
82 83 134 132 104 94 75
92 92 100 96 108 85 98
115 81 102 91 76 68 113
95 106 80 81 141 95 119