Notes 3 Descriptive Statistics RJMurden 2021
Statistical Methods I
Section 003
NOTES 3:
DESCRIPTIVE STATISTICS
DR. RAPHIEL J. MURDEN
2021
Outline of Notes
Types of Variables (Types of Data)
Describing Data
◦ Measures of Centrality and Variability
◦ Frequencies, Percentages
Reading: Chapter 2
Types of Variables (Data)
Categorical (qualitative) variable - places
individuals into one of several groups or categories
◦ non-numerical information, cannot perform arithmetic
with this information
Example Question of Interest:
Does in-person classroom atmosphere increase the average final
grade as compared to online atmosphere?
Categorical Variable:
Ordinal
Ordinal Scale: “order matters”
◦ Classes of Rheumatoid Arthritis: 1-4
◦ 1: Normal Activity, 4: Wheelchair-bound
◦ Visual Acuity
◦ 20-20, 20-30, 20-40
◦ Example of an ordinal variable with no specific numeric values;
common arithmetic cannot be performed in a
meaningful way
Numerical Scale: Discrete
Values can be listed, even if list continues
indefinitely
“Count” variables: equal to integers
Examples
◦ Number of Fractures
◦ Number of Pipes
◦ Number of Uninsured
Numerical Scale: Continuous
values on a continuum
◦ Variables whose possible values form some
interval of numbers
Often has “decimal points”
◦ Height, Weight, Length, Lifetime of a light bulb
Though often reported to closest integer
◦ Age of Adults in Years
◦ Age of Young Children in Months
What type of variable is
Final Grade?
Type? Categorical, Discrete, Continuous
Final Grade Range: 0 – 100
◦ Final Grade is a continuous variable
What is Statistics?
Statistics is the study of how best to:
◦ (a) design a quantitative study
◦ (b) collect data;
◦ (c) summarize or describe data (a statistic); and
◦ (d) draw formal inferences and practical conclusions based
on data
all the while recognizing the reality of variation.
Summarizing Quantitative
Data
In a study examining use of healthcare in the aging population, the ages of
the first 5 subjects randomly chosen for inclusion are:
75, 85, 95, 75, 78
Now let’s summarize these data using measures of centrality and spread
Measures of Centrality: Mean
Sample Mean: arithmetic average of the observations
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $n$ = # of observations
◦ $\sum_{i=1}^{n} x_i$ is read as “the sum from i=1 to i=n of x sub i”
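A minimal Python sketch of the sample mean, applied to the five ages from the healthcare example. The function name `sample_mean` is chosen here for illustration and does not come from the notes.

```python
def sample_mean(values):
    """Arithmetic average: sum of the observations divided by n."""
    n = len(values)          # number of observations
    return sum(values) / n   # (1/n) * sum of x_i


ages = [75, 85, 95, 75, 78]   # ages from the healthcare example
print(sample_mean(ages))      # 408 / 5 = 81.6
```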
Measures of Centrality:
Median
Median: the middle observation
◦ Half the observations are below median and half
are above
◦ Procedure for calculating
◦ Arrange observations from smaller to larger
◦ Count in to find middle observation
◦ Odd Number of Observations: median is the middle value
◦ Even Number of Observations: median is the mean of
two middle values
Median
Example I: 75, 85, 95, 75, 78
◦ Ordered: 75, 75, 78, 85, 95
◦ The Median is 78
Example II: 2, 5, 1, 8, 4, 3
◦ Ordered: 1, 2, 3, 4, 5, 8
◦ The Median is 3.5 (the middle value between 3 and 4)
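The procedure for the median can be sketched in Python as follows. This is an illustrative helper (the name `median` is an assumption of this sketch), following the odd/even rule stated in the notes and checked against Examples I and II.

```python
def median(values):
    """Middle observation of the ordered data (mean of the two middle values if n is even)."""
    ordered = sorted(values)      # arrange observations from smaller to larger
    n = len(ordered)
    mid = n // 2                  # index of the (upper) middle observation
    if n % 2 == 1:                # odd n: the single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: mean of the two middle values


print(median([75, 85, 95, 75, 78]))   # 78
print(median([2, 5, 1, 8, 4, 3]))     # 3.5
```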
Measures of Centrality: Mode
Mode: value that occurs most often
Example I: 75, 85, 95, 75, 78
◦ Mode is 75
Example II: 2, 5, 1, 8, 4, 3
◦ Mode is: No mode
Example III: 5,10,16,5,12,13,10,18,23
◦ Mode is 5 and 10 (Bimodal)
In general, a variable may have/be
◦ No Mode
◦ Unimodal
◦ Bimodal
◦ Multimodal
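One way to find the mode(s) in Python is sketched below. The helper name `modes` is hypothetical, the input is assumed to be non-empty, and the rule "no value repeats means no mode" is taken from Example II.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    counts = Counter(values)              # value -> number of occurrences
    top = max(counts.values())            # highest frequency observed
    if top == 1:                          # every value occurs once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)


print(modes([75, 85, 95, 75, 78]))                    # [75]      (unimodal)
print(modes([2, 5, 1, 8, 4, 3]))                      # []        (no mode)
print(modes([5, 10, 16, 5, 12, 13, 10, 18, 23]))      # [5, 10]   (bimodal)
```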
Using Measures of Centrality
Which is Best? depends on
◦ Scale of measurement (ordinal or continuous)
◦ Shape of Distribution of Observations (skewed?)
Symmetric vs. Skewed Distribution
Symmetric Distribution: has same shape on both sides of mean
Skewed Distributions
Measures of Spread
Measures of Spread: Range
Range: difference between largest and smallest
observation
Example I: 75, 85, 95, 75, 78
Range = 95 - 75 = 20
◦ Be careful: the range can be misleading
◦ Example Ia: 40, 75, 85, 95, 75, 78, 82, 88; Range = 95 - 40 = 55
◦ Report the min and max when provided; this gives a little more information but
can still be misleading
◦ Example I: Min: 75, Max: 95
◦ Example Ia: Min: 40, Max: 95
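A short Python sketch that reports the minimum, maximum, and range together, as suggested above. The function name `min_max_range` is an assumption made for this illustration.

```python
def min_max_range(values):
    """Return (min, max, range), where range = largest - smallest observation."""
    smallest, largest = min(values), max(values)
    return smallest, largest, largest - smallest


print(min_max_range([75, 85, 95, 75, 78]))               # (75, 95, 20)
print(min_max_range([40, 75, 85, 95, 75, 78, 82, 88]))   # (40, 95, 55)
```

A single low value (40) moves the range from 20 to 55, which is why the range alone can mislead.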
Sample Variance and Standard Deviation
Let $\bar{x}$ be the mean of a sample $x_1, \dots, x_n$. The (sample) variance is
given by
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
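As an illustration, the sample variance can be computed directly from the formula and the standard deviation taken as its square root. The sketch below uses the five ages from the healthcare example; the function name `sample_variance` is chosen for this sketch, and the result is cross-checked against Python's standard-library `statistics.variance`, which also divides by n - 1.

```python
import math
import statistics


def sample_variance(values):
    """s^2 = sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


ages = [75, 85, 95, 75, 78]
s2 = sample_variance(ages)     # approximately 72.8
s = math.sqrt(s2)              # standard deviation, approximately 8.53

# Cross-check against the standard library (same n - 1 divisor).
assert math.isclose(s2, statistics.variance(ages))
print(round(s2, 2), round(s, 2))
```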
Sample Variance and Standard Deviation
If observed values are far apart, the variance and
standard deviation will be relatively large
Empirical Rule
If the data are symmetric, we can use the Empirical Rule (or 68-95-
99.7 rule) as an approximation
Roughly 68% of the data fall within one standard deviation of the
mean
Roughly 95% of the data fall within two standard deviations of the
mean
Roughly 99.7 % of the data fall within three standard deviations of
the mean
◦ Note: These numbers come from normal distribution theory, which we’ll see
later.
Ex. Empirical Rule
Assume that we have a symmetric data set with $\bar{x} = 515$ and $s = 114$.
Then,
Roughly 68% of the data fall between
515-1*114=401 and 515+1*114=629
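The same arithmetic extends to two and three standard deviations. The sketch below assumes a mean of 515 and a standard deviation of 114, the values used in the 68% calculation above; the function name `empirical_rule_intervals` is hypothetical.

```python
def empirical_rule_intervals(mean, sd):
    """Intervals mean +/- k*sd for k = 1, 2, 3, labelled with the rough coverage."""
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {label: (mean - k * sd, mean + k * sd) for k, label in coverage.items()}


for label, (low, high) in empirical_rule_intervals(515, 114).items():
    print(f"Roughly {label} of the data fall between {low} and {high}")
# Roughly 68% of the data fall between 401 and 629
# Roughly 95% of the data fall between 287 and 743
# Roughly 99.7% of the data fall between 173 and 857
```

These intervals are only an approximation and apply when the data are roughly symmetric.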
Measures of Spread: IQR
Interquartile Range: difference between the first (25th percentile) and
third (75th percentile) quartile
Contains central 50% of observations
Example: 75, 85, 95, 75, 78
75, 75, 78, 85, 95
IQR: Q3 - Q1 = 85 - 75 = 10 (using the convention for quartiles given in these notes)
Quartiles
Q1 is the median of the ordered observations that are
less than or equal to Q2
Q3 is the median of the ordered observations that lie
to the right of Q2.
Quartiles
Ex. 56, 67, 68, 72, 74, 75, 88, 90, 97, 99
There are different conventions for finding quartiles! We use
the following convention to find Q1 and Q3:
First, order the data from smallest to largest
Second, find the median (Q2)
Third, find Q1 as the median of the first half of the data
determined by Q2
Fourth, find Q3 as the median of the second half of the data
determined by Q2
Quartiles
Data: 56, 67, 68, 72, 74, 75, 88, 90, 97, 99
The median splits the data into two parts. We want the same number of
observations on both sides of the median. Any point between 74 and 75
would serve the purpose. As a convention, we take the average of the two,
(74+75)/2 = 74.5, as the median.
◦ Use 74 to determine the end point for first half of data
◦ Use 75 to determine the start point for second half of data
Quartiles
Data: 56, 67, 68, 72, 74, 75, 88, 90, 97, 99
Q1 = 68 (median of 56, 67, 68, 72, 74)
Q3 = 90 (median of 75, 88, 90, 97, 99)
Quartiles
Ex.
56, 67, 68, 72, 74, 75, 88, 90, 97, 99,100
If the number of data points (n) is odd, then our convention is that Q2 is counted as
part of both the lower half and the upper half
Q2 = 75; Q1 = (68+72)/2 = 70; Q3 = (90+97)/2 = 93.5
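The quartile convention in these notes can be sketched in Python as below. The helper names `_median_sorted` and `quartiles` are assumptions of this sketch. Library routines often use other conventions and may return slightly different quartiles, so this code follows the notes' rule only: for even n the data split into two equal halves, and for odd n the median Q2 is counted in both halves.

```python
def _median_sorted(ordered):
    """Median of an already-sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the convention described in the notes."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        lower = ordered[:half + 1]   # odd n: Q2 belongs to the lower half...
        upper = ordered[half:]       # ...and to the upper half
    else:
        lower = ordered[:half]       # even n: two equal halves
        upper = ordered[half:]
    return _median_sorted(lower), _median_sorted(ordered), _median_sorted(upper)


print(quartiles([56, 67, 68, 72, 74, 75, 88, 90, 97, 99]))        # (68, 74.5, 90)
print(quartiles([56, 67, 68, 72, 74, 75, 88, 90, 97, 99, 100]))   # (70.0, 75, 93.5)
```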
Range vs. IQR
Example: 40 , 75, 78, 85, 95
What is the Range? IQR?
Range: 55 (40, 95)
IQR: 10 (75, 85)
IQR is more robust (less sensitive) than Range
IQR is not as sensitive to shape of distribution or to extreme
values (outliers)
◦ IQR shows the spread of the middle 50% of the observations in a
dataset.
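A small Python comparison makes the robustness point concrete. It uses the quartile convention from these notes (for odd n, the median is counted in both halves) and compares Example I (75, 75, 78, 85, 95) with the data above, where the smallest value is replaced by 40. The function names `iqr` and `value_range` are illustrative only.

```python
def _median_sorted(ordered):
    """Median of an already-sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2


def iqr(values):
    """Q3 - Q1, with the median included in both halves when n is odd."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half + 1] if n % 2 == 1 else ordered[:half]
    upper = ordered[half:]
    return _median_sorted(upper) - _median_sorted(lower)


def value_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)


original = [75, 75, 78, 85, 95]       # Example I, ordered
with_outlier = [40, 75, 78, 85, 95]   # smallest value replaced by 40

for data in (original, with_outlier):
    print(value_range(data), iqr(data))
# 20 10
# 55 10
```

The range nearly triples when one extreme value is introduced, while the IQR is unchanged.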
Using Measures of Dispersion
SD: complement to mean
◦ most useful with symmetric, numerical data
Summary Measures for Numerical Data
Measures of Centrality
◦ Mean
◦ Median
◦ Mode
Measures of Variability
◦ Variance (Standard Deviation)
◦ Range (Minimum – Maximum)
◦ Interquartile Range (IQR: 25% - 75%)
Summarizing Categorical
Data
Use Frequencies and Percentages
◦ Of the 100 subjects, 40 are male
◦ 116 (58%) of the subjects ceased smoking after
the intervention (Total N=200)
Give enough information that the reader can work out the n in each category and the total N
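A frequency-and-percentage summary can be sketched in Python as follows. The counts reproduce the smoking example (116 of N=200). The category labels "ceased" and "continued" and the function name `frequency_table` are hypothetical, and the 84 remaining subjects follow from 200 - 116.

```python
from collections import Counter


def frequency_table(labels):
    """Return ({category: (n, percent)}, total N) for a list of category labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    table = {category: (n, 100 * n / total) for category, n in counts.items()}
    return table, total


# Hypothetical labels; counts taken from the smoking example (116 of 200).
outcomes = ["ceased"] * 116 + ["continued"] * 84
table, total = frequency_table(outcomes)

for category, (n, pct) in table.items():
    print(f"{category}: {n} ({pct:.0f}%) of N={total}")
# ceased: 116 (58%) of N=200
# continued: 84 (42%) of N=200
```

Reporting the count, the percentage, and the total N together lets a reader recover every category size.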
Summarizing Categorical Data:
Cross Tabs
Gender
NC 12 (86%) 16 (80%)
Recall: The Basic Paradigm
[Diagram: a Sample is drawn from the Population; Statistics computed from the Sample are used for Inference about the Parameters of the Population]
Does learning introductory statistics in-person increase retention of
knowledge and skills when compared to learning statistics online?
Data source: Final grades of 100 students enrolled in BIOS 501 SP19
and 100 students enrolled in BIOS 501 SP20