Notes 3 Descriptive Statistics RJMurden 2021

BIOS 500:
Statistical Methods I
Section 003
NOTES 3:
DESCRIPTIVE STATISTICS
DR. RAPHIEL J. MURDEN
2021
Outline of Notes
Types of Variables (Types of Data)
Describing Data
◦ Measures of Centrality and Variability
◦ Frequencies, Percentages
Review Parameter, Statistic, Design
Reading: Chapter 2
Types of Variables (Data)
Categorical (qualitative) variable - places
individuals into one of several groups or categories
◦ non-numerical information, cannot perform arithmetic
with this information
Numerical (quantitative) variable - takes numerical
values for which arithmetic operations make sense;
usually counts or measurements
Example Question of Interest:
Does in-person classroom atmosphere increase the average final
grade as compared to online atmosphere?
Experimental Units: students in the classroom
Response Variable of Interest: final grade
Independent Variable of Interest (Factor): classroom atmosphere
What other variables would you collect?
◦ Age: in years (continuous), or as a category (before or after the internet, circa 2000)
◦ Current GPA: numerical
◦ One-on-one time with professor: numerical
◦ Number of previous online courses: numerical
Example: Final Grades
Student   Group       Gender   Grade
A         In-person   F        70
B         Online      M        80
C         In-person   F        95
D         Online      M        85
…         …           …        …
Which variables are categorical? Group, Gender
Which variables are quantitative? Grade

Example Question of Interest:
Does in-person classroom atmosphere increase the average
final grade as compared to online atmosphere?
Potential Lurking Variables
Go back to your list of other variables you think would be
helpful to collect for this study (i.e., variables you do not want
to leave as lurking variables). Identify whether each is a:
◦ Categorical Variable
◦ Quantitative Variable
Categorical Variable: Nominal
Nominal Scale
◦ Categories without specific ordering
◦ Dichotomous/Binary
◦ HPM track (Policy or Management)
◦ Smoker (1: Yes or 0: No)
◦ Yes or No | Insured or uninsured
◦ More than Two Categories
◦ Colors in a Standard Crayon Box
◦ Ethnicity
◦ Example in your field?
Qualitative Observations: describe quality and/or characteristics
Categorical Variable: Ordinal
Ordinal Scale: “order matters”
◦ Classes of Rheumatoid Arthritis: 1-4
◦ 1: Normal Activity, 4: Wheelchair-bound
◦ Visual Acuity
◦ 20-20, 20-30, 20-40
◦ Example of an ordinal scale with no specific numeric values:
common arithmetic cannot be performed in a meaningful way
Numerical Scale: Discrete
Values can be listed, even if list continues
indefinitely
“Count” variables: equal to integers
Examples
◦ Number of Fractures
◦ Number of Pipes
◦ Number of Uninsured
Numerical Scale: Continuous
values on a continuum
◦ Variables whose possible values form some
interval of numbers
Often has “decimal points”
◦ Height, Weight, Length, Lifetime of a light bulb
Though often reported to closest integer
◦ Age of Adults in Years
◦ Age of Young Children in Months
What type of variable is Final Grade?
Type? Categorical, Discrete, Continuous
Final Grade Range: 0 – 100
◦ Final Grade is a continuous variable
Final Grade is: Satisfactory, Unsatisfactory
◦ Final Grade is a categorical, binary variable
Final Grade is: A, A-, B+, B, B-, C, F
◦ Final Grade is a categorical, ordinal variable
Outline of Notes
Describing Data
Reading: Chapter 2
What is Statistics?
Statistics is the study of how best to:
◦ (a) design a quantitative study
◦ (b) collect data;
◦ (c) summarize or describe data (a statistic); and
◦ (d) draw formal inferences and practical conclusions based
on data
all the while recognizing the reality of variation.
Study Design → Collecting & Organizing Data → Analyzing Data → Statistical conclusions → Content area conclusions
Measures of Centrality & Spread
Summarizing Quantitative Data
In a study examining use of healthcare in the aging population, the ages of
the first 5 subjects randomly chosen for inclusion are:
75, 85, 95, 75, 78
Now let’s summarize these data using measures of centrality and spread
Measures of Centrality: Mean
Sample Mean: arithmetic average of the observations
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
◦ $n$ = # of observations
◦ $\sum_{i=1}^{n} x_i$ is read as “the sum from i=1 to i=n of x sub i”
Example: (75 + 85 + 95 + 75 + 78) / 5 = 81.6
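The sample mean can be checked with a few lines of Python. This is only a sketch: the function name `sample_mean` is made up for illustration, and the data are the five ages from the example.

```python
# Ages of the first 5 subjects from the example.
ages = [75, 85, 95, 75, 78]


def sample_mean(values):
    """Arithmetic average: sum of the observations divided by n."""
    n = len(values)
    return sum(values) / n


print(sample_mean(ages))  # 81.6
```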
Measures of Centrality: Median
Median: the middle observation
◦ Half the observations are below median and half
are above
◦ Procedure for calculating
◦ Arrange observations from smaller to larger
◦ Count in to find middle observation
◦ Odd Number of Observations: median is the middle value
◦ Even Number of Observations: median is the mean of
two middle values
Median
Example I: 75, 85, 95, 75, 78
◦ Ordered: 75, 75, 78, 85, 95
◦ The Median is 78
Example II: 2, 5, 1, 8, 4, 3
◦ Ordered: 1, 2, 3, 4, 5, 8
◦ The Median is 3.5, the mean of the two middle values 3 and 4
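The procedure for the median (sort, then take the middle value or the mean of the two middle values) might be sketched in Python as follows. The function name `median` is an illustrative choice; Python's built-in `statistics.median` follows the same rule.

```python
def median(values):
    """Middle observation of the sorted data (mean of the two middle ones if n is even)."""
    ordered = sorted(values)      # arrange observations from smaller to larger
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # odd n: the single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of two middle values


print(median([75, 85, 95, 75, 78]))  # 78
print(median([2, 5, 1, 8, 4, 3]))    # 3.5
```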
Measures of Centrality: Mode
Mode: value that occurs most often
Example I: 75, 85, 95, 75, 78
◦ Mode is 75
Example II: 2, 5, 1, 8, 4, 3
◦ No mode
Example III: 5,10,16,5,12,13,10,18,23
◦ Mode is 5 and 10 (bimodal)
In general, a variable may have/be
◦ No Mode
◦ Unimodal
◦ Bimodal
◦ Multimodal
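One way to find the mode(s) is to count how often each value occurs. The sketch below assumes the convention used in these examples, where a data set in which no value repeats has no mode; the function name `modes` is made up for illustration.

```python
from collections import Counter


def modes(values):
    """Return all values tied for the highest count; [] if no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once: treated as "no mode" here
    return sorted(v for v, c in counts.items() if c == top)


print(modes([75, 85, 95, 75, 78]))                   # [75]
print(modes([2, 5, 1, 8, 4, 3]))                     # []
print(modes([5, 10, 16, 5, 12, 13, 10, 18, 23]))     # [5, 10]
```

A result with one value is unimodal, two values bimodal, and more than two multimodal.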
Using Measures of Centrality
Which is Best? depends on
◦ Scale of measurement (ordinal or continuous)
◦ Shape of Distribution of Observations (skewed?)
Mean: numerical data; symmetrical dist.
◦ Sensitive to outliers, extreme values
Median: ordinal data, skewed dist.
Mode: bimodal dist.; designates the value that occurs most often
◦ Can be useful for discrete variables
Symmetric vs. Skewed Distribution
Symmetric Distribution: has same shape on both sides of mean
Skewed Distributions
◦ Right Skewed (Positively Skewed)
◦ Left Skewed (Negatively Skewed)
Measures of Spread
Measures of Spread: Range
Range: difference between largest and smallest
observation
Example I: 75, 85, 95, 75, 78
◦ Range = 95 - 75 = 20
◦ Be careful, the range can be misleading
◦ Example Ia: 40, 75, 85, 95, 75, 78, 82, 88; Range = 95 - 40 = 55
◦ Report the min and max when provided; this gives a little more information but
can still be misleading
◦ Example I: Min: 75, Max: 95
◦ Example Ia: Min: 40, Max: 95
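A short sketch of reporting the range together with the min and max, using the two examples. The helper name `describe_range` and its dictionary layout are assumptions for illustration.

```python
def describe_range(values):
    """Return the smallest value, largest value, and their difference."""
    lo, hi = min(values), max(values)
    return {"min": lo, "max": hi, "range": hi - lo}


print(describe_range([75, 85, 95, 75, 78]))
# {'min': 75, 'max': 95, 'range': 20}
print(describe_range([40, 75, 85, 95, 75, 78, 82, 88]))
# {'min': 40, 'max': 95, 'range': 55}
```

A single low value (40) nearly triples the range, which is why the range alone can mislead.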
Sample Variance and Standard Deviation
Let $\bar{x}$ be the mean of a sample $x_1, \dots, x_n$. The (sample) variance is
given by
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
**The (sample) standard deviation is the non-negative
square root of the variance: $s = \sqrt{s^2}$.**
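The variance formula translates almost line for line into Python. This is a sketch with made-up function names (`sample_variance`, `sample_sd`); the standard library's `statistics.variance` and `statistics.stdev` use the same n - 1 divisor and can serve as a cross-check.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """Non-negative square root of the sample variance."""
    return math.sqrt(sample_variance(values))


ages = [75, 85, 95, 75, 78]
print(round(sample_variance(ages), 2))  # 72.8
print(round(sample_sd(ages), 2))        # 8.53

# Cross-check against the standard library.
print(math.isclose(sample_variance(ages), statistics.variance(ages)))  # True
```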
Sample Variance and Standard Deviation
If observed values are far apart, the variance and
standard deviation will be relatively large
Variance and standard deviation are strongly affected
by outliers
Properties of sample variance
Note that variance and standard deviation are
greater than or equal to zero.
◦ They are zero only when all the observed values are equal
If we multiply all of the numbers in the data set by a
non-zero constant c, the effect is to
◦ multiply the variance by the square of c
◦ multiply the standard deviation by |c| (the absolute value of c)
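These properties can be checked numerically. The sketch below uses a negative constant (c = -2, an arbitrary choice) to show that the variance scales by c squared while the standard deviation, which cannot be negative, scales by the absolute value of c.

```python
import math
import statistics

data = [75, 85, 95, 75, 78]
c = -2                                  # arbitrary non-zero constant
scaled = [c * x for x in data]

var_ratio = statistics.variance(scaled) / statistics.variance(data)
sd_ratio = statistics.stdev(scaled) / statistics.stdev(data)

print(math.isclose(var_ratio, c ** 2))  # True: variance multiplied by c squared
print(math.isclose(sd_ratio, abs(c)))   # True: SD multiplied by |c|

# Variance is zero only when all observed values are equal.
print(statistics.variance([7, 7, 7]) == 0)  # True
```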
Calculate the SD and Variance By Hand!
Example I: 75, 85, 95, 75, 78
◦ Mean 81.6
◦ Var 72.8
◦ SD 8.53
Example II: 20, 40, 10, 90
◦ Mean 40
◦ Var 1266.67
◦ SD 35.59
Empirical Rule
If the data are symmetric, we can use the Empirical Rule (or 68-95-99.7 rule)
as an approximation
Roughly 68% of the data fall within one standard deviation of the
mean
Roughly 95% of the data fall within two standard deviations of the
mean
Roughly 99.7 % of the data fall within three standard deviations of
the mean
◦ Note: These numbers come from normal distribution theory, which we’ll see
later.
Ex. Empirical Rule
Assume that we have a symmetric data set with mean $\bar{x} = 515$ and standard deviation $s = 114$.
Then,
Roughly 68% of the data fall between
515-1*114=401 and 515+1*114=629
Roughly 95% of the data fall between _____ and _____
Roughly 99.7% (almost all) of the data fall between
_______ and ______
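The three intervals all have the form mean ± k standard deviations for k = 1, 2, 3, so a small helper can compute them. This is a sketch: the function name `empirical_rule_intervals` is made up, and the inputs 515 and 114 are the values used in the 68% line of the example.

```python
def empirical_rule_intervals(mean, sd):
    """Approximate ranges holding 68%, 95%, and 99.7% of symmetric data."""
    return {
        label: (mean - k * sd, mean + k * sd)
        for k, label in ((1, "68%"), (2, "95%"), (3, "99.7%"))
    }


intervals = empirical_rule_intervals(515, 114)
print(intervals["68%"])  # (401, 629)

# Use these to check your answers for the two blanks:
print(intervals["95%"])
print(intervals["99.7%"])
```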
Why do we compute variance?
Two basketball players with the same shooting percentage may be very
different in terms of consistency.
◦ Which one would you choose to shoot the technical foul?
Two companies may have the same average salary, but very different
distributions.
◦ You want to know about the distribution.
Thus we need to know the spread, or the variability, of the values.

Measures of Spread: Percentiles
Percentile/Quantile: the value at or below which a given
percentage of the values in a distribution fall
◦ Example: Growth Charts for Young Children
◦ For girls 21 months of age, the 95th percentile is 24 lbs:
95% of 21-month-old girls weigh 24 lbs or less
◦ Quartiles: 25th, 50th, and 75th percentiles
◦ Often denoted Q1, Q2, Q3
◦ 50th Percentile is the ________
◦ Percentiles are often used to compare an individual value to
a norm
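To connect the definition to data, the sketch below computes what percentage of a sample falls at or below a given value. The function name `percent_at_or_below` is made up, and the five ages from the earlier example are reused purely as illustration; growth charts are built from far larger reference samples.

```python
def percent_at_or_below(values, x):
    """Percentage of observations that are less than or equal to x."""
    count = sum(1 for v in values if v <= x)
    return 100 * count / len(values)


ages = [75, 85, 95, 75, 78]
print(percent_at_or_below(ages, 85))  # 80.0  (4 of the 5 ages are <= 85)
print(percent_at_or_below(ages, 95))  # 100.0
```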
Measures of Spread: IQR
Interquartile Range: difference between the first (25th percentile) and
third (75th percentile) quartile
Contains central 50% of observations
Example: 75, 85, 95, 75, 78
75, 75, 78, 85, 95
IQR: ______
◦ 50% of the data falls between 75 and 85, so IQR = 85 - 75 = 10
Provide more info by giving IQR (Q1, Q3) = 10 (75, 85)
Quartiles
Q1 is the median of the ordered observations that are
less than or equal to Q2
Q3 is the median of the ordered observations that lie
to the right of Q2.
Interquartile Range (IQR) - gives the spread of the
middle 50% of the data.
IQR = Q3 – Q1
Quartiles
Ex. 56, 67, 68, 72, 74, 75, 88, 90, 97, 99
There are different conventions for finding quartiles! We use
the following convention to find Q1 and Q3:
First, order the data from smallest to largest
Second, find the median (Q2)
Third, find Q1 by finding the middle of the first half of the data
determined by Q2
Fourth, find Q3 by finding the middle of the second half of the data
determined by Q2
Quartiles
Data: 56, 67, 68, 72, 74, 75, 88, 90, 97, 99
The median splits the data into two parts. We want the same number of
observations on both sides of the median. Any point between 74 and 75
would serve the purpose. As a convention, we take the average of the two,
(74+75)/2=74.5, as the median.
◦ Use 74 to determine the end point for first half of data
◦ Use 75 to determine the start point for second half of data
Quartiles
Data: 56, 67, 68, 72, 74, 75, 88, 90, 97, 99
Q1=
Q3=
Quartiles
Ex.
56, 67, 68, 72, 74, 75, 88, 90, 97, 99,100
If the number of data points (n) is odd, then our convention is that Q2 is counted as
part of both the lower half and the upper half
Q2= ; Q1= ; Q3=
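The quartile convention described in these notes (split at the median; when n is odd, count the median in both halves) can be sketched as below. The function names are made up for illustration. Statistical software often uses other conventions, so its quartiles may differ slightly from the ones this sketch returns.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the convention in these notes."""
    ordered = sorted(values)          # first: order the data
    n = len(ordered)
    half = n // 2
    q2 = median(ordered)              # second: find the median
    if n % 2 == 1:
        lower = ordered[:half + 1]    # odd n: median belongs to both halves
    else:
        lower = ordered[:half]
    upper = ordered[half:]
    return median(lower), q2, median(upper)  # third and fourth: Q1 and Q3


q1, q2, q3 = quartiles([75, 85, 95, 75, 78])
print(q1, q2, q3, q3 - q1)  # 75 78 85 10

# Use these to check your answers for the two quartile exercises:
print(quartiles([56, 67, 68, 72, 74, 75, 88, 90, 97, 99]))
print(quartiles([56, 67, 68, 72, 74, 75, 88, 90, 97, 99, 100]))
```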
Range vs. IQR
Example: 40 , 75, 78, 85, 95
What is the Range? IQR?
Range: 55 (40, 95)
IQR: 10 (75, 85)
_??__ is more robust (less sensitive) than _??___
IQR is not as sensitive to shape of distribution or to extreme
values (outliers)
◦ IQR shows the spread of the middle 50% of the observations in a
dataset.
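A quick numerical comparison makes the robustness point concrete. The sketch below swaps the smallest value of Example I (75) for an outlier (40) and recomputes both measures; the helper names are made up, and the IQR follows the quartile convention used in these notes.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def iqr(values):
    """Q3 - Q1, counting the median in both halves when n is odd."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half + 1] if n % 2 == 1 else ordered[:half]
    upper = ordered[half:]
    return median(upper) - median(lower)


def value_range(values):
    return max(values) - min(values)


without_outlier = [75, 75, 78, 85, 95]
with_outlier = [40, 75, 78, 85, 95]

print(value_range(without_outlier), iqr(without_outlier))  # 20 10
print(value_range(with_outlier), iqr(with_outlier))        # 55 10
```

The outlier changes the range from 20 to 55 but leaves the IQR at 10.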
Using Measures of Dispersion
SD: complement to mean
◦ most useful with symmetric, numerical data
Percentiles and IQR:
◦ Use with median for ordinal or skewed data
◦ Use with mean when objective is to compare individual observations
with a set of norms
IQR: describes central 50% of distribution, regardless of shape
Range: emphasizes extreme values
Summary Measures for Numerical Data
Measures of Centrality
◦ Mean
◦ Median
◦ Mode
Measures of Variability
◦ Variance (Standard Deviation)
◦ Range (Minimum – Maximum)
◦ Interquartile Range (IQR: 25% - 75%)
Summarizing Categorical Data
Use Frequency and Percentages
◦ Of the 100 subjects, 40 are male
◦ 116 (58%) of the subjects ceased smoking after
the intervention (Total N=200)
Give enough information that the reader can work out the n in each category and the total N
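A small formatting helper shows one way to report a frequency with its percentage and the total N together. The function name `n_percent` and the output wording are illustrative choices, not a required reporting style.

```python
def n_percent(count, total):
    """Format a frequency as 'n (pct%) of N'."""
    pct = 100 * count / total
    return f"{count} ({pct:.0f}%) of {total}"


print(n_percent(116, 200))  # 116 (58%) of 200
print(n_percent(40, 100))   # 40 (40%) of 100
```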
Summarizing Categorical Data: Cross Tabs
Place of Birth of a sample of NCSU Undergraduates, by Gender

Place of Birth   Female (n=14)   Male (n=20)
Outside of NC    2 (14%)         4 (20%)
NC               12 (86%)        16 (80%)

Row or Column Percentages?
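To see the difference between row and column percentages, both can be computed from the counts in the table. This is a sketch: the nested-dictionary layout and the function names are assumptions made for illustration.

```python
# Counts from the cross tab: rows are place of birth, columns are gender.
counts = {
    "Outside of NC": {"Female": 2, "Male": 4},
    "NC": {"Female": 12, "Male": 16},
}


def column_percentages(table):
    """Each cell as a percentage of its column (gender) total."""
    col_totals = {}
    for row in table.values():
        for col, n in row.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        r: {c: 100 * n / col_totals[c] for c, n in row.items()}
        for r, row in table.items()
    }


def row_percentages(table):
    """Each cell as a percentage of its row (place of birth) total."""
    return {
        r: {c: 100 * n / sum(row.values()) for c, n in row.items()}
        for r, row in table.items()
    }


col_pct = column_percentages(counts)
row_pct = row_percentages(counts)
print(round(col_pct["Outside of NC"]["Female"]))  # 14
print(round(row_pct["Outside of NC"]["Female"]))  # 33
```

The 14% matches the table, so the table reports column percentages: the share of each gender born outside or inside NC.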
Outline of Notes
Describing Data
Reading: Chapter 2
Recall: The Basic Paradigm
◦ Population (Parameters) → Sample (Statistics)
◦ Inference: from sample Statistics back to population Parameters
Does learning introductory statistics in-person increase retention of
knowledge and skills when compared to learning statistics online?
Data source: Final grades of 100 students enrolled in BIOS 501 SP19
and 100 students enrolled in BIOS 501 SP20
Give the following:
Population: biostat students
Sample: 200 students
Parameter: the final grade for anyone taking an introductory stat course
Statistic: average grade
Note a parameter and a statistic have 3 parts
1) Which data we are collecting
2) How we are summarizing these data
3) From whom we are collecting and summarizing
Does learning introductory statistics in-person increase retention of
knowledge and skills when compared to learning statistics online?
Data source: Final grades of 100 students enrolled in BIOS 501
SP19 and 100 students enrolled in BIOS 501 SP20
Comment on Design Aspects:

Observational or Experimental
Cross-sectional or Longitudinal Outcome
Retrospective or Prospective
Descriptive or Inferential?
Random Sampling? Random Assignment? Neither
Comment on what additional information you would need to be confident
in your answers
