
Notes 3 Descriptive Statistics RJMurden 2021


BIOS 500:

Statistical Methods I
Section 003

NOTES 3:
DESCRIPTIVE STATISTICS
DR. RAPHIEL J. MURDEN
2021
Outline of Notes
Types of Variables (Types of Data)
Describing Data
◦ Measures of Centrality and Variability
◦ Frequencies, Percentages

Review Parameter, Statistic, Design

Reading: Chapter 2

Types of Variables (Data)
Categorical (qualitative) variable - places
individuals into one of several groups or categories
◦ non-numerical information, cannot perform arithmetic
with this information

Numerical (quantitative) variable - takes numerical
values for which arithmetic operations make sense;
usually counts or measurements

Example Question of Interest:
Does in-person classroom atmosphere increase the average final
grade as compared to online atmosphere?

Experimental Units: students in the classroom
Response Variable of Interest: final grade
Independent Variable of Interest (Factor): classroom atmosphere

What other variables would you collect?


◦ Age: years (continuous), or categorical (before or after the internet era, circa 2000)
◦ Current GPA: numerical
◦ 1-1 time with professor: numerical
◦ Previous online courses (#): numerical
Example: Final Grades

Student   Group       Gender   Grade
A         In-person   F        70
B         Online      M        80
C         In-person   F        95
D         Online      M        85
…         …           …        …

Which variables are categorical? Group, Gender

Which variables are quantitative? Grade


Example Question of Interest:
Does in-person classroom atmosphere increase the average
final grade as compared to online atmosphere?

Potential Lurking Variables

Go back to your list of other variables you think would be
helpful to collect for this study (i.e., so that they do not become
lurking variables). Identify whether each is a:
◦ Categorical Variable
◦ Quantitative Variable
Categorical Variable: Nominal
Nominal Scale
◦ Categories without specific ordering
◦ Dichotomous/Binary
◦ HPM track (Policy or Management)
◦ Smoker (1: Yes or 0: No)
◦ Yes or No | Insured or uninsured
◦ More than Two Categories
◦ Colors in a Standard Crayon Box
◦ Ethnicity
◦ Example in your field?

Qualitative Observations: describe quality and/or characteristics

Categorical Variable:
Ordinal
Ordinal Scale: “order matters”
◦ Classes of Rheumatoid Arthritis: 1-4
◦ 1: Normal Activity, 4: Wheelchair-bound
◦ Visual Acuity
◦ 20-20, 20-30, 20-40
◦ Example of ordinal with no specific numeric values,
common arithmetic cannot be performed in a
meaningful way

Numerical Scale: Discrete
Values can be listed, even if list continues
indefinitely
“Count” variables: equal to integers
Examples
◦ Number of Fractures
◦ Number of Pipes
◦ Number of Uninsured

Numerical Scale: Continuous
values on a continuum
◦ Variables whose possible values form some
interval of numbers
Often has “decimal points”
◦ Height, Weight, Length, Lifetime of a light bulb
Though often reported to closest integer
◦ Age of Adults in Years
◦ Age of Young Children in Months

What type of variable is
Final Grade?
Type? Categorical, Discrete, Continuous
Final Grade Range: 0 – 100
◦ Final Grade is a continuous variable

Final Grade is: Satisfactory, Unsatisfactory
◦ Final Grade is a categorical, binary variable

Final Grade is: A, A-, B+, B, B-, C, F
◦ Final Grade is a categorical, ordinal variable

What is Statistics?
Statistics is the study of how best to:
◦ (a) design a quantitative study
◦ (b) collect data;
◦ (c) summarize or describe data (a statistic); and
◦ (d) draw formal inferences and practical conclusions based
on data
all the while recognizing the reality of variation.

Study Design → Collecting & Organizing Data → Analyzing Data → Statistical conclusions → Content area conclusions
Measures of Centrality &
Spread

Summarizing Quantitative
Data
In a study examining use of healthcare in the aging population, the age of
the first 5 subjects randomly chosen for inclusion are:
75, 85, 95, 75, 78

Now let’s summarize this data using measures of centrality and spread

Measures of Centrality: Mean
 Sample Mean: arithmetic average of the observations
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$, where $n$ is the # of observations
◦ $\sum_{i=1}^{n} x_i$ is read as “the sum from i=1 to i=n of x sub i”

Example: (75 + 85 + 95 + 75 + 78) / 5 = 81.6
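
The same arithmetic can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative and are not part of the course notes.

```python
# Ages of the first five subjects, taken from the example above.
ages = [75, 85, 95, 75, 78]

# Sample mean: add up the observations, then divide by how many there are.
n = len(ages)
sample_mean = sum(ages) / n

print(f"n = {n}, sum = {sum(ages)}, mean = {sample_mean}")
```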

Measures of Centrality:
Median
Median: the middle observation
◦ Half the observations are below median and half
are above
◦ Procedure for calculating
◦ Arrange observations from smaller to larger
◦ Count in to find middle observation
◦ Odd Number of Observations: median is the middle value
◦ Even Number of Observations: median is the mean of
two middle values

Median
Example I: 75, 85, 95, 75, 78
◦ Ordered: 75, 75, 78, 85, 95
◦ The Median is 78

Example II: 2, 5, 1, 8, 4, 3
◦ Ordered: 1, 2, 3, 4, 5, 8
◦ The Median is 3.5 (the value midway between 3 and 4)
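
The procedure for the median (sort, then take the middle value or the mean of the two middle values) can be sketched in Python as follows. The function name `median` is chosen here for illustration.

```python
def median(values):
    """Median by the sort-and-count-in procedure described in the notes."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)          # arrange from smaller to larger
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd n: the single middle value
        return ordered[mid]
    # even n: mean of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([75, 85, 95, 75, 78]))   # 78
print(median([2, 5, 1, 8, 4, 3]))     # 3.5
```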

Measures of Centrality: Mode
Mode: value that occurs most often
Example I: 75, 85, 95, 75, 78
◦ Mode is 75
Example II: 2, 5, 1, 8, 4, 3
◦ No mode
Example III: 5, 10, 16, 5, 12, 13, 10, 18, 23
◦ Modes are 5 and 10 (bimodal)

In general, a variable may have/be:
◦ No Mode
◦ Unimodal
◦ Bimodal
◦ Multimodal
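
A small Python sketch of finding the mode(s), assuming the convention used in these examples that a data set in which every value occurs exactly once has no mode. The function name `modes` is illustrative.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:                  # every value occurs once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([75, 85, 95, 75, 78]))                  # [75]
print(modes([2, 5, 1, 8, 4, 3]))                    # []
print(modes([5, 10, 16, 5, 12, 13, 10, 18, 23]))    # [5, 10]
```

Note that Python's built-in `statistics.multimode` follows a different convention: it returns every value when all of them occur once.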

Using Measures of Centrality
Which is Best? depends on
◦ Scale of measurement (ordinal or continuous)
◦ Shape of Distribution of Observations (skewed?)

Mean: numerical data; symmetrical dist.


◦ Sensitive to outliers, extreme values

Median: ordinal data, skewed dist.


Mode: bimodal dist.; designates the value that occurs most often
◦ Can be useful for discrete variables

Symmetric vs. Skewed Distribution
Symmetric Distribution: has same shape on both sides of mean

Skewed Distributions
◦ Right skewed = positively skewed (long tail to the right)
◦ Left skewed = negatively skewed (long tail to the left)

Measures of Spread

Measures of Spread: Range
Range: difference between largest and smallest
observation
Example I: 75, 85, 95, 75, 78
Range = 95 - 75 = 20
◦ Be careful: the range can be misleading
◦ Example Ia: 40, 75, 85, 95, 75, 78, 82, 88; Range = 95 - 40 = 55
◦ Report the min and max when provided; this gives a little more information but
can still be misleading
◦ Example I: Min: 75, Max: 95
◦ Example Ia: Min: 40, Max: 95
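
A short Python sketch of both examples, showing how a single low value changes the range. The names `data_range`, `example_1` and `example_1a` are illustrative.

```python
example_1 = [75, 85, 95, 75, 78]
example_1a = [40, 75, 85, 95, 75, 78, 82, 88]

def data_range(values):
    """Range: largest observation minus smallest observation."""
    return max(values) - min(values)

for name, data in [("Example I", example_1), ("Example Ia", example_1a)]:
    # Reporting min and max alongside the range gives a little more context.
    print(f"{name}: range = {data_range(data)} (min {min(data)}, max {max(data)})")
```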

Sample Variance and Standard Deviation

Let $\bar{x}$ be the mean of a sample $x_1, \dots, x_n$. The (sample) variance is
given by

$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

**The (sample) standard deviation is the non-negative square root of the variance.**

Sample Variance and Standard Deviation
If observed values are far apart, the variance and
standard deviation will be relatively large

Variance and standard deviation are strongly affected by outliers
Properties of sample variance
Note that variance and standard deviation are
greater than or equal to zero.
They are zero only when all the observed values are equal
If we multiply all of the numbers in the data set by a
non-zero constant c, the effect is to
◦ multiply the variance by the square of c
◦ multiply the standard deviation by |c| (the absolute value of c)
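
The scaling property can be checked numerically. The sketch below uses Python's standard `statistics` module, whose `variance` and `stdev` functions compute the sample (n - 1) versions; the data set and the constants 3 and -2 are chosen only for illustration.

```python
import statistics

data = [20, 40, 10, 90]

def scaled_ratios(values, c):
    """Return (variance ratio, SD ratio) after multiplying every value by c."""
    scaled = [c * x for x in values]
    var_ratio = statistics.variance(scaled) / statistics.variance(values)
    sd_ratio = statistics.stdev(scaled) / statistics.stdev(values)
    return var_ratio, sd_ratio

# Expect roughly (c squared, absolute value of c) in each case.
print(scaled_ratios(data, 3))
print(scaled_ratios(data, -2))
```

With a negative constant the standard deviation ratio is still positive, since a standard deviation can never be negative.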
Calculate the SD and Variance
By Hand!
Example I: 75, 85, 95, 75, 78
◦ Mean 81.6
◦ Var 72.8
◦ SD 8.53

Example II: 20, 40, 10, 90
◦ Mean 40
◦ Var 1266.67
◦ SD 35.59
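
To check a by-hand calculation, the steps of the sample variance formula can be written out in Python. This is a sketch that follows the formula term by term; the function names are illustrative.

```python
import math

def sample_variance(values):
    """Sample variance: sum of squared deviations divided by (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)

def sample_sd(values):
    """Sample standard deviation: non-negative square root of the variance."""
    return math.sqrt(sample_variance(values))

for data in ([75, 85, 95, 75, 78], [20, 40, 10, 90]):
    print(data, round(sample_variance(data), 2), round(sample_sd(data), 2))
```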

Empirical Rule
If the data are symmetric (roughly bell-shaped), we can use the Empirical Rule
(or 68-95-99.7 rule) as an approximation
Roughly 68% of the data fall within one standard deviation of the
mean
Roughly 95% of the data fall within two standard deviations of the
mean
Roughly 99.7 % of the data fall within three standard deviations of
the mean

◦ Note: These numbers come from normal distribution theory, which we’ll see
later.
Ex. Empirical Rule
Assume that we have a symmetric data set with $\bar{x} = 515$ and $s = 114$.
Then,
Roughly 68% of the data fall between
515 - 1*114 = 401 and 515 + 1*114 = 629

Roughly 95% of the data fall between _____ and _____


Roughly 99.7% (almost all) of the data fall between
_______ and ______
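
The three intervals can be computed with a small Python helper. This sketch assumes the mean of 515 and standard deviation of 114 that appear in the worked 68% line; the function name is illustrative.

```python
def empirical_rule_intervals(mean, sd):
    """Intervals mean ± k*sd for k = 1, 2, 3, keyed by approximate coverage."""
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {label: (mean - k * sd, mean + k * sd)
            for k, label in coverage.items()}

intervals = empirical_rule_intervals(515, 114)
for label, (low, high) in intervals.items():
    print(f"Roughly {label} of the data fall between {low} and {high}")
```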
Why do we compute
variance?
Two basketball players with the same shooting percentage may be very
different in terms of consistency.
◦ Which one would you choose to shoot the technical foul shot?
Two companies may have the same average salary, but very different
distributions.
◦ You want to know about the distribution.

Thus we need to know the spread, or the variability of the values.


Measures of Spread:
Percentiles
 
Percentile/Quantile: the p-th percentile is the value such that p% of the
values in a distribution are less than or equal to it
◦ Example: Growth Charts for Young Children
◦ For girls 21 months of age, the 95th percentile is 24 lbs:
95% of 21-month-old girls weigh 24 lbs or less
◦ Quartiles: 25th, 50th, and 75th percentiles
◦ Often denoted Q1, Q2, Q3
◦ 50th Percentile is the ________
◦ Percentiles often used to Compare Individual Value to
a Norm
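
To compare an individual value to a set of values, one can compute the percentage of values at or below it. The sketch below reuses the five ages from the earlier example; the function name `percent_at_or_below` is hypothetical.

```python
def percent_at_or_below(values, x):
    """Percentage of the observations that are less than or equal to x."""
    if not values:
        raise ValueError("need at least one observation")
    return 100 * sum(v <= x for v in values) / len(values)

ages = [75, 85, 95, 75, 78]
print(percent_at_or_below(ages, 78))   # 60.0: three of the five ages are <= 78
```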

Measures of Spread: IQR
Interquartile Range: difference between the first (25th percentile) and
third (75th percentile) quartile
Contains central 50% of observations
Example: 75, 85, 95, 75, 78
75, 75, 78, 85, 95

IQR: ______

 50% of the data falls between 75 and 85; so IQR= 10

Provide more info by giving
IQR (Q1, Q3) = 10 (75, 85)

Quartiles
Q1 is the median of the ordered observations that are
less than or equal to Q2
Q3 is the median of the ordered observations that lie
to the right of Q2.

Interquartile Range (IQR) - gives the spread of the
middle 50% of the data.
IQR = Q3 – Q1

Quartiles
Ex. 56, 67, 68, 72, 74, 75, 88, 90, 97, 99
There are different conventions for finding quartiles! We choose to use
the following convention: To find Q1 and Q3:
First order the data from smallest to largest
Second find the median
Third find Q1 by finding the middle of the first half of the data
determined by Q2
Fourth find Q3 by finding the middle of the second half of the data
determined by Q2
Quartiles
Data: 56, 67, 68, 72, 74, 75, 88, 90, 97, 99
The median splits the data into two parts. We want the same number of
observations on both sides of the median. Any point between 74 and 75
would serve the purpose. As a convention, we take the average of the two,
(74+75)/2 = 74.5, as the median.
◦ Use 74 to determine the end point for first half of data
◦ Use 75 to determine the start point for second half of data
Quartiles
Data: 56, 67, 68, 72, 74, 75, 88, 90, 97, 99

Q1=

Q3=
Quartiles
Ex. 56, 67, 68, 72, 74, 75, 88, 90, 97, 99, 100
If the number of data points (n) is odd, then our convention is that Q2 is counted as
part of both the lower half and the upper half
Q2= ; Q1= ; Q3=
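
The quartile convention described in these notes (split at the median, and for odd n count Q2 in both halves) can be sketched in Python. The function names are illustrative, and running the code gives the values for the two exercises above.

```python
def median(values):
    """Middle value of the sorted data, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the convention stated in these notes."""
    if not values:
        raise ValueError("need at least one observation")
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # Odd n: the median value belongs to both halves.
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # Even n: the data split cleanly into two halves.
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(ordered), median(upper)

even_data = [56, 67, 68, 72, 74, 75, 88, 90, 97, 99]
odd_data = even_data + [100]
for data in (even_data, odd_data):
    q1, q2, q3 = quartiles(data)
    print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {q3 - q1}")
```

Library routines such as `numpy.percentile` use interpolation conventions of their own, so their quartiles may differ slightly from the ones produced by this convention.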
Range vs. IQR
Example: 40 , 75, 78, 85, 95
What is the Range? IQR?
Range: 55 (40, 95)
IQR: 10 (75, 85)
_??__ is more robust (less sensitive) than _??___
IQR is not as sensitive to shape of distribution or to extreme
values (outliers)
◦ IQR shows the spread of the middle 50% of the observations in a
dataset.

Using Measures of Dispersion
SD: complement to mean
◦ most useful with symmetric, numerical data

Percentiles and IQR:


◦ Use with median for ordinal or skewed data
◦ Use with mean when objective is to compare individual observations
with a set of norms

IQR: describe central 50% of distribution, regardless of shape


Range: emphasize extreme values

Summary Measures for Numerical Data

Measures of Centrality
◦ Mean
◦ Median
◦ Mode

Measures of Variability
◦ Variance (Standard Deviation)
◦ Range (Minimum – Maximum)
◦ Interquartile Range (IQR: 25% - 75%)

Summarizing Categorical
Data
Use Frequency and Percentages
◦ Of the 100 subjects, 40 are male
◦ 116 (58%) of the subjects ceased smoking after
the intervention
Total N=200

Give enough information that the reader can work out the n in each category and the total N

Summarizing Categorical Data:
Cross Tabs

Place of Birth of a sample of NCSU Undergraduates, by Gender

                 Female (n=14)   Male (n=20)
Outside of NC    2 (14%)         4 (20%)
NC               12 (86%)        16 (80%)

Row or Column Percentages?
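
One way to see the difference between column and row percentages is to compute both from the counts in the table. This Python sketch stores the counts in a nested dictionary; the layout and function names are illustrative.

```python
# Counts from the cross tab above: outer keys are columns (Gender),
# inner keys are rows (Place of Birth).
counts = {
    "Female": {"Outside of NC": 2, "NC": 12},
    "Male": {"Outside of NC": 4, "NC": 16},
}

def column_percentages(table):
    """Percent of each column (gender) total that falls in each row."""
    result = {}
    for column, cells in table.items():
        column_total = sum(cells.values())
        result[column] = {row: 100 * count / column_total
                          for row, count in cells.items()}
    return result

def row_percentages(table):
    """Percent of each row (place of birth) total that falls in each column."""
    row_totals = {}
    for cells in table.values():
        for row, count in cells.items():
            row_totals[row] = row_totals.get(row, 0) + count
    return {column: {row: 100 * count / row_totals[row]
                     for row, count in cells.items()}
            for column, cells in table.items()}

col_pct = column_percentages(counts)
row_pct = row_percentages(counts)
print(round(col_pct["Female"]["NC"]))  # 86: share of females born in NC
print(round(row_pct["Female"]["NC"]))  # 43: share of NC-born who are female
```

The percentages printed in the table match the column version, since each gender's percentages add to 100%.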

Recall: The Basic Paradigm
Population Sample

Inference

Parameters Statistics

Does learning introductory statistics in-person increase retention of
knowledge and skills when compared to learning statistics online?
Data source: Final grades of 100 students enrolled in BIOS 501 SP19
and 100 students enrolled in BIOS 501 SP20

Give the following:


Population: biostat students
Sample: 200 students
Parameter: the average final grade for anyone taking an introductory stat course
Statistic: average final grade of the 200 sampled students
Note a parameter and a statistic have 3 parts
1) Which data are we collecting
2) How are we summarizing this data
3) From whom are we collecting and summarizing
Does learning introductory statistics in-person increase retention of
knowledge and skills when compared to learning statistics online?
Data source: Final grades of 100 students enrolled in BIOS 501
SP19 and 100 students enrolled in BIOS 501 SP20

Comment on Design Aspects:


Observational or Experimental
Cross-sectional or Longitudinal Outcome
Retrospective or Prospective
Descriptive or Inferential?
Random Sampling? Random Assignment? Neither

Comment on what additional information you would need to be confident


in your answers
