02 Simple-Regression-An-Overview Simple Regression
► In this set of lectures, we will develop a framework for simple linear, logistic, Poisson, and
Cox proportional hazards regression in the first section
► The remaining sections will focus on simple linear regression, a general framework for
estimating the mean of a continuous outcome based on a single predictor (which may be
binary, categorical, or continuous)
Simple Regression: An Overview
The material in this video is subject to the copyright of the owners of the material and is being provided for educational purposes under
rules of fair use for registered students in this course only. No additional copies of the copyrighted work may be made or distributed.
Objectives
Link to Methods From Term 1—1
► Regression provides a general framework for the estimation and testing procedures that
we covered in the first term
► All methods we covered in term 1 can be done as simple regression models
► Additionally, these models can be extended to allow for analyses beyond the scope of
comparing outcomes across levels of a single predictor (adjustment, prediction with
multiple predictors)
5
Link to Methods From Term 1—2
► For example:
► Comparing means between two or more groups (mean difference(s), t-test, ANOVA)
can be done via a simple linear regression model
► Comparing proportions between two or more groups (odds ratio, chi-square) can be
done via a simple logistic regression model
► Comparing incidence rates between two or more groups (incidence rate ratio, log rank)
can be done via a simple Poisson or Cox proportional hazards regression model
6
Basic Structure of a Simple Regression Model
7
Basic Structure: The Left-Hand Side—1
► The left-hand side, $f(y)$, depends on what variable type the outcome of interest ($y$) is:
► For continuous outcomes, the left-hand side is the mean of the outcome $y$, $\bar{y}$, and the regression type is linear regression
► For binary outcomes, the left-hand side is the $\ln(\text{odds})$ of the binary outcome, i.e., $\ln\left(\frac{p}{1-p}\right)$, where $p$ is the probability that $y = 1$ (proportion of $y$ values equal to 1)
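As a small illustration of the log-odds transformation described above, the following Python sketch converts a probability to $\ln(p/(1-p))$ and back. The function names `log_odds` and `inverse_log_odds` are illustrative choices, not part of the lecture.

```python
import math


def log_odds(p: float) -> float:
    """Return ln(p / (1 - p)) for a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inverse_log_odds(value: float) -> float:
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-value))


# A probability of 0.5 corresponds to odds of 1 and log odds of 0.
half = log_odds(0.5)
# Converting to log odds and back recovers the original probability.
round_trip = inverse_log_odds(log_odds(0.2))
```

Probabilities below 0.5 give negative log odds and probabilities above 0.5 give positive log odds, which is why the left-hand side of a logistic regression can take any real value.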
8
Basic Structure: The Left-Hand Side—2
► The left-hand side, $f(y)$, depends on what variable type the outcome of interest ($y$) is:
► For time-to-event outcomes where the individual event and censoring times are not known, $y$ is a yes/no indicator of whether the event occurred in the common follow-up period; the left-hand side is $\ln(\text{incidence rate})$, and the regression type is Poisson regression
► For time-to-event outcomes where the individual event and censoring times are known, $y$ is a composite outcome taking into account both the time and whether the event occurred; the left-hand side is $\ln(\text{hazard rate})$, and the regression type is Cox regression
Basic Structure: The Right-Hand Side
► The right-hand side, $\beta_0 + \beta_1 x_1$, includes the predictor of interest, $x_1$
► This predictor of interest can be continuous, binary, or categorical (in which case, it will be represented by more than one $x$, as we’ll see)
Interpretations of Results When Predictor Is Binary
► There are only two possible values of $x_1$: 1 (female) and 0 (male)
► When $x_1 = 1$: $\text{LHS} = \beta_0 + \beta_1(1) = \beta_0 + \beta_1$
► When $x_1 = 0$: $\text{LHS} = \beta_0 + \beta_1(0) = \beta_0$
► Interpretations:
► $\beta_0$ = value of LHS when $x_1 = 0$ (i.e., for males)
► $\beta_1$ = difference in value of LHS for $x_1 = 1$, compared to $x_1 = 0$ (i.e., for females compared to males)
Interpretations of Results When Predictor Is Nominal—1
► How should $x$ be coded when the predictor of interest is a nominal category, for example, clinic site (Hopkins, U of Maryland, U of Michigan)?
► For handling multiple nominal categories, the approach is to designate one of the groups as the “reference category” and create binary $x$’s for each of the other groups
► For example, if we make Hopkins the reference, we will need additional variables:
12
Interpretations of Results When Predictor Is Nominal—2
► The resulting regression model is $\text{LHS} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, where LHS = “left-hand side”
► Interpretations:
13
Interpretations of Results When Predictor Is Continuous
► The beauty of regression is that it allows for continuous predictors, unlike the methods we
learned in Term 1
► This is an efficient approach to handling measurements that are made continuously (age,
height, etc.) without arbitrarily having to categorize them (if the outcome/predictor
association is well characterized by a line)
► For example, suppose $x_1$ is age in years, and the regression equation is $\text{LHS} = \beta_0 + \beta_1 x_1$
The Intercept
► The intercept, $\beta_0$, is the value of the left-hand side (LHS) when $x_1 = 0$
► It is the point on the graph where the line crosses the vertical ($y$) axis, at the coordinate $(0, \beta_0)$
The Slope—1
► The slope, $\beta_1$, is the change in the left-hand side corresponding to a unit increase in $x_1$
The Slope—2
► The slope, $\beta_1$, is the change in the left-hand side corresponding to a unit increase in $x_1$: in other words, $\beta_1$ is the difference in the left-hand side for $x_1 + 1$, compared to $x_1$
► This change/difference is the same across the entire line
The Slope—3
► All information about the difference in the left-hand side for two differing values of $x_1$ is contained in the slope!
► For example: two values of $x_1$ three units apart will have a difference in left-hand side values of $3\beta_1$
Summary
19
Simple Linear Regression with a Binary
Predictor
Learning Objectives
► Understand that linear regression provides a framework for estimating means and mean
differences
► Interpret the estimated slope and intercept from a simple linear regression model with a
binary predictor
2
The Left-Hand Side (LHS)—1
► For linear regression, the equation is relatively straightforward: the regression models the mean value of a continuous outcome ($y$) as a function of the predictor $x_1$
► As noted in the previous section, $x_1$ can be binary, nominal categorical (in which case, it will be represented by more than one $x$), or continuous
► As with everything else we have done thus far, we will only be able to estimate the regression equation from a sample of data: to indicate the estimates, we write the estimated equation as $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$, where $\hat{y}$ denotes the estimated mean of $y$
The Left-Hand Side (LHS)—2
► For a given value of $x_1$, we can estimate the mean of $y$ via the equation $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$
► The slope compares the mean value of $y$ for two groups who differ by one unit of $x_1$ and, hence, is interpretable as a mean difference
Arm Circumference and Sex of Child—1
► Data on anthropometric measures from a random sample of 150 Nepali children 0–12
months old
► Question: What is the relationship between average arm circumference and sex of a
child?
► Data:
► Arm circumference: Mean 12.4 cm, SD 1.5 cm, range 7.3 cm–15.6 cm
► Sex: 51% female
5
Arm Circumference and Sex of Child—2
► Boxplot for these data: $\bar{y}_{\text{male}} = 12.5$ cm, and $\bar{y}_{\text{female}} = 12.37$ cm
Arm Circumference and Sex of Child—3
► Scatterplot
7
Arm Circumference and Sex of Child—4
► Here, $y$ is arm circumference, a continuous measure; $x_1$ is not continuous but binary (male or female)
Arm Circumference and Sex of Child—5
► Notice: This equation is only estimating two values: mean arm circumference for male children and mean arm circumference for female children
► For female children: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1(1) = \hat{\beta}_0 + \hat{\beta}_1$
► For male children: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1(0) = \hat{\beta}_0$
► So the slope, $\hat{\beta}_1$, estimates the mean difference in arm circumference for female children compared to male children
► It is the difference in $\hat{y}$ for a one-unit difference in $x_1$, and the only possible one-unit difference in $x_1$ is the difference between $x_1 = 1$ and $x_1 = 0$
Arm Circumference and Sex of Child—6
► The resulting equation (estimated via a computer): $\hat{y} = 12.5 + (-0.13)x_1$
► $\hat{\beta}_1 = -0.13$: The estimated mean difference in arm circumference for female children compared to male children is −0.13 cm; female children have lower arm circumference by 0.13 cm on average
► $\hat{\beta}_0 = 12.5$: The mean arm circumference for male children (reference group) is 12.5 cm
► $\hat{\beta}_0 + \hat{\beta}_1 = 12.5\ \text{cm} + (-0.13\ \text{cm}) = 12.37$ cm, the mean arm circumference for female children
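The link between the regression coefficients and the two group means can be checked numerically. The sketch below uses simulated data (not the Nepali sample) with a hypothetical 0/1 indicator, and fits the line with NumPy's `polyfit`; it is meant only to illustrate that, with a binary predictor, the fitted intercept equals the mean of the $x_1 = 0$ group and the fitted slope equals the difference in group means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated, hypothetical data: x1 is a 0/1 indicator (1 = female),
# y is an arm-circumference-like outcome in cm.
x1 = rng.integers(0, 2, size=150)
y = 12.5 - 0.13 * x1 + rng.normal(0.0, 1.5, size=150)

# polyfit with deg=1 returns [slope, intercept] for the least squares line.
slope, intercept = np.polyfit(x1, y, deg=1)

mean_group0 = y[x1 == 0].mean()   # mean of the reference group (x1 = 0)
mean_group1 = y[x1 == 1].mean()   # mean of the comparison group (x1 = 1)

print(f"intercept = {intercept:.3f}, mean of x1=0 group = {mean_group0:.3f}")
print(f"slope     = {slope:.3f}, difference in means   = {mean_group1 - mean_group0:.3f}")
```

Because the data are simulated, the printed values will not match 12.5 and −0.13 exactly; the point is that each pair of printed numbers agrees.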
10
Arm Circumference and Sex of Child—7
► What does the resulting regression line look like in this example?
11
Arm Circumference and Sex of Child—8
► The variable $x_1$ could instead have been coded as 1 for male children and 0 for female children, and the following regression line could be fit: $\hat{y} = \hat{\beta}_0^* + \hat{\beta}_1^* x_1$ (I added a “*” to each beta simply to distinguish these values from the values under the original coding for $x_1$)
► Can you figure out, based on the previous regression results given, the values of $\hat{\beta}_0^*$ and $\hat{\beta}_1^*$, based on the same data? (I will show how to ascertain this in the Additional Examples section as well)
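One way to reason about the recoding question is to work from the two group means, which do not depend on the coding. The following sketch assumes only the estimates quoted above (12.5 and −0.13); the variable names are illustrative.

```python
# Estimates under the original coding (x1 = 1 for female, 0 for male).
b0, b1 = 12.5, -0.13

# The two group means implied by the original fit.
mean_male = b0            # x1 = 0 group
mean_female = b0 + b1     # x1 = 1 group

# Under the reversed coding (x1 = 1 for male, 0 for female), the intercept is
# the mean of the new reference group and the slope is the mean difference
# taken in the opposite direction.
b0_star = mean_female
b1_star = mean_male - mean_female

print(round(b0_star, 2), round(b1_star, 2))
```

Under this reasoning the intercept becomes the female mean (12.37 cm) and the slope keeps its magnitude but changes sign (+0.13 cm).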
12
Hospital Length of Stay and Age of Visit—1
► For these data, $\bar{y}_{\le 40\ \text{years}} = 2.74$ days and $\bar{y}_{>40\ \text{years}} = 4.87$ days
► We can use an equation of the form $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$, where $\hat{y}$ is mean length of stay and $x_1 = 1$ for persons >40 years old and 0 for persons ≤40 years old
► The resulting equation for these data (estimated via computer) is $\hat{y} = 2.74 + 2.13x_1$
► For these results:
► $\hat{\beta}_0 = 2.74$ days, the mean length of stay for persons ≤40 years old
► $\hat{\beta}_1 = 2.13$ days, the mean difference in length of stay for persons >40 years old compared to persons ≤40 years old
► $\hat{\beta}_0 + \hat{\beta}_1 = 2.74\ \text{days} + 2.13\ \text{days} = 4.87$ days, the mean length of stay for persons >40 years old
Summary
► Simple linear regression is a method for estimating the relationship between the mean value of a continuous outcome, $y$, and a predictor $x_1$, via a linear equation $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$
► When $x_1$ is binary, the slope $\hat{\beta}_1$ estimates the mean difference in $y$ for the group with $x_1 = 1$ compared to the group with $x_1 = 0$; the intercept $\hat{\beta}_0$ is the estimated mean of $y$ for the group with $x_1 = 0$
Simple Linear Regression with a
Categorical Predictor
Learning Objectives
► Understand that linear regression provides a framework for estimating means and mean
differences
► Interpret the estimated slope(s) and intercept from a simple linear regression model with
a nominal categorical predictor
2
Categorical Predictors
► Sometimes, regression scenarios include predictors that are not continuous, not binary,
but multi-categorical
► Examples
► Subject’s race (White, African American, Hispanic, Asian, Other)
► City of residence (Baltimore, Chicago, Tokyo, Madrid)
3
US Academic Physician Salaries: Regional Differences—1
► Question:
► Do average salaries differ by geographical region, and, if so, what is the magnitude of
these differences?
Source: Jagsi, R., et al. (2012). Gender differences in the salaries of physician researchers. JAMA, 307(22), 2410–2417.
US Academic Physician Salaries: Regional Differences—2
5
U.S. Academic Physician Salaries: Regional Differences—3
► Approach 1: Arbitrarily give each region a numerical value ($x_1 = 1$ for West, 2 for Midwest, 3 for South, and 4 for Northeast, for example), and fit a simple linear regression (SLR) of $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$, where $\hat{y}$ is estimated mean salary
► BAD IDEA!!!!
► Coding is arbitrary; we could have assigned $x_1 = 1$ for Midwest, etc.
► The value of $\hat{\beta}_1$ will depend on the arbitrary coding, and the results will not be equivalent for different coding schemes
► Coding “assumes” mean salary differences between regions are “incremental”
● Example: the difference in average salaries between physicians in the South ($x_1 = 3$) and West ($x_1 = 1$) is forced to be twice the difference between physicians in the Midwest ($x_1 = 2$) and West ($x_1 = 1$)
US Academic Physician Salaries: Regional Differences—4
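► Approach 2: Designate one region as the reference group and create a binary $x$ for each of the other regions
► We’ll show that this approach does not force a structure on the $y$/$x$ association that depends on the coding and that the overall results will be equivalent regardless of which reference group is used for the comparisons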
US Academic Physician Salaries: Regional Differences—5
► Fit the regression model $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3$
► Here, each of the 3 slopes ($\hat{\beta}_1$, $\hat{\beta}_2$, and $\hat{\beta}_3$) estimates the mean salary difference between a region that has a corresponding $x$ value of 1 and the reference region (Western states)
► The intercept, $\hat{\beta}_0$, is the estimated mean salary for physicians from the West
US Academic Physician Salaries: Regional Differences—6
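► For example:
► For physicians in the Midwest ($x_1 = 1$, $x_2 = 0$, $x_3 = 0$), the model predicts $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1(1) + \hat{\beta}_2(0) + \hat{\beta}_3(0) = \hat{\beta}_0 + \hat{\beta}_1$
► For physicians in the West ($x_1 = 0$, $x_2 = 0$, $x_3 = 0$), the model predicts $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1(0) + \hat{\beta}_2(0) + \hat{\beta}_3(0) = \hat{\beta}_0$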
US Academic Physician Salaries: Regional Differences—7
► Resulting regression equation (based on results presented by the authors in the article)
10
US Academic Physician Salaries: Regional Differences—8
► What if you want the mean difference in salaries for physicians in the Northeast
compared to physicians in the Midwest?
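A comparison between two non-reference groups can be obtained as the difference between their two slopes, because the reference-group mean cancels. The sketch below illustrates this with simulated salaries (hypothetical values, not those reported in the article) and a helper, `fit_with_reference`, that is an illustrative name rather than anything from the lecture. It also shows that refitting with a different reference group gives the same comparison.

```python
import numpy as np

rng = np.random.default_rng(1)

regions = ["West", "Midwest", "South", "Northeast"]
# Hypothetical group means in $1000s, used only to simulate data.
true_means = {"West": 200.0, "Midwest": 190.0, "South": 195.0, "Northeast": 185.0}

labels = rng.choice(regions, size=400)
salary = np.array([true_means[str(r)] for r in labels]) + rng.normal(0.0, 20.0, size=400)


def fit_with_reference(reference):
    """Fit y = b0 + sum of b_k * indicator_k, one indicator per non-reference region."""
    others = [r for r in regions if r != reference]
    columns = [np.ones(len(labels))] + [(labels == r).astype(float) for r in others]
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, salary, rcond=None)
    return dict(zip(["intercept"] + others, coef))


west_ref = fit_with_reference("West")
midwest_ref = fit_with_reference("Midwest")

# Northeast vs. Midwest from the West-reference fit: difference of two slopes.
diff_from_west_fit = west_ref["Northeast"] - west_ref["Midwest"]
# The same comparison read directly from the Midwest-reference fit.
diff_from_midwest_fit = midwest_ref["Northeast"]

print(round(diff_from_west_fit, 3), round(diff_from_midwest_fit, 3))
```

The two printed values agree, which is one way to see that the overall results do not depend on which group is chosen as the reference.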
11
Systolic Blood Pressure and Ethnicity—1
► Data from National Health and Nutrition Examination Survey (NHANES), 2013–2014
► Data include 10,000+ observations on persons 0–80 years old and 7,172 systolic blood
pressure (SBP) measurements on persons 8–80 years old
12
Systolic Blood Pressure and Ethnicity—2
► Designate one group as the reference group, for example, Mexican-Americans, and make binary indicators for each of the four other ethnicity categories
► $x_1 = 1$ if Hispanic, 0 otherwise
► $x_2 = 1$ if Non-Hispanic White, 0 otherwise
► $x_3 = 1$ if Non-Hispanic Black, 0 otherwise
► $x_4 = 1$ if Other, 0 otherwise
Systolic Blood Pressure and Ethnicity—3
► Fit the regression model $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4$
► Here, each of the 4 slopes ($\hat{\beta}_1$, $\hat{\beta}_2$, $\hat{\beta}_3$, and $\hat{\beta}_4$) estimates the mean SBP difference between the group that has a corresponding $x$ value of 1 and the reference group (Mexican-Americans)
Systolic Blood Pressure and Ethnicity—4
► Resulting model:
15
Summary
► Simple linear regression is a method for estimating the relationship between the mean value of an outcome, $y$, and a predictor $x_1$, via a linear equation
► When $x$ is nominal categorical (can also be done with ordinal), designate one category the reference group and make separate binary $x$’s for all other categories
Simple Linear Regression with a
Continuous Predictor
Learning Objectives
► Interpret the estimated slope and intercept from a simple linear regression model with a continuous $x_1$
Arm Circumference and Height—1
► Data on anthropometric measures from a random sample of 150 Nepali children 0–12
months old
► Question: What is the relationship between average arm circumference and height?
► Data:
► Arm circumference: Mean 12.4 cm, SD 1.5 cm, range 7.3 cm–15.6 cm
► Height: Mean 61.6 cm, SD 6.3 cm, range 40.9 cm–73.3 cm
3
Arm Circumference and Height—2
► Approach 1: Dichotomize height at the median, compute the mean difference (and 95% CI) in arm circumference for taller children compared to shorter children (or do as SLR)
► The corresponding simple regression
result:
4
Arm Circumference and Height—3
5
Arm Circumference and Height—4
► Approach 2: Categorize height into four categories by quartile, compare mean arm
circumference via mean differences across height quartiles
► The corresponding simple regression
result:
6
Arm Circumference and Height—5
7
Arm Circumference and Height—6
8
Arm Circumference and Height—7
► What about treating height as continuous when estimating the arm circumference/height
relationship?
9
Arm Circumference and Height—8
► A useful visual display for assessing the nature of association between two continuous
variables: A scatterplot
10
Arm Circumference and Height—9
► A regression can be estimated via computer, of the form $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$, where $x_1$ is height in cm, and $\hat{y}$ is the average arm circumference (cm) for a group of children all of the same height, $x_1$
► For these data on 150 Nepalese children less than 12 months old, the estimated regression line is $\hat{y} = 2.7 + 0.16x_1$
Arm Circumference and Height—10
12
Arm Circumference and Height—11
► Estimated mean arm circumference for children 60 cm tall: $\hat{y} = 2.7 + 0.16(60) = 12.3$ cm
Arm Circumference and Height—12
► Notice, most points don’t fall directly on the line; we are estimating the mean arm
circumference of children 60 cm tall, but observed points vary about the estimated mean
14
Arm Circumference and Height—13
► Recall, for these data on 150 Nepalese children less than 12 months old, the estimated regression line is $\hat{y} = 2.7 + 0.16x_1$
► Here, the estimated slope $\hat{\beta}_1 = 0.16$; what is the interpretation of this?
► $\hat{\beta}_1$ is the average change in arm circumference for a one-unit (1 cm) increase in height
► $\hat{\beta}_1$ is the mean difference in arm circumference for two groups of children who differ by one unit (1 cm) in height, taller to shorter
► This result estimates that the mean difference in arm circumferences for a 1 cm difference in height is 0.16 cm, with taller children having greater average arm circumference
Arm Circumference and Height—14
► This mean difference estimate is constant across the entire height range in the sample:
Definition of a slope of a line
16
Arm Circumference and Height—15
17
Arm Circumference and Height—16
► What is the estimated mean difference in arm circumference for children 70 cm tall versus children 60 cm tall?
$\hat{y}_{x_1=70} = 2.7 + 0.16(70)$
$\hat{y}_{x_1=60} = 2.7 + 0.16(60)$
► And: $\hat{y}_{x_1=70} - \hat{y}_{x_1=60} = 0.16(70) - 0.16(60) = 0.16(70 - 60) = 1.6$ cm
Arm Circumference and Height—17
19
Arm Circumference and Height—18
► The range of observed heights in the sample is 40.9 cm–73.3 cm: These regression results
only apply to the relationship between arm circumference and height for this height range
20
Arm Circumference and Height—19
► Recall, for these data on 150 Nepalese children less than 12 months old, the estimated regression line is $\hat{y} = 2.7 + 0.16x_1$
► Here, the estimated intercept $\hat{\beta}_0 = 2.7$; what is the interpretation of this?
► $\hat{\beta}_0$ is the estimate of $\hat{y}$ when $x_1 = 0$ (in other words, the estimated mean arm circumference for children 0 cm tall)
► As we noted before, estimates of mean arm circumferences only apply to the observed height range, so the intercept estimates a mean arm circumference for a height group outside the range of child heights in the sample
► Frequently, the intercept has no meaningful scientific interpretation when $x_1$ is a continuous predictor, but it is necessary to fully specify the equation of a line and to make estimates of mean arm circumference for groups of children with heights in the sample range
Arm Circumference and Height—20
22
Systolic Blood Pressure and Age—1
► Data from National Health and Nutrition Examination Survey (NHANES), 2013–2014
► Data include 10,000+ observations on persons 0–80 years old and 7,172 systolic blood
pressure measurements on persons 8–80 years old
23
Systolic Blood Pressure and Age—2
24
Systolic Blood Pressure and Age—3
► Scatterplot of systolic blood pressure (SBP, mmHg) versus age (years) with exploratory
LOWESS running mean smoother
25
Systolic Blood Pressure and Age—4
► A regression can be estimated via computer, of the form $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$, where $x_1$ is age in years, and $\hat{y}$ is the average SBP (mmHg) for a group of individuals all of the same age, $x_1$
► For these data, the estimated regression line is $\hat{y} = 99.52 + 0.48x_1$
Systolic Blood Pressure and Age—5
► For these data, the estimated regression line is $\hat{y} = 99.52 + 0.48x_1$
► Here, the estimated slope $\hat{\beta}_1 = 0.48$; what is the interpretation of this?
► $\hat{\beta}_1$ is the average change in SBP for a one-unit (1 year) increase in age
► $\hat{\beta}_1$ is the mean difference in SBP for two groups of persons who differ by one unit (1 year) in age, older to younger
► This result estimates that the mean difference in SBP for a one-year difference in age is 0.48 mmHg, with older individuals having greater average SBP
Systolic Blood Pressure and Age—6
► For these data, the estimated regression line is $\hat{y} = 99.52 + 0.48x_1$
► Here, the estimated intercept $\hat{\beta}_0 = 99.52$; what is the interpretation of this?
► $\hat{\beta}_0$ is the estimate of $\hat{y}$ when $x_1 = 0$; in other words, the estimated mean SBP for persons 0 years old (newborns)
► Because the age range in persons for which SBP was measured is 8 to 80 years, this estimated mean is not applicable to these data; however, this intercept estimate is necessary to specify the complete relationship
► Frequently, the intercept has no meaningful scientific interpretation when $x_1$ is a continuous predictor, but it is necessary to fully specify the equation of a line and to make estimates of mean SBP for groups in the sample range
Systolic Blood Pressure and Age—7
29
Summary
► Simple linear regression is a method for estimating the relationship between the mean value of an outcome, $y$, and a predictor $x_1$, via a linear equation
Simple Linear Regression Model:
Estimating the Regression Equation—
Accounting for Uncertainty in the
Estimates
Learning Objectives
► Creating confidence intervals for linear regression slopes means creating confidence
intervals for mean differences, and the approach is “business as usual”
2
Arm Circumference and Height—1
► In the last section, we showed the results from several simple linear regression models
► For example, when relating arm circumference to height using a random sample of 150 Nepali children <12 months old, the resulting regression equation was $\hat{y} = 2.7 + 0.16x_1$
► This was estimated from the individual arm circumference and height data, using a
computer package
► What is the algorithm to estimate this equation? There must be some algorithm that
will always yield the same results for the same data set, regardless of the computer
package used to fit the regression model
3
Arm Circumference and Height—2
► The algorithm to estimate the equation of the line is called the “least squares” estimation
► The idea is to find the line that gets “closest” to all of the points in the sample
4
Arm Circumference and Height—3
► Each distance is $y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{1i})$, and this is computed for each data point in the sample (every observation $i$)
► The distances (discrepancies between each observed $y$-value and the estimated mean of $y$, $\hat{y}$) are called the residuals from the regression model
Estimating Intercept and Slope: The Least Squares Algorithm
► The least squares approach is used to estimate the slope and intercept for a specified
regression line for a given dataset
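► The algorithm chooses the values of $\hat{\beta}_1$ and $\hat{\beta}_0$ that minimize the sum of the squared residuals across all $n$ observations in the sample: $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{1i})\right)^2$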
► This method also yields standard errors for the slope and intercept estimates, 𝛽𝛽̂1 and 𝛽𝛽̂0
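For a single predictor, the least squares criterion has a well-known closed-form solution. The sketch below implements it on simulated data (hypothetical values, not the Nepali sample), checks it against NumPy's `polyfit`, and confirms that moving away from the fitted line increases the sum of squared residuals. The function names are illustrative.

```python
import numpy as np


def least_squares_line(x, y):
    """Closed-form least squares estimates (intercept, slope) for y on a single x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    slope = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Sum over observations of (observed y - estimated mean of y)^2."""
    residuals = np.asarray(y) - (intercept + slope * np.asarray(x))
    return float(np.sum(residuals ** 2))


rng = np.random.default_rng(3)
height = rng.uniform(41.0, 73.0, size=150)                      # simulated heights (cm)
arm = 2.7 + 0.16 * height + rng.normal(0.0, 1.1, size=150)      # simulated outcome (cm)

b0_hat, b1_hat = least_squares_line(height, arm)
ssr_best = sum_squared_residuals(height, arm, b0_hat, b1_hat)
ssr_shifted = sum_squared_residuals(height, arm, b0_hat + 0.1, b1_hat)

print(round(b0_hat, 3), round(b1_hat, 3), ssr_best < ssr_shifted)
```

Any statistical package that fits this model by least squares should return the same estimates for the same data, up to rounding.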
6
95% CIs and p-Values—1
► The standard errors allow for the computation of 95% CIs and p-values for the slope and
intercept
► The random sampling behavior of regression slopes and intercepts is normal (because
these are sample mean differences and sample means, respectively) in “large samples”
and a t-distribution in smaller samples
► Hence, it is “business as usual” for getting 95% CIs and doing hypothesis tests
7
95% CIs and p-Values—2
8
Arm Circumference and Height—4
► When relating arm circumference to height using a random sample of 150 Nepali children
<12 months old, the resulting regression equation was 𝑦𝑦� = 2.7 + 0.16𝑥𝑥1
9
Arm Circumference and Height—5
► 95% CI for population slope $\beta_1$:
$\hat{\beta}_1 \pm 2\,\widehat{SE}(\hat{\beta}_1) \rightarrow 0.16 \pm 2(0.014) \rightarrow\ \approx (0.13\ \text{cm}, 0.19\ \text{cm})$
► 95% CI for intercept $\beta_0$ (for illustration, as the result is not scientifically useful/interpretable):
$\hat{\beta}_0 \pm 2\,\widehat{SE}(\hat{\beta}_0) \rightarrow 2.7 \pm 2(0.88) \rightarrow\ \approx (0.94\ \text{cm}, 4.46\ \text{cm})$
► p-Value for population slope $\beta_1$: $H_0: \beta_1 = 0$ vs. $H_A: \beta_1 \neq 0$
► Assume null is true and calculate distance of slope estimate $\hat{\beta}_1$ (from 0) in units of standard error:
$t = \dfrac{\hat{\beta}_1}{\widehat{SE}(\hat{\beta}_1)} = \dfrac{0.16}{0.014} \approx 11.4$
► Translate to a p-value
► In this example, the p-value is very small, <0.0001
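The “estimate ± 2 standard errors” recipe and the test statistic can be reproduced with a few lines of Python. This is a sketch of the large-sample calculation only: the p-value uses the normal approximation via `math.erfc`, whereas software would typically use a t-distribution with $n - 2$ degrees of freedom. The function name `large_sample_summary` is illustrative.

```python
import math


def large_sample_summary(estimate: float, standard_error: float):
    """Return (lower, upper, test_statistic, two_sided_p) using the +/- 2 SE rule."""
    lower = estimate - 2 * standard_error
    upper = estimate + 2 * standard_error
    test_statistic = estimate / standard_error
    # Two-sided p-value under a standard normal reference distribution:
    # 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(test_statistic) / math.sqrt(2))
    return lower, upper, test_statistic, p_value


# Slope estimate and standard error quoted for arm circumference and height.
lower, upper, t_stat, p_value = large_sample_summary(0.16, 0.014)
print(round(lower, 2), round(upper, 2), round(t_stat, 1), p_value < 0.0001)
```

The same function can be applied to the intercept (2.7 with standard error 0.88) or to the other examples in this section.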
10
Arm Circumference and Height—6
► Summary of findings:
► This research used simple linear regression to estimate the magnitude of the
association between arm circumference and height in Nepali children less than 12
months old, using data on a random sample of 150
► A statistically significant positive association was found (p<.001)
► The results estimate that two groups of such children who differ by 1 cm in height will
differ on average by 0.16 cm in arm circumference (95% CI 0.13 cm to 0.19 cm)
11
Arm Circumference and Height—7
► Estimate the mean difference in arm circumference for children 70 cm tall compared to children 60 cm tall, and present a 95% CI for this difference
► From a previous section we know this estimated mean difference is $(70 - 60) \times \hat{\beta}_1 = 10\hat{\beta}_1 = 10(0.16) = 1.6$ cm
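Because the difference is a fixed multiple of the slope, its standard error is the same multiple of the slope's standard error, so the 95% CI endpoints can be obtained by multiplying the slope's CI endpoints by 10. The sketch below assumes the slope estimate (0.16) and standard error (0.014) quoted earlier in this section; variable names are illustrative.

```python
slope_hat = 0.16       # estimated slope (cm per cm of height)
slope_se = 0.014       # estimated standard error of the slope
height_gap = 70 - 60   # difference in height being compared (cm)

difference = height_gap * slope_hat
ci_lower = height_gap * (slope_hat - 2 * slope_se)
ci_upper = height_gap * (slope_hat + 2 * slope_se)

print(round(difference, 2), (round(ci_lower, 2), round(ci_upper, 2)))
```

Using the unrounded interval for the slope, this gives roughly 1.3 cm to 1.9 cm for the 10 cm comparison.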
12
Systolic Blood Pressure and Age—1
► Recall the results using the 7,172 observations from the NHANES 2013–2014 data relating systolic blood pressure (SBP) to age (years)
► For these data, the estimated regression line is $\hat{y} = 99.52 + 0.48x_1$
Systolic Blood Pressure and Age—2
► 95% CI for population slope $\beta_1$:
$\hat{\beta}_1 \pm 2\,\widehat{SE}(\hat{\beta}_1) \rightarrow 0.48 \pm 2(0.008) \rightarrow\ \approx (0.464\ \text{mmHg}, 0.496\ \text{mmHg})$
► 95% CI for intercept $\beta_0$ (for illustration, as the result is not scientifically useful/interpretable):
$\hat{\beta}_0 \pm 2\,\widehat{SE}(\hat{\beta}_0) \rightarrow 99.52 \pm 2(0.35) \rightarrow\ \approx (98.82\ \text{mmHg}, 100.22\ \text{mmHg})$
► p-Value for population slope $\beta_1$: $H_0: \beta_1 = 0$ vs. $H_A: \beta_1 \neq 0$
► Assume null is true and calculate distance of slope estimate $\hat{\beta}_1$ (from 0) in units of standard error:
$t = \dfrac{\hat{\beta}_1}{\widehat{SE}(\hat{\beta}_1)} = \dfrac{0.48}{0.008} = 60$
► Translate to a p-value
► In this example, the p-value is very small, <0.0001
Systolic Blood Pressure and Age—3
► Summary of findings:
► This research used simple linear regression to estimate the magnitude of the
association between systolic blood pressure (SBP) and age using NHANES 2013–2014
data
► Subjects ranged from 8 to 80 years old
► A statistically significant positive association was found between SBP and age (p<.001)
► The results estimate each additional year of age is associated with a 0.48 mmHg
increase in average SBP (95% CI 0.464 mmHg to 0.496 mmHg)
15
Hospital Length of Stay and Age of Visit—1
► Boxplot
16
Hospital Length of Stay and Age of Visit—2
► The resulting equation for these data (estimated via computer) is $\hat{y} = 2.74 + 2.13x_1$, where $x_1 = 1$ for persons >40 years old and 0 for persons ≤40 years old
Hospital Length of Stay and Age of Visit—3
► 95% CI for population slope $\beta_1$:
$\hat{\beta}_1 \pm 2\,\widehat{SE}(\hat{\beta}_1) \rightarrow 2.13 \pm 2(0.09) \rightarrow\ \approx (1.95\ \text{days}, 2.31\ \text{days})$
► 95% CI for intercept $\beta_0$ (here, this is a useful result!):
$\hat{\beta}_0 \pm 2\,\widehat{SE}(\hat{\beta}_0) \rightarrow 2.74 \pm 2(0.08) \rightarrow\ \approx (2.58\ \text{days}, 2.90\ \text{days})$
► p-Value for population slope $\beta_1$: $H_0: \beta_1 = 0$ vs. $H_A: \beta_1 \neq 0$
► Assume null is true and calculate distance of slope estimate $\hat{\beta}_1$ (from 0) in units of standard error:
$t = \dfrac{\hat{\beta}_1}{\widehat{SE}(\hat{\beta}_1)} = \dfrac{2.13}{0.09} \approx 23.7$
► Translate to a p-value
► In this example, the p-value is very small, <0.0001
Summary
► The construction of confidence intervals for linear regression slopes and intercepts is “business as usual”: take the estimate and add/subtract 2 estimated standard errors for “large samples”
► In smaller samples, the 95% CI and p-values are based on the t-distribution with $n - 2$ degrees of freedom, but this detail will be handled by a computer; the interpretations of the CIs and p-values are the same regardless of sample size
► Confidence intervals for slopes are confidence intervals for mean differences
► Confidence intervals for intercepts are confidence intervals for the mean of $y$ for a specific group ($x_1 = 0$): not always relevant when $x_1$ is continuous
Measuring the Strength of a Linear
Association
Overview/Learning Objectives
► The slope of a regression line estimates the magnitude and direction of the relationship between $y$ and $x_1$: it encapsulates how much $y$ differs on average with differences in $x_1$
► The slope estimate and standard error can be used to address the uncertainty in this
estimate with regards to the true magnitude and direction of the association in the
population from which the sample was taken
► Slopes do not impart any information about how well the regression line fits the data in
the sample; the slope gives no indication of how close the points get to the estimated
regression line
2
Value of Slope Depends on Units of $y$ and $x_1$
► The value of a regression slope depends on the units of both $y$ and $x_1$ (when $x_1$ is continuous)
► For example, with the arm circumference (AC) and age example, using data on 150 Nepalese children less than 12 months old
► AC in cm, age in years: $\hat{y} = 2.7 + 0.16x_1$ (units of slope: cm/year of age)
► AC in cm, age in months: $\hat{y} = 2.7 + 0.013x_1$ (units of slope: cm/month of age)
► AC in inches, age in years: $\hat{y} = 2.7 + 0.41x_1$ (units of slope: in/year of age)
► AC in inches, age in months: $\hat{y} = 2.7 + 0.034x_1$ (units of slope: in/month of age)
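The dependence of the slope on measurement units can be demonstrated by refitting the same data after a change of units. The sketch below uses simulated data (hypothetical values, not the numbers on this slide) and the standard conversions of 12 months per year and 2.54 cm per inch.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated, hypothetical data: a predictor in years and an outcome in cm.
x_years = rng.uniform(0.0, 1.0, size=150)
y_cm = 11.0 + 2.0 * x_years + rng.normal(0.0, 1.0, size=150)

# polyfit with deg=1 returns [slope, intercept].
slope_cm_per_year, _ = np.polyfit(x_years, y_cm, deg=1)

# Same data, predictor expressed in months: the slope is divided by 12.
slope_cm_per_month, _ = np.polyfit(x_years * 12.0, y_cm, deg=1)

# Same data, outcome expressed in inches: the slope is divided by 2.54.
slope_in_per_year, _ = np.polyfit(x_years, y_cm / 2.54, deg=1)

print(round(slope_cm_per_year, 3), round(slope_cm_per_month, 3), round(slope_in_per_year, 3))
```

The fitted relationship is identical in each case; only the number used to express the slope changes with the units.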
3
Quantifying the Strength of a Linear (Regression) Relationship—1
► Because the value of the slope is affected by the choice of units for both $y$ and $x_1$, the (absolute) magnitude of the slope does not give any information about the strength of a linear relationship
► Another quantity that can be estimated via linear regression is the coefficient of determination, $R^2$: this is a number that ranges from 0 to 1, with larger values indicating “closer fits” of the data points and regression line
Arm Circumference and Height—1
$\sum_{i=1}^{n} (y_i - \bar{y})^2$
5
Arm Circumference and Height—2
6
Arm Circumference and Height—3
7
Computation of $R^2$
► $R^2$ will be calculated by the computer (not by hand); for reference, I am providing the formula for $R^2$ to illustrate the concept
► The proportion of the overall variability in $y$ not explained by taking $x_1$ into account (i.e., not explained by the linear regression) is given by: $\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
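To make the concept concrete, the following sketch computes the unexplained proportion and $R^2$ (one minus that proportion) on simulated data. The data are hypothetical and the function name `r_squared` is illustrative. The last lines check that $R^2$, unlike the slope, does not change when the predictor's units change.

```python
import numpy as np


def r_squared(x, y):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares about the mean)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    ss_residual = np.sum((y - fitted) ** 2)    # variability about the regression line
    ss_total = np.sum((y - y.mean()) ** 2)     # variability about the overall mean
    return 1.0 - ss_residual / ss_total


rng = np.random.default_rng(4)
x = rng.uniform(41.0, 73.0, size=150)                  # simulated predictor
y = 2.7 + 0.16 * x + rng.normal(0.0, 1.1, size=150)    # simulated outcome

r2_original_units = r_squared(x, y)
r2_rescaled_units = r_squared(x / 2.54, y)             # same data, different x units

print(round(r2_original_units, 3), round(r2_rescaled_units, 3))
```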
8
Interpretation of $R^2$
► $R^2$ quantifies the proportion of variability in $y$ explained by taking $x_1$ into account (i.e., explained by the linear regression): $R^2 = 1 - \dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
► $0 \le R^2 \le 1$: the closer $R^2$ is to 1, the “stronger” the linear relationship is (i.e., the closer the individual $y$ values are to their respective mean values, the $\hat{y}$’s, predicted by $x_1$ from $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$)
Arm Circumference and Height—4
10
Systolic Blood Pressure (SBP) and Age
11
Arm Circumference and Sex
12
What Is a Good $R^2$?—1
► As with all other estimates, $R^2$ is based on the sample of data, and it is frequently reported without any recognition of sampling variability (for example, a 95% confidence interval)
► Low $R^2$ is not necessarily “bad”: many outcomes cannot/will not be fully or close to fully explained, in terms of variability, by any one single predictor
► The higher the $R^2$ value, the better $x_1$ predicts $y$ for individuals in a sample/population, as individual $y$-values vary less about their estimated means based on $x_1$
What Is a Good $R^2$?—2
► However, there may be important overall associations between the mean of $y$ and $x_1$, even
though there is still a lot of individual variability in $y$-values about their means estimated
by $x_1$
► In the SBP and age example, age explained an estimated 34% of the variability in SBP
► The association was statistically significant, showing that average SBP is larger for older
persons
► However, for any given age (in years), there is still substantial variation in the SBPs of individuals
$R^2$’s Companion Statistic, $r$—1
► Another value that measures the strength of the linear relationship, but also includes
information about the direction of the relationship, is the correlation coefficient $r$
► $r$ is the “properly signed” square root of $R^2$: the sign of $r$ corresponds to the sign of
the slope
$R^2$’s Companion Statistic, $r$—2
► Arm circumference and height: The $R^2$ for the regression model given by $\hat{y} = 2.7 + 0.16x_1$
[$y$ = AC (cm), $x_1$ = height (cm)] is 0.46 (46%)
► The correlation coefficient $r$ is $r = +\sqrt{0.46} \approx 0.68$
► Systolic blood pressure and age: The $R^2$ for the regression model given by $\hat{y} = 99.52 + 0.48x_1$
[$y$ = SBP (mmHg), $x_1$ = age (years)] is 0.34 (34%)
► The correlation coefficient $r$ is $r = +\sqrt{0.34} \approx 0.58$
► Arm circumference and female sex: The $R^2$ for the regression model given by $\hat{y} = 12.5 + (-0.13)x_1$
[$y$ = AC (cm), $x_1$ = sex (1 = F)] is 0.002 (0.2%)
► The correlation coefficient $r$ is $r = -\sqrt{0.002} \approx -0.04$
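The three calculations above follow one rule: take the square root of $R^2$ and give it the sign of the slope. A minimal sketch of that rule is given here; the function name is a hypothetical helper for illustration, not part of any statistics package.

```python
import math

def correlation_from_r_squared(r_squared, slope):
    """Return r as the square root of R^2 carrying the sign of the slope."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R^2 must be between 0 and 1")
    return math.copysign(math.sqrt(r_squared), slope)

# (R^2, slope) pairs taken from the three examples in the slides.
examples = {
    "AC and height": (0.46, 0.16),
    "SBP and age": (0.34, 0.48),
    "AC and female sex": (0.002, -0.13),
}
for name, (r_squared, slope) in examples.items():
    r = correlation_from_r_squared(r_squared, slope)
    print(f"{name}: r = {r:.2f}")
```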
Slope Versus $R^2$ (and $r$)
► The slope estimates the magnitude and direction of the relationship between $y$ and $x_1$
► The slope estimates a mean difference in $y$ for two groups who differ by one unit in $x_1$
► The slope will change if the units change for $y$ and/or for $x_1$
► Larger slopes are not indicative of a stronger linear association; smaller slopes are not
indicative of a weaker linear association
$R^2$ Versus $r$
► If you have $R^2$, you can almost compute $r$: you also need to know the direction of the
relationship (having the slope will give this)
Utility of $r$
► Correlations table
Summary
► The correlation coefficient $r$ is the properly signed square root of $R^2$ and, hence, provides
information about the direction of the association estimated by the regression
Additional Examples
Arm Circumference and Sex of Child—1
► The resulting equation (estimated via a computer): $\hat{y} = 12.5 + (-0.13)x_1$
► $\hat{\beta}_1 = -0.13$: The estimated mean difference in arm circumference for female children
compared to male children is $-0.13$ cm; female children have lower arm circumference by
0.13 cm on average
► $\hat{\beta}_0 = 12.5$: The estimated mean arm circumference for male children (reference group) is 12.5 cm
► $\hat{\beta}_0 + \hat{\beta}_1 = 12.5 + (-0.13) = 12.37$ cm, the estimated mean arm circumference for female
children
Arm Circumference and Sex of Child—2
► The variable $x_1$ could instead have been coded as 1 for male children and 0 for female
children, and the following regression line could be fit: $\hat{y} = \hat{\beta}_0^* + \hat{\beta}_1^* x_1$
► (I added a “*” to each beta simply to distinguish these values from the values under
the original coding for $x_1$)
► Can you figure out, based on the previous regression results, the values
of $\hat{\beta}_0^*$ and $\hat{\beta}_1^*$ for the same data?
Arm Circumference and Sex of Child—3
► The variable $x_1$ could instead have been coded as 1 for male children and 0 for female
children, and the following regression line could be fit: $\hat{y} = \hat{\beta}_0^* + \hat{\beta}_1^* x_1$
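One way to reason about the recoded model: the intercept is always the estimated mean for the group coded 0, and the slope is the estimated mean difference for the group coded 1 compared to the group coded 0. Swapping the coding therefore makes the new intercept equal to the old intercept plus the old slope, and flips the sign of the slope. The sketch below applies this to the estimates from the original coding (12.5 and −0.13); the function name is a hypothetical helper.

```python
def recode_binary_predictor(intercept, slope):
    """Coefficients after swapping which group is coded 0 and which is coded 1.

    The group that was coded 1 becomes the reference group, so the new
    intercept is that group's estimated mean and the new slope changes sign.
    """
    new_intercept = intercept + slope
    new_slope = -slope
    return new_intercept, new_slope

# Estimates under the original coding (1 = female, 0 = male).
b0_star, b1_star = recode_binary_predictor(12.5, -0.13)
print(f"intercept* = {b0_star:.2f} cm, slope* = {b1_star:+.2f} cm")
```

Under the new coding, the intercept (12.37 cm) is the estimated mean for female children, and the slope (+0.13 cm) is the estimated mean difference for male children compared to female children.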
Systolic Blood Pressure (SBP) and Ethnicity
► Designate one group as the reference group (for example, Mexican Americans) and make
binary indicators for each of the four other ethnicity categories
► $x_1 = 1$ if Hispanic, 0 otherwise
► $x_2 = 1$ if Non-Hispanic White, 0 otherwise
► $x_3 = 1$ if Non-Hispanic Black, 0 otherwise
► $x_4 = 1$ if Other, 0 otherwise
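The indicator coding above can be sketched in a few lines. This is an illustrative helper written for this example (the function name and the four sample labels are made up); statistical packages typically build these indicators automatically once a reference group is chosen.

```python
def make_indicators(labels, categories, reference):
    """Build one 0/1 indicator per non-reference category.

    Returns a dict mapping each non-reference category to a list with one
    0/1 value per observation. Observations in the reference group are 0
    on every indicator.
    """
    if reference not in categories:
        raise ValueError("reference must be one of the categories")
    non_reference = [c for c in categories if c != reference]
    return {c: [1 if label == c else 0 for label in labels] for c in non_reference}

categories = [
    "Mexican American",
    "Hispanic",
    "Non-Hispanic White",
    "Non-Hispanic Black",
    "Other",
]
# Made-up observations for illustration only.
labels = ["Hispanic", "Mexican American", "Other", "Non-Hispanic Black"]

indicators = make_indicators(labels, categories, reference="Mexican American")
for name, column in indicators.items():
    print(f"{name}: {column}")
```

With this coding, each slope is the estimated mean difference between that group and the reference group, and the intercept is the estimated mean for the reference group.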
SBP and Ethnicity—1
► Resulting model:
► What is the mean difference (and 95% CI) in SBP between Non-Hispanic Blacks and
Mexican Americans?
SBP and Ethnicity—2
► Resulting model:
► What is the mean difference (and 95% CI) in SBP between Non-Hispanic Blacks and
Non-Hispanic Whites?
SBP and Ethnicity—3
► Resulting model:
► What is the estimated mean SBP (and 95% CI) for Non-Hispanic Whites?
Systolic Blood Pressure and Age—1
► Recall the results using the 7,172 observations from the NHANES 2013–2014 data relating
systolic blood pressure to age (years)
► For these data, the estimated regression line is $\hat{y} = 99.52 + 0.48x_1$
Systolic Blood Pressure and Age—2
► What is the estimated mean difference (and 95% CI) in SBP for 65-year-olds compared to
60-year-olds?
Systolic Blood Pressure and Age—3
► What is the estimated mean SBP (and 95% CI) for 65-year-olds?
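The point estimates for the two SBP and age questions follow directly from the fitted line $\hat{y} = 99.52 + 0.48x_1$. The sketch below uses hypothetical helper names. The confidence limits for the slope are not reproduced in this section, so `difference_ci` takes them as an input rather than assuming values. A 95% CI for the mean SBP at a single age needs the standard error of the fitted mean from the regression software, which this sketch does not attempt.

```python
def estimated_mean(intercept, slope, x):
    """Estimated mean of y at a given value of x."""
    return intercept + slope * x

def estimated_mean_difference(slope, x_a, x_b):
    """Estimated mean difference in y for x = x_a compared to x = x_b."""
    return slope * (x_a - x_b)

def difference_ci(slope_ci, x_a, x_b):
    """Scale a (lower, upper) CI for the slope to a difference of x_a - x_b units."""
    k = x_a - x_b
    scaled = (slope_ci[0] * k, slope_ci[1] * k)
    return (min(scaled), max(scaled))  # keep lower <= upper if k is negative

INTERCEPT, SLOPE = 99.52, 0.48  # estimates from the slides

diff_65_vs_60 = estimated_mean_difference(SLOPE, 65, 60)
mean_at_65 = estimated_mean(INTERCEPT, SLOPE, 65)
print(f"Mean difference, 65 vs 60 years: {diff_65_vs_60:.2f} mmHg")
print(f"Estimated mean SBP at 65 years: {mean_at_65:.2f} mmHg")
```

The difference depends only on the slope and the 5-year gap (5 × 0.48 = 2.4 mmHg), whereas the estimated mean at age 65 uses both the intercept and the slope (99.52 + 0.48 × 65 = 130.72 mmHg).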