
Applied Statistics II-SLR


BCBB Workshop

Applied Statistics - II
Qinlu (Claire) Wang
Statistician

Bioinformatics and Computational Biosciences Branch (BCBB)


Office of Cyber Infrastructure and Computational Biology (OCICB)
National Institute of Allergy and Infectious Diseases (NIAID)
https://nih.box.com/v/maliBCBB2020

-> “Biostatistics” folder -> “Applied Statistics II-SLR”


Outline

Simple Linear Regression Model

1. Introduction
2. Assumptions
3. Diagnostics

4. Prediction
Simple Linear Regression
Simple linear regression is a statistical method that allows us to summarize and study
relationships between two continuous (quantitative) variables:
• One variable, denoted x, is regarded as the predictor, explanatory, or independent variable.
• The other variable, denoted y, is regarded as the response, outcome, or dependent variable.

Find the “Best Fitting Line”

Height  Weight
63      127
64      121
66      142
69      157
69      162
71      156
71      169
72      165
73      181
75      208
 For data (x₁, Y₁), …, (xₙ, Yₙ), where x₁, …, xₙ are known constants and Y₁, …, Yₙ are the observed random responses, we formulate the simple linear regression model as:

 Yᵢ = β₀ + β₁xᵢ + εᵢ

 Least squares criterion – find the values β₀ and β₁ that minimize the sum of squares:

 S(β₀, β₁) = Σᵢ₌₁ⁿ (Yᵢ − β₀ − β₁xᵢ)²
 Least Squares Estimates for β₀ and β₁:

 β̂₀ = Ȳ − β̂₁x̄

 β̂₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)(Yᵢ − Ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²
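As a sketch, the closed-form estimates above can be computed directly with NumPy, using the height/weight data from the earlier slide (the variable names here are my own):

```python
import numpy as np

# Height/weight data from the earlier slide.
x = np.array([63, 64, 66, 69, 69, 71, 71, 72, 73, 75], dtype=float)            # height
y = np.array([127, 121, 142, 157, 162, 156, 169, 165, 181, 208], dtype=float)  # weight

# beta1_hat = sum((x_i - x_bar)(Y_i - Y_bar)) / sum((x_i - x_bar)^2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# beta0_hat = Y_bar - beta1_hat * x_bar
beta0_hat = y.mean() - beta1_hat * x.mean()

print(f"intercept = {beta0_hat:.2f}, slope = {beta1_hat:.2f}")
```

For this data set the fitted line comes out to roughly ŷ = −266.53 + 6.14x.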
 
The method of maximum likelihood finds the parameter values β₀, β₁, and σ² that maximize the joint density of the independent responses Y₁, …, Yₙ evaluated at the observed values.

In this (special) case, the method of maximum likelihood gives the same parameter estimates
as the method of least squares.
Assumptions

 Linearity of the data. The relationship between the predictor (x) and the
outcome (y) is assumed to be linear.
 Normality of residuals. The residual errors ε₁, …, εₙ are assumed to be normally
distributed.
 Homogeneity of residuals variance. The residuals are assumed to have a
constant variance.
 Independence of residual error terms. ε₁, …, εₙ are independent.

In short: εᵢ ~ NID(0, σ²), i = 1, …, n
How to check assumptions?
1. Linearity of the data

Ideally, the residual plot will show no fitted pattern.

If not? A simple approach is to use non-linear transformations of the
predictors, such as log(x), sqrt(x), and x², in the regression model.
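To make this check concrete, the residuals can be computed numerically; a minimal sketch reusing the slide's height/weight data (a real diagnostic would scatter-plot residuals against fitted values):

```python
import numpy as np

x = np.array([63, 64, 66, 69, 69, 71, 71, 72, 73, 75], dtype=float)
y = np.array([127, 121, 142, 157, 162, 156, 169, 165, 181, 208], dtype=float)

# Fit the line by least squares, then form fitted values and residuals.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
fitted = intercept + slope * x
residuals = y - fitted

# With an intercept in the model, least squares forces the residuals to sum
# to (numerically) zero; a residuals-vs-fitted plot should show no pattern.
print(residuals.round(2))
print(abs(residuals.sum()) < 1e-8)
```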
2. Normality of residuals

Ideally, the normal probability plot of residuals should approximately
follow a straight line.
3. Homogeneity of variance

Ideally, it’s good if you see a horizontal line with equally spread points.
If not? A possible solution is to use a log or square root transformation of the response (y).
4. Outliers and high leverage points

An outlier is a point with an extreme outcome variable value.

An influential value is a value whose inclusion or exclusion can alter the
results of the regression analysis.
Discussion about assumptions

Having patterns in residuals is not a stop signal.

Potential problems might be:


• A non-linear relationship between the outcome and the predictor variables.
• Existence of important variables that you left out from your model.
• Presence of outliers.
Diagnostics
1. The p Value
2. R-Squared and Adjusted R-Squared

R² = 1 − SSE/SST

SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
SST = Σᵢ₌₁ⁿ (yᵢ − ȳ)²

Adjusted R²: R²adj = 1 − MSE/MST

MSE = SSE / (n − q)
MST = SST / (n − 1)
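These formulas can be checked by hand on the slide's height/weight data; a sketch, taking q as the number of estimated coefficients (intercept and slope, so q = 2, following the convention of the sthda tutorial cited at the end):

```python
import numpy as np

x = np.array([63, 64, 66, 69, 69, 71, 71, 72, 73, 75], dtype=float)
y = np.array([127, 121, 142, 157, 162, 156, 169, 165, 181, 208], dtype=float)

slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
y_hat = intercept + slope * x

n = len(y)
q = 2                              # estimated coefficients: intercept and slope
sse = np.sum((y - y_hat) ** 2)     # sum of squared errors
sst = np.sum((y - y.mean()) ** 2)  # total sum of squares

r2 = 1 - sse / sst
r2_adj = 1 - (sse / (n - q)) / (sst / (n - 1))
print(f"R^2 = {r2:.3f}, adjusted R^2 = {r2_adj:.3f}")
```

Adjusted R² is always at most R², since it penalizes each extra coefficient.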
3. Standard Error and F-Statistic
4. AIC and BIC

The Akaike information criterion – AIC (Akaike, 1974)

Bayesian information criterion – BIC (Schwarz, 1978)

 AIC = 2k − 2 ln(L̂)

 BIC = k ln(n) − 2 ln(L̂)

For model comparison, the model with the lowest AIC and BIC score is preferred.
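For a Gaussian linear model fitted by least squares, the maximized log-likelihood has a closed form in terms of SSE, so AIC and BIC can be sketched directly (note: conventions differ on whether σ² is counted in k; here k = 3 for β₀, β₁, and σ², matching R's `AIC()` for `lm` fits):

```python
import numpy as np

x = np.array([63, 64, 66, 69, 69, 71, 71, 72, 73, 75], dtype=float)
y = np.array([127, 121, 142, 157, 162, 156, 169, 165, 181, 208], dtype=float)

slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
sse = np.sum((y - (intercept + slope * x)) ** 2)

n = len(y)
k = 3  # parameters counted: beta0, beta1, sigma^2 (conventions vary)

# Maximized Gaussian log-likelihood of a least squares fit:
# ln L_hat = -(n/2) * (ln(2*pi) + ln(SSE/n) + 1)
log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(sse / n) + 1)

aic = 2 * k - 2 * log_lik
bic = k * np.log(n) - 2 * log_lik
print(f"AIC = {aic:.2f}, BIC = {bic:.2f}")
```

Because ln(n) > 2 whenever n > 7, BIC penalizes parameters more heavily than AIC.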
Prediction

 A confidence interval is a range of values associated with a population
parameter, for example, the mean of a population.

Standard error of the fit

 A prediction interval is where you expect a future value to fall.

Standard error of the prediction

* More discussion about the formula difference


A prediction interval reflects the uncertainty around a single value, while a confidence
interval reflects the uncertainty around the mean prediction values.
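The formula difference can be made explicit: the standard error of the prediction carries an extra "1 +" term for the variability of a single new observation, which is why prediction intervals are always wider. A sketch reusing the slide's data, at a hypothetical new height x₀ = 70 (my choice):

```python
import numpy as np

x = np.array([63, 64, 66, 69, 69, 71, 71, 72, 73, 75], dtype=float)
y = np.array([127, 121, 142, 157, 162, 156, 169, 165, 181, 208], dtype=float)

n = len(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
sse = np.sum((y - (intercept + slope * x)) ** 2)
s = np.sqrt(sse / (n - 2))             # residual standard error
sxx = np.sum((x - x.mean()) ** 2)

x0 = 70.0                              # hypothetical new height
y_hat0 = intercept + slope * x0

# Standard error of the fit (confidence interval for the mean response):
se_fit = s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / sxx)
# Standard error of the prediction (prediction interval for one new value);
# the extra "1 +" under the square root is the only difference:
se_pred = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)

# Each interval is y_hat0 ± t* × se, with t* from a t distribution (n − 2 df).
print(f"fit SE = {se_fit:.2f}, prediction SE = {se_pred:.2f}")
```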
Conclusions

 Regression is much more than this! Examples:
• Robust regression
• Multiple Linear Regression
• Non-linear Regression
• Logistic Regression

 The materials of this training: Mali - BCBB - 2020
 Github: BCBB-Statistical Testing


 If you have any specific statistical problem, please send email to
bioinformatics@niaid.nih.gov
Reference
 Penn State University – STAT Applied Regression Analysis
– https://online.stat.psu.edu/stat462/node/79/
– https://online.stat.psu.edu/stat501/lesson/3/3.3
 Linear Regression Assumptions and Diagnostics in R: Essentials
– http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/
 Linear Regression
– http://r-statistics.co/Linear-Regression.html
 Duke Statistics – Unit 6: Simple Linear Regression
– http://www2.stat.duke.edu/~tjl13/s101/slides/unit6lec3H.pdf
 The University of Sydney – STAT3022 Applied Linear Models Lecture 3
– http://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Lecture/lecture03_2020JC.html#1
Thank You!
