
Applied Statistics II-SLR


BCBB Workshop

Applied Statistics - II
Qinlu (Claire) Wang
Statistician

Bioinformatics and Computational Biosciences Branch (BCBB)


Office of Cyber Infrastructure and Computational Biology (OCICB)
National Institute of Allergy and Infectious Diseases (NIAID)
https://nih.box.com/v/maliBCBB2020

-> “Biostatistics” folder -> “Applied Statistics II-SLR”


Outline

Simple Linear Regression Model

1. Introduction
2. Assumptions
3. Diagnostics

4. Prediction
Simple Linear Regression
Simple linear regression is a statistical method that allows us to summarize and study
relationships between two continuous (quantitative) variables:
• One variable, denoted x, is regarded as the predictor, explanatory, or independent variable.
• The other variable, denoted y, is regarded as the response, outcome, or dependent variable.

Find the “Best Fitting Line”

Height  Weight
63      127
64      121
66      142
69      157
69      162
71      156
71      169
72      165
73      181
75      208
 For data (x₁, Y₁), …, (xₙ, Yₙ), where x₁, …, xₙ are known constants and Y₁, …, Yₙ are the observed random responses, we formulate the simple linear regression model as:

 Yᵢ = β₀ + β₁xᵢ + εᵢ

 Least squares criterion – find the values β₀ and β₁ that minimize the sum of squares:

 S(β₀, β₁) = Σᵢ₌₁ⁿ (Yᵢ − β₀ − β₁xᵢ)²
 Least Squares Estimates for β₀ and β₁:

 β̂₀ = Ȳ − β̂₁x̄

 β̂₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)(Yᵢ − Ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²
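As a sketch, the closed-form estimates above can be computed directly with NumPy, using the height/weight data from the earlier slide (the variable names here are my own):

```python
import numpy as np

# Height/weight data from the earlier slide.
x = np.array([63, 64, 66, 69, 69, 71, 71, 72, 73, 75], dtype=float)            # height
y = np.array([127, 121, 142, 157, 162, 156, 169, 165, 181, 208], dtype=float)  # weight

# beta1_hat = sum((x_i - x_bar)(Y_i - Y_bar)) / sum((x_i - x_bar)^2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# beta0_hat = Y_bar - beta1_hat * x_bar
beta0_hat = y.mean() - beta1_hat * x.mean()

print(f"intercept = {beta0_hat:.2f}, slope = {beta1_hat:.2f}")
```

For this data set the fitted line comes out to roughly ŷ = −266.53 + 6.14x.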
 
The method of maximum likelihood finds the parameter values β₀, β₁, and σ² that maximize the joint density of the independent responses Y₁, …, Yₙ evaluated at the observed values.

In this (special) case, the method of maximum likelihood gives the same parameter estimates
as the method of least squares.
Assumptions

 Linearity of the data. The relationship between the predictor (x) and the
outcome (y) is assumed to be linear.
 Normality of residuals. The residual errors ε₁, …, εₙ are assumed to be normally
distributed.
 Homogeneity of residuals variance. The residuals are assumed to have a
constant variance.
 Independence of residual error terms. ε₁, …, εₙ are independent.

In short: εᵢ ~ NID(0, σ²), i = 1, …, n
How to check assumptions?
1. Linearity of the data

Ideally, the residual plot will show no fitted pattern.

If not? A simple approach is to use non-linear transformations of the
predictors, such as log(x), sqrt(x), and x², in the regression model.
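To make this check concrete, the residuals can be computed numerically; a minimal sketch reusing the slide's height/weight data (a real diagnostic would scatter-plot residuals against fitted values):

```python
import numpy as np

x = np.array([63, 64, 66, 69, 69, 71, 71, 72, 73, 75], dtype=float)
y = np.array([127, 121, 142, 157, 162, 156, 169, 165, 181, 208], dtype=float)

# Fit the line by least squares, then form fitted values and residuals.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
fitted = intercept + slope * x
residuals = y - fitted

# With an intercept in the model, least squares forces the residuals to sum
# to (numerically) zero; a residuals-vs-fitted plot should show no pattern.
print(residuals.round(2))
print(abs(residuals.sum()) < 1e-8)
```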
2. Normality of residuals

Ideally, the normal probability plot of residuals should approximately
follow a straight line.
3. Homogeneity of variance

Ideally, it’s good if you see a horizontal line with equally spread points.
If not? A possible solution is to use a log or square root transformation of the response (y).
4. Outliers and high leverage points

An outlier is a point with an extreme outcome variable value.

An influential value is a value whose inclusion or exclusion can alter the
results of the regression analysis.
Discussion about assumptions

Having patterns in residuals is not a stop signal.

Potential problems might be:


• A non-linear relationship between the outcome and the predictor variables.
• Existence of important variables that you left out from your model.
• Presence of outliers.
Diagnostics
1. The p Value
2. R-Squared and Adjusted R-Squared

R² = 1 − SSE/SST

SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
SST = Σᵢ₌₁ⁿ (yᵢ − ȳ)²

Adjusted R²: R²adj = 1 − MSE/MST

MSE = SSE / (n − q)
MST = SST / (n − 1)
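These formulas can be checked by hand on the slide's height/weight data; a sketch, taking q as the number of estimated coefficients (intercept and slope, so q = 2, following the convention of the sthda tutorial cited at the end):

```python
import numpy as np

x = np.array([63, 64, 66, 69, 69, 71, 71, 72, 73, 75], dtype=float)
y = np.array([127, 121, 142, 157, 162, 156, 169, 165, 181, 208], dtype=float)

slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
y_hat = intercept + slope * x

n = len(y)
q = 2                              # estimated coefficients: intercept and slope
sse = np.sum((y - y_hat) ** 2)     # sum of squared errors
sst = np.sum((y - y.mean()) ** 2)  # total sum of squares

r2 = 1 - sse / sst
r2_adj = 1 - (sse / (n - q)) / (sst / (n - 1))
print(f"R^2 = {r2:.3f}, adjusted R^2 = {r2_adj:.3f}")
```

Adjusted R² is always at most R², since it penalizes each extra coefficient.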
3. Standard Error and F-Statistic
4. AIC and BIC

The Akaike information criterion – AIC (Akaike, 1974)

Bayesian information criterion – BIC (Schwarz, 1978)

 AIC = 2k − 2 ln(L̂)

 BIC = k ln(n) − 2 ln(L̂)

For model comparison, the model with the lowest AIC and BIC score is preferred.
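For a Gaussian linear model fitted by least squares, the maximized log-likelihood has a closed form in terms of SSE, so AIC and BIC can be sketched directly (note: conventions differ on whether σ² is counted in k; here k = 3 for β₀, β₁, and σ², matching R's `AIC()` for `lm` fits):

```python
import numpy as np

x = np.array([63, 64, 66, 69, 69, 71, 71, 72, 73, 75], dtype=float)
y = np.array([127, 121, 142, 157, 162, 156, 169, 165, 181, 208], dtype=float)

slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
sse = np.sum((y - (intercept + slope * x)) ** 2)

n = len(y)
k = 3  # parameters counted: beta0, beta1, sigma^2 (conventions vary)

# Maximized Gaussian log-likelihood of a least squares fit:
# ln L_hat = -(n/2) * (ln(2*pi) + ln(SSE/n) + 1)
log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(sse / n) + 1)

aic = 2 * k - 2 * log_lik
bic = k * np.log(n) - 2 * log_lik
print(f"AIC = {aic:.2f}, BIC = {bic:.2f}")
```

Because ln(n) > 2 whenever n > 7, BIC penalizes parameters more heavily than AIC.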
Prediction

 A confidence interval is a range of values associated with a population
parameter, for example, the mean of a population.

Standard error of the fit

 A prediction interval is where you expect a future value to fall.

Standard error of the prediction

* More discussion about the formula difference


A prediction interval reflects the uncertainty around a single value, while a confidence
interval reflects the uncertainty around the mean prediction values.
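The formula difference can be made explicit: the standard error of the prediction carries an extra "1 +" term for the variability of a single new observation, which is why prediction intervals are always wider. A sketch reusing the slide's data, at a hypothetical new height x₀ = 70 (my choice):

```python
import numpy as np

x = np.array([63, 64, 66, 69, 69, 71, 71, 72, 73, 75], dtype=float)
y = np.array([127, 121, 142, 157, 162, 156, 169, 165, 181, 208], dtype=float)

n = len(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
sse = np.sum((y - (intercept + slope * x)) ** 2)
s = np.sqrt(sse / (n - 2))             # residual standard error
sxx = np.sum((x - x.mean()) ** 2)

x0 = 70.0                              # hypothetical new height
y_hat0 = intercept + slope * x0

# Standard error of the fit (confidence interval for the mean response):
se_fit = s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / sxx)
# Standard error of the prediction (prediction interval for one new value);
# the extra "1 +" under the square root is the only difference:
se_pred = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)

# Each interval is y_hat0 ± t* × se, with t* from a t distribution (n − 2 df).
print(f"fit SE = {se_fit:.2f}, prediction SE = {se_pred:.2f}")
```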
Conclusions

 Regression is much more than this! Examples:
• Robust regression
• Multiple Linear Regression
• Non-linear Regression
• Logistic Regression

 The materials of this training: Mali - BCBB - 2020
 Github: BCBB-Statistical Testing


 If you have any specific statistical problem, please send email to
bioinformatics@niaid.nih.gov
Reference
 Penn State University – STAT Applied Regression Analysis
– https://online.stat.psu.edu/stat462/node/79/
– https://online.stat.psu.edu/stat501/lesson/3/3.3
 Linear Regression Assumptions and Diagnostics in R: Essentials
– http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/
 Linear Regression
– http://r-statistics.co/Linear-Regression.html
 Duke Statistics – Unit 6: Simple Linear Regression
– http://www2.stat.duke.edu/~tjl13/s101/slides/unit6lec3H.pdf
 The University of Sydney – STAT3022 Applied Linear Models Lecture 3
– http://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Lecture/lecture03_2020JC.html#1
Thank You!
