Simple Linear Regression

This document discusses key concepts in simple linear regression analysis including: 1) Residuals represent the distance between observed data points and the regression line, and are used to calculate sums of squares. 2) ANOVA tables are used to partition total variability into explained and unexplained components to assess model fit. Measures like the standard error of the estimate and R-squared indicate how well data fits the regression line. 3) Hypothesis testing and confidence intervals can be used to make inferences about the slope and intercept coefficients in the simple linear regression model.


Residual = the vertical distance between a data point and the best-fitted line

When the correlation is not zero, we can use x to estimate the value of y

ANOVA table → partitions the total variability of the regression model
Each deviation is measured between a data point and the mean of the data (yᵢ − ȳ)

SST (total sum of squares) → Σ(yᵢ − ȳ)^2
SSR (regression sum of squares, explained) → Σ(ŷᵢ − ȳ)^2
SSE (sum of squared errors, unexplained) → Σ(yᵢ − ŷᵢ)^2

SST = SSR + SSE (total sum of squares = regression sum of squares + sum of squared errors)
The least squares method finds the line that minimises SSE
ANOVA table → how well the regression model fits our observed data
Measures of fit for our model:
Approach 1: standard error of the estimate (SEE; also the SD of the errors, s)
s = SEE = sqrt(MSE) = sqrt(SSE/(n − 2))

Approach 2: coefficient of determination (R^2)

R^2 = 1 − SSE/Syy = (Syy − SSE)/Syy
= reduction in the sum of squared errors due to x / sum of squared errors when using ŷ = ȳ

F stat = MSR/MSE
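
As an illustration, the sketch below (made-up numbers, not from the notes) computes the quantities above — SST, SSR, SSE, SEE, R^2, and the F statistic — for a least-squares fit:

```python
# Illustrative sketch only: toy data, assumed for this example.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 9.9, 12.3])
n = len(x)

# Least squares estimates: b1 = Sxy/Sxx, b0 = ybar - b1*xbar
xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
Sxy = np.sum((x - xbar) * (y - ybar))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
yhat = b0 + b1 * x

SST = np.sum((y - ybar) ** 2)      # total sum of squares
SSR = np.sum((yhat - ybar) ** 2)   # regression (explained) sum of squares
SSE = np.sum((y - yhat) ** 2)      # error (unexplained) sum of squares

SEE = np.sqrt(SSE / (n - 2))       # standard error of the estimate, s
R2 = 1 - SSE / SST                 # coefficient of determination
F = (SSR / 1) / (SSE / (n - 2))    # F stat = MSR/MSE (1 df for the slope)

print(b0, b1, SST, SSR, SSE, SEE, R2, F)
```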

Indications
If the observed data are close to the regression line, SSE is low and SEE is also low.
If R^2 is high and SEE is low → a good indication that the model is a good fit (high confidence in the estimate).

If the observed data are far from the regression line, SEE is high and R^2 is low → the regression is a poor fit.

When R^2 = 1, SSE must equal 0, i.e. all the points fall exactly on a straight line.

Simple linear regression model and its properties


Estimate of slope: b1 = r·(sy/sx) = Sxy/Sxx, where r = Cov(x,y)/(sx·sy) and Cov(x,y) = Σ(xᵢ − x̄)(yᵢ − ȳ)/(n − 1)

F stat = MSR/MSE → test for the population coefficient of correlation (H0: ρ = 0)
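
A quick numeric check (toy data, assumed for this sketch) that the two slope formulas above agree, b1 = r·(sy/sx) and b1 = Sxy/Sxx:

```python
# Illustrative check only; the data are made up.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.8, 4.1, 5.9, 8.2, 9.8])

sx, sy = x.std(ddof=1), y.std(ddof=1)   # sample SDs (n-1 divisor)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
r = cov_xy / (sx * sy)                   # sample correlation

b1_via_r = r * sy / sx
b1_via_S = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

print(b1_via_r, b1_via_S)  # identical up to floating-point error
```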

Residuals

The statistical model for SLR assumes that, for each value of x, the value of y is normally distributed with a mean that depends linearly on x, and a SD that does not depend on x. The SD is constant and the same for all values of x.
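
A minimal simulation sketch of this assumption (all parameter values assumed for illustration): for each x, y is drawn from a normal distribution whose mean depends linearly on x while the SD stays constant:

```python
# Simulating the SLR model: y ~ Normal(b0 + b1*x, sigma), same sigma at every x.
import numpy as np

rng = np.random.default_rng(0)
b0, b1, sigma = 1.0, 2.0, 0.5                          # assumed true parameters
x = np.linspace(0, 10, 50)
y = b0 + b1 * x + rng.normal(0.0, sigma, size=x.size)  # mean depends on x, SD does not
```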
Inference on the slope coefficient (hypothesis testing)

SE(b1) = s/sqrt(Sxx) (estimated standard error of b1; given in the summary output as SE Coef)

se = s = standard error of the estimate = sqrt(MSE) = sqrt(SSE/(n − 2)) (standard error of the regression)

ii) Hypothesis test for the slope b1 and the intercept b0

- 2-sided test: H0: bi = b* vs Ha: bi ≠ b*, for any hypothesized value b*
→ Observed test statistic (t-stat): t = (bi − b*)/SE(bi) (~ Tn−2 under H0)
→ Reject H0 if |t| > tα/2,n−2 or p-value = 2P(Tn−2 ≥ |t|) < α


- 1-sided test:
→ Reject H0 if t > tα,n−2 or P(Tn−2 ≥ t) < α for Ha: bi > b*; reject H0 if t < −tα,n−2 or P(Tn−2 ≤ t) < α for Ha: bi < b*
Inferences in SLR: reject the claim (i.e., H0 below) that a parameter (b0 or b1) in SLR equals a value b*, with a 5% chance of committing a Type I error:
H0: bi = b* vs Ha: bi ≠ b*
Reject the null hypothesis (here, the researcher's claim) if any of the following holds:
1) The absolute value of the t-statistic is larger than t0.025,n−2 ≈ 2;
2) The p-value computed from the t-statistic is less than 0.05; or
3) b* lies outside the 95% CI for the parameter bi
Even if the errors are not normally distributed, we can still apply these hypothesis tests when n > 30 (central limit theorem).
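
A sketch of the two-sided slope test above (toy data; b* = 0 is an assumed hypothesized value):

```python
# Illustrative only: t = (b1 - b*) / SE(b1), compared against T(n-2).
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.2, 3.8, 6.1, 7.9, 10.2, 11.8, 14.1])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

s = np.sqrt(np.sum(resid ** 2) / (n - 2))  # standard error of the estimate
se_b1 = s / np.sqrt(Sxx)                   # SE(b1) = s / sqrt(Sxx)

b_star = 0.0                               # hypothesized value b* (assumed)
t = (b1 - b_star) / se_b1
p = 2 * stats.t.sf(abs(t), df=n - 2)       # p-value = 2P(T(n-2) >= |t|)
print(t, p)                                # reject H0 if p < 0.05
```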
Inference with confidence intervals
sx = sample SD of x; s = se = sqrt(MSE) = sqrt(SSE/(n − 2)) (standard error of the regression)

Coefficient of determination (R^2)


R^2 = [corr(X, Y)]^2 = SSR/(SSR + SSE) = (Syy − SSE)/Syy = 1 − SSE/SST

(Syy = Σ(yᵢ − ȳ)^2; Syy = sy^2·(n − 1))

95% C.I. for the parameter bi in SLR: bi ± t0.025,n−2 · SE(bi)
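
An illustrative sketch (made-up data) of computing this 95% CI for the slope:

```python
# Sketch only: 95% CI for b1 = b1 ± t(0.025, n-2) * SE(b1).
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.9, 4.2, 5.8, 8.1, 9.7, 12.2])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
resid = y - (y.mean() + b1 * (x - x.mean()))     # fitted values in centered form
s = np.sqrt(np.sum(resid ** 2) / (n - 2))
se_b1 = s / np.sqrt(Sxx)

t_crit = stats.t.ppf(0.975, df=n - 2)            # t(0.025, n-2)
print(b1 - t_crit * se_b1, b1 + t_crit * se_b1)  # 95% CI for b1
```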



How to explain whether a prediction is reliable

1. R^2 is 30%, meaning that 70% of the variation in y remains unexplained.
2. The standard error of the estimate (SEE, SD of the errors, s): s = SEE = sqrt(MSE) = sqrt(SSE/(n − 2))

Prediction interval → a CI for a new observation (wider than the CI for the mean response)
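
A sketch of a 95% prediction interval at a new point x0 (data and x0 assumed for illustration), using the standard SLR prediction-interval formula:

```python
# Sketch only: yhat0 ± t(0.025, n-2) * s * sqrt(1 + 1/n + (x0-xbar)^2/Sxx).
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])
n = len(x)

xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * xbar
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

x0 = 3.5                                  # new x value (assumed)
yhat0 = b0 + b1 * x0
half = stats.t.ppf(0.975, n - 2) * s * np.sqrt(1 + 1/n + (x0 - xbar) ** 2 / Sxx)
print(yhat0 - half, yhat0 + half)         # 95% prediction interval for a new y at x0
```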

CH 5 Confidence intervals

T distribution: mound shaped, symmetric about 0, fatter tails than the standard normal; larger df → closer to the standard normal.
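
A quick numeric check of this df behaviour (illustrative):

```python
# As df grows, t critical values approach the standard-normal value.
from scipy import stats

for df in (5, 30, 1000):
    print(df, stats.t.ppf(0.975, df))  # shrinks toward 1.96 as df increases
print(stats.norm.ppf(0.975))           # standard normal: about 1.9600
```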

A point estimator → a single value that estimates an unknown population parameter.

The empirical rule requires the stated percentages to hold as well, not only the bell shape.
