
CIT 651: Introduction to Machine Learning and Statistical Analysis

INAS A. YASSINE, PHD

REVIEW: HYPOTHESIS TESTING
Estimators and Sampling Errors

Some Terminology

• Estimator – a statistic derived from a sample to infer the value of a population parameter; e.g., the sample mean or the sample variance.
• Estimate – the value of the estimator in a particular sample.
• Remember: the estimator is a random variable; its value changes from sample to sample.
Sampling Errors
Confidence Interval for a Mean (µ) with Unknown σ

• Use the Student's t distribution instead of the normal distribution when the population is normal but the population standard deviation σ is unknown and the sample size is small.

• The confidence interval for µ (unknown σ) becomes

    x̄ ± t_{α/2} · s/√n

Note: the degrees of freedom for the t distribution are n − 1.
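As a quick illustration, the interval above can be computed by hand. The following is a minimal sketch in Python (the lab later in this lecture uses R); the sample values are hypothetical and the critical value is read from a t table.

```python
import math
from statistics import mean, stdev

# Hypothetical sample of n = 8 measurements (illustrative values only)
sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9]
n = len(sample)
xbar = mean(sample)
s = stdev(sample)        # sample standard deviation; population sigma unknown

# Critical value t_{alpha/2, n-1} from a t table: t_{0.025, 7} = 2.365
t_crit = 2.365

# CI: xbar +/- t_crit * s / sqrt(n)
half_width = t_crit * s / math.sqrt(n)
ci = (xbar - half_width, xbar + half_width)
```

With a larger n, t_crit shrinks toward the normal critical value 1.96, so the interval narrows.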


EXAMPLE 1: KNOWN POPULATION STANDARD DEVIATION

A website's number of monthly visitors follows a normal distribution with mean = 200k and standard deviation = 16k. A marketing campaign was recently run, and a sample of n = 50 records shows an average of 203.5k visitors. Is there a statistically significant change in the number of visitors?

Step 1: State the hypotheses.
H0: µ = 200
H1: µ ≠ 200 (why two-tailed?)

Step 2: Select the level of significance.
α = 0.01 as stated in the problem.

Step 3: Select the test statistic.
Use the Z distribution, since σ is known.

Step 4: Formulate the decision rule.
Reject H0 if |Z| > Z_{α/2}, where Z = (x̄ − µ) / (σ/√n).

Step 5: Make a decision and interpret the result.

    Z = (203.5 − 200) / (16/√50) ≈ 1.55, and Z_{0.01/2} = 2.58

• Because 1.55 is not > 2.58, the test statistic does not fall in the rejection region, and H0 is not rejected.
• We conclude that there is not sufficient evidence of a change in the number of visitors from 200k.

Exercise: can you repeat the test to determine whether there is an increase in the number of visitors?
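The computation in Step 5 can be checked with a few lines of Python. This is a minimal sketch; the sample size n = 50 is taken from the slide's own calculation (16/√50), and the critical value 2.576 is the standard z value for α = 0.01, two-tailed.

```python
import math

mu0 = 200.0    # hypothesized mean (thousands of visitors)
sigma = 16.0   # known population standard deviation (thousands)
xbar = 203.5   # observed sample mean (thousands)
n = 50         # sample size, implied by 16/sqrt(50) in the slide

# Z test statistic for a mean with known sigma
z = (xbar - mu0) / (sigma / math.sqrt(n))

# Two-tailed critical value z_{alpha/2} for alpha = 0.01
z_crit = 2.576

reject = abs(z) > z_crit   # False: 1.55 does not reach the rejection region
```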
SIMPLE LINEAR REGRESSION

Outline

1 Visualization and Analysis of Correlation
2 Simple Regression of Y on X
3 Ordinary Least Squares Formulae
4 Tests for Significance
5 Confidence and Prediction Intervals for Y
6 Residual Analysis
1 VISUALIZATION AND ANALYSIS OF CORRELATION

Visual Displays
• Begin the analysis of bivariate data with a scatter plot.
• A scatter plot indicates (visually) the strength of the relationship between the two variables.
Correlation Coefficient

The sample correlation coefficient (r) measures the degree of linearity in the relationship between X and Y.

    −1 ≤ r ≤ +1

r = 0 indicates no linear relationship.

*Note: r is an estimate of the population correlation coefficient ρ.
[Figure: six scatter plots illustrating strong positive, weak positive, weak negative, and strong negative correlation, no correlation, and a nonlinear relation.]


• Because r is an estimate of the population correlation ρ, it is a random variable that depends on the sample.

• So, is there a statistical test that can help us determine whether two variables are correlated?

• In other words, can we test the hypothesis:

    H0: ρ = 0
Step 1: State the Hypotheses.
Determine whether you are using a one- or two-tailed test and the level of significance (α).
H0: ρ = 0
H1: ρ ≠ 0

Step 2: Specify the Decision Rule.
For degrees of freedom d.f. = n − 2, look up the critical value tα.

Step 3: Calculate the Test Statistic:

    t_calc = r·√(n − 2) / √(1 − r²)

Step 4: Make the Decision.
If |t_calc| exceeds the critical value tα (equivalently, if |r| exceeds the corresponding critical value rα), reject H0.
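The four steps above can be sketched in a few lines of Python. The paired data here are hypothetical, chosen only to illustrate the calculation; the critical value t_{0.025, 4} = 2.776 comes from a t table.

```python
import math

# Hypothetical paired observations (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.9]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Sample correlation coefficient r
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
r = sxy / math.sqrt(sxx * syy)

# Test statistic: t = r * sqrt(n - 2) / sqrt(1 - r^2), d.f. = n - 2
t_calc = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Two-tailed critical value from a t table: t_{0.025, 4} = 2.776
reject = abs(t_calc) > 2.776   # True: the correlation is significant here
```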
2 SIMPLE REGRESSION

• Simple regression analyzes the relationship between two variables.
• It specifies one dependent (response) variable, Y, and one independent (predictor) variable, X.
• The hypothesized relationship here is linear.
• Please note that:
  • this is NOT a cause-and-effect relationship;
  • prediction is accurate within the measured range of X (but not outside it).
Models and Parameters

• The assumed model for a linear relationship is y = β0 + β1·x + ε.
• The relationship holds for all pairs (xi, yi).
• The error term ε is assumed normally distributed with mean 0 and standard deviation σ.
• The unknown parameters (to be estimated) are
  β0  the intercept
  β1  the slope
• The fitted model used to predict the expected value of Y for a given value of X is

    ŷ = b0 + b1·x

• The fitted coefficients are
  b0  the estimated intercept
  b1  the estimated slope
• For a given measurement x, the value of Y is random.
• The best estimate of the random variable Y is its mean: E(Y | X = x).
• The problem can be formulated as: given x, what is the expected value of Y?
• The idea is to estimate a line (equation) that minimizes the sum of squared errors over the measured samples.
3 ORDINARY LEAST SQUARES FORMULAS

Slope and Intercept

• It can be shown that the slope and intercept of the regression line can be estimated using the following equations:

    b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
    b0 = ȳ − b1·x̄

Residual (error):

    ei = yi − ŷi

The sum of squared errors is given by

    SSE = Σ ei²

The variance of the error, se², is estimated by

    se² = SSE / (n − 2)
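These formulas translate directly into code. A minimal sketch in Python on hypothetical data (the lab uses R; the values below are illustrative only):

```python
# Hypothetical data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Slope and intercept from the OLS formulas
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar

# Residuals e_i = y_i - yhat_i, SSE, and the error-variance estimate
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
s_e2 = sse / (n - 2)   # se^2 = SSE / (n - 2)
```

A useful sanity check: OLS residuals always sum to zero (up to rounding) when the model includes an intercept.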


Notice that (assessing the fit):

• The squared correlation coefficient r² can be used as a measure of the quality of the linear fit of Y on X, with r² = 1 (i.e., 100%) indicating a perfect fit.
• Since the estimated parameters b0 and b1 are random (they depend on the sample), they have variances.
• Also, we can estimate a confidence interval for each of them.


4. TESTS FOR SIGNIFICANCE

Confidence Intervals for Slope and Intercept

• Confidence intervals for the true slope β1 and intercept β0:

    b1 ± t_{α/2} · s(b1),   b0 ± t_{α/2} · s(b0),   d.f. = n − 2

  where s(b1) = se / √Σ(xi − x̄)² is the standard error of the slope.
Hypothesis Tests

• If β1 = 0, then X cannot influence Y, and the regression model collapses to a constant β0 plus random error.

• The hypotheses to be tested are:

    H0: β1 = 0
    H1: β1 ≠ 0

• The test statistic is t_calc = b1 / s(b1), with d.f. = n − 2.
  Reject H0 if |t_calc| > t_{α/2}, or if p-value ≤ α.
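A hand computation of this t test, sketched in Python on hypothetical data; the critical value t_{0.025, 3} = 3.182 is taken from a t table.

```python
import math

# Hypothetical data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# Standard deviation of the errors and standard error of the slope
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(sse / (n - 2))
s_b1 = s_e / math.sqrt(sxx)

# H0: beta1 = 0 vs H1: beta1 != 0, d.f. = n - 2 = 3
t_calc = b1 / s_b1
reject = abs(t_calc) > 3.182   # t_{0.025, 3} from a t table
```

Here the slope is strongly significant, so H0: β1 = 0 is rejected.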
• Examples of not rejecting the hypothesis: [scatter plots omitted]

• Examples of rejecting the hypothesis: [scatter plots omitted]
6. RESIDUAL ANALYSIS
Three Important Assumptions
1. The errors are normally distributed.
2. The errors have constant variance (i.e., they are homoscedastic).
3. The errors are independent (i.e., they are non-autocorrelated).
[Figure: four residual plots]
a) Satisfactory residual plot: assumptions are OK.
b) Unsatisfactory residual plot: variance increases with the magnitude of x.
c) Unsatisfactory residual plot.
d) Unsatisfactory residual plot: nonlinear pattern.
LAB

1. Load the data from the R library:
   the_data = as.data.frame( EuStockMarkets )
2. Do a linear regression of FTSE on DAX:
   the_output = lm( the_data$FTSE ~ the_data$DAX )
3. Display the results:
   summary( the_output )

• Plot the residuals:
   plot( the_data$FTSE - the_output$fitted.values )
   or
   the_residual = resid( the_output ); plot( the_residual )

• Plot the estimated values versus DAX:
   plot( the_data$DAX, the_output$fitted.values ); abline( the_output, col = "red" )
   or
   estVal = the_output$coefficients[1] + the_output$coefficients[2] * the_data$DAX
   plot( estVal )
ASSIGNMENT
Given the following data:

1. Determine the regression of cholesterol level on age.
2. Plot the scatter diagram and the regression line.
3. Plot the residual error versus age.
4. Determine the goodness of fit by calculating r². Comment on the results.
5. Compute the standard deviation of the errors, ε.
6. Construct a 95% confidence interval for β1.
7. Test at the 5% significance level whether β1 is positive.
8. Using α = .025, can you conclude that the linear correlation coefficient is positive?
END OF LECTURE
THANK YOU…
ADVANCED:
CONFIDENCE AND PREDICTION INTERVALS FOR Y

How to Construct an Interval Estimate for Y

• Confidence interval for the conditional mean of Y at a given x:

    ŷ ± t_{α/2} · se · √( 1/n + (x − x̄)² / Σ(xi − x̄)² )

• Prediction interval for an individual value of Y at a given x:

    ŷ ± t_{α/2} · se · √( 1 + 1/n + (x − x̄)² / Σ(xi − x̄)² )

• Prediction intervals are wider than confidence intervals because individual Y values vary more than the mean of Y.
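The extra "1 +" term inside the prediction-interval radical is what makes it wider. A minimal sketch in Python on hypothetical data (x0 and the critical value t_{0.025, 3} = 3.182 are illustrative assumptions):

```python
import math

# Hypothetical data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
s_e = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2
                    for xi, yi in zip(x, y)) / (n - 2))

x0 = 3.5               # point at which to estimate Y
y_hat = b0 + b1 * x0
t_crit = 3.182         # t_{0.025, n-2} for n = 5

# Half-width of the CI for the conditional mean E(Y | x0)
hw_ci = t_crit * s_e * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)
# Half-width of the PI for an individual Y at x0 (note the extra 1 +)
hw_pi = t_crit * s_e * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)
```

Both intervals are narrowest at x0 = x̄ and widen as x0 moves toward the edge of the observed range of X.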
7. UNUSUAL OBSERVATIONS

Standardized Residuals
• One can use R to compute standardized residuals.
• If the absolute value of any standardized residual is at least 2, then it is classified as unusual.

Leverage and Influence
• A high leverage statistic indicates the observation is far from the mean of X.
• These observations are influential because they are at the "end of the lever."
• The leverage for observation i is denoted hi.
• A leverage that exceeds 3/n is unusual.
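In simple regression the leverage has a closed form, hi = 1/n + (xi − x̄)²/Σ(xi − x̄)², which makes the "far from the mean of X" intuition concrete. A small Python sketch on hypothetical data, using the 3/n rule of thumb from the slide:

```python
# Hypothetical x values; the last point is far from the mean of x
x = [1.0, 2.0, 3.0, 4.0, 10.0]
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

# Leverage of observation i: h_i = 1/n + (x_i - xbar)^2 / Sxx
leverage = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

# Flag observations whose leverage exceeds 3/n (slide's rule of thumb)
unusual = [h > 3 / n for h in leverage]
```

Only the point at x = 10 is flagged; note also that the leverages always sum to 2 in simple regression with an intercept.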
8. OTHER REGRESSION PROBLEMS

Outliers

Outliers may be caused by:
- an error in recording data
- impossible data
- an observation that has been influenced by an unspecified "lurking" variable that should have been controlled but wasn't.

To fix the problem:
- delete the observation(s)
- delete the data
- formulate a multiple regression model that includes the lurking variable.
Model Misspecification
• If a relevant predictor has been omitted, then the model is misspecified.
• Use multiple regression instead of bivariate regression.

Ill-Conditioned Data
• Well-conditioned data values are of the same general order of magnitude.
• Ill-conditioned data have unusually large or small data values and can cause loss
of regression accuracy or awkward estimates.
• Avoid mixing magnitudes by adjusting the magnitude of your data before running
the regression.
Spurious Correlation
• In a spurious correlation two variables appear related because of the way they are
defined.
• This problem is called the size effect or problem of totals.

Model Form and Variable Transforms


• Sometimes a nonlinear model is a better fit than a linear model.
• Variables may be transformed (e.g., logarithmic or exponential functions) in order
to provide a better fit.
• Log transformations reduce heteroscedasticity.
• Nonlinear models may be difficult to interpret.