
CIT 651: Introduction to Machine Learning and Statistical Analysis

INAS A. YASSINE, PHD

REVIEW: HYPOTHESIS TESTING
Estimators and Sampling Errors

Some Terminology

• Estimator – a statistic derived from a sample to infer the value of a population parameter; e.g., the sample mean or the sample variance.
• Estimate – the value of the estimator in a particular sample.
• Remember: the estimator is a random variable; its value changes from sample to sample.
Sampling Errors
Confidence Interval for a Mean (µ) with Unknown σ

• Use the Student's t distribution instead of the normal distribution when the population is normal but the population standard deviation σ is unknown and the sample size is small.

• The confidence interval for µ (unknown σ) becomes

    x̄ ± t_{α/2} · s/√n

Note: the degrees of freedom for the t distribution are n − 1.
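As a quick illustration, the interval above can be computed by hand. The following is a minimal sketch in Python (the lab later in this lecture uses R); the sample values are hypothetical and the critical value is read from a t table.

```python
import math
from statistics import mean, stdev

# Hypothetical sample of n = 8 measurements (illustrative values only)
sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9]
n = len(sample)
xbar = mean(sample)
s = stdev(sample)        # sample standard deviation; population sigma unknown

# Critical value t_{alpha/2, n-1} from a t table: t_{0.025, 7} = 2.365
t_crit = 2.365

# CI: xbar +/- t_crit * s / sqrt(n)
half_width = t_crit * s / math.sqrt(n)
ci = (xbar - half_width, xbar + half_width)
```

With a larger n, t_crit shrinks toward the normal critical value 1.96, so the interval narrows.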


EXAMPLE 1: KNOWN POPULATION STANDARD DEVIATION

A website's number of monthly visitors follows a normal distribution with mean = 200k and standard deviation = 16k. A marketing campaign was recently run, and a sample of n = 50 records shows an average of 203.5k visitors. Is there a statistically significant change in the number of visitors?

Step 1: State the hypotheses.
H0: µ = 200
H1: µ ≠ 200 (why two-tailed?)

Step 2: Select the level of significance.
α = 0.01 as stated in the problem.

Step 3: Select the test statistic.
Use the Z distribution, since σ is known.

Step 4: Formulate the decision rule.
Reject H0 if |Z| > Z_{α/2}, where Z = (x̄ − µ) / (σ/√n).

Step 5: Make a decision and interpret the result.

    Z = (203.5 − 200) / (16/√50) ≈ 1.55, and Z_{0.01/2} = 2.58

• Because 1.55 is not > 2.58, the test statistic does not fall in the rejection region, and H0 is not rejected.
• We conclude that there is not sufficient evidence of a change in the number of visitors from 200k.

Exercise: can you repeat the test to determine whether there is an increase in the number of visitors?
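The computation in Step 5 can be checked with a few lines of Python. This is a minimal sketch; the sample size n = 50 is taken from the slide's own calculation (16/√50), and the critical value 2.576 is the standard z value for α = 0.01, two-tailed.

```python
import math

mu0 = 200.0    # hypothesized mean (thousands of visitors)
sigma = 16.0   # known population standard deviation (thousands)
xbar = 203.5   # observed sample mean (thousands)
n = 50         # sample size, implied by 16/sqrt(50) in the slide

# Z test statistic for a mean with known sigma
z = (xbar - mu0) / (sigma / math.sqrt(n))

# Two-tailed critical value z_{alpha/2} for alpha = 0.01
z_crit = 2.576

reject = abs(z) > z_crit   # False: 1.55 does not reach the rejection region
```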
SIMPLE LINEAR REGRESSION

Outline

1 Visualization and Analysis of Correlation
2 Simple Regression of Y on X
3 Ordinary Least Squares Formulae
4 Tests for Significance
5 Confidence and Prediction Intervals for Y
6 Residual Analysis
1 VISUALIZATION AND ANALYSIS OF CORRELATION

Visual Displays
• Begin the analysis of bivariate data with a scatter plot.
• A scatter plot indicates (visually) the strength of the relationship between the two variables.
Correlation Coefficient

The sample correlation coefficient (r) measures the degree of linearity in the relationship between X and Y.

    −1 ≤ r ≤ +1

r = 0 indicates no linear relationship.

*Note: r is an estimate of the population correlation coefficient ρ.
[Figure: six scatter plots illustrating strong positive, weak positive, weak negative, and strong negative correlation, no correlation, and a nonlinear relation.]


• Because r is an estimate of the population correlation ρ, it is a random variable that depends on the sample.

• So, is there a statistical test that can help us determine whether two variables are correlated?

• In other words, can we test the hypothesis:

    H0: ρ = 0
Step 1: State the Hypotheses.
Determine whether you are using a one- or two-tailed test and the level of significance (α).
H0: ρ = 0
H1: ρ ≠ 0

Step 2: Specify the Decision Rule.
For degrees of freedom d.f. = n − 2, look up the critical value tα.

Step 3: Calculate the Test Statistic:

    t_calc = r·√(n − 2) / √(1 − r²)

Step 4: Make the Decision.
If |t_calc| exceeds the critical value tα (equivalently, if |r| exceeds the corresponding critical value rα), reject H0.
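The four steps above can be sketched in a few lines of Python. The paired data here are hypothetical, chosen only to illustrate the calculation; the critical value t_{0.025, 4} = 2.776 comes from a t table.

```python
import math

# Hypothetical paired observations (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.9]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Sample correlation coefficient r
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
r = sxy / math.sqrt(sxx * syy)

# Test statistic: t = r * sqrt(n - 2) / sqrt(1 - r^2), d.f. = n - 2
t_calc = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Two-tailed critical value from a t table: t_{0.025, 4} = 2.776
reject = abs(t_calc) > 2.776   # True: the correlation is significant here
```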
2 SIMPLE REGRESSION

• Simple regression analyzes the relationship between two variables.
• It specifies one dependent (response) variable, Y, and one independent (predictor) variable, X.
• The hypothesized relationship here is linear.
• Please note that:
  • this is NOT a cause-and-effect relationship;
  • prediction is accurate within the measured range of X (but not outside it).
Models and Parameters

• The assumed model for a linear relationship is y = β0 + β1·x + ε.
• The relationship holds for all pairs (xi, yi).
• The error term ε is assumed normally distributed with mean 0 and standard deviation σ.
• The unknown parameters (to be estimated) are
  β0  the intercept
  β1  the slope
• The fitted model used to predict the expected value of Y for a given value of X is

    ŷ = b0 + b1·x

• The fitted coefficients are
  b0  the estimated intercept
  b1  the estimated slope
• For a given measurement x, the value of Y is random.
• The best estimate of the random variable Y is its mean: E(Y | X = x).
• The problem can be formulated as: given x, what is the expected value of Y?
• The idea is to estimate a line (equation) that minimizes the sum of squared errors over the measured samples.
3 ORDINARY LEAST SQUARES FORMULAS

Slope and Intercept

• It can be shown that the slope and intercept of the regression line can be estimated using the following equations:

    b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
    b0 = ȳ − b1·x̄

Residual (error):

    ei = yi − ŷi

The sum of squared errors is given by

    SSE = Σ ei²

The variance of the error, se², is estimated by

    se² = SSE / (n − 2)
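These formulas translate directly into code. A minimal sketch in Python on hypothetical data (the lab uses R; the values below are illustrative only):

```python
# Hypothetical data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Slope and intercept from the OLS formulas
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar

# Residuals e_i = y_i - yhat_i, SSE, and the error-variance estimate
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
s_e2 = sse / (n - 2)   # se^2 = SSE / (n - 2)
```

A useful sanity check: OLS residuals always sum to zero (up to rounding) when the model includes an intercept.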


Notice that (assessing the fit):

• The squared correlation coefficient r² can be used as a measure of the quality of the linear fit of Y on X, with r² = 1 (i.e., 100%) indicating a perfect fit.
• Since the estimated parameters b0 and b1 are random (they depend on the sample), they have variances.
• Also, we can estimate a confidence interval for each of them.


4. TESTS FOR SIGNIFICANCE

Confidence Intervals for Slope and Intercept

• Confidence intervals for the true slope β1 and intercept β0:

    b1 ± t_{α/2} · s(b1),   b0 ± t_{α/2} · s(b0),   d.f. = n − 2

  where s(b1) = se / √Σ(xi − x̄)² is the standard error of the slope.
Hypothesis Tests

• If β1 = 0, then X cannot influence Y, and the regression model collapses to a constant β0 plus random error.

• The hypotheses to be tested are:

    H0: β1 = 0
    H1: β1 ≠ 0

• The test statistic is t_calc = b1 / s(b1), with d.f. = n − 2.
  Reject H0 if |t_calc| > t_{α/2}, or if p-value ≤ α.
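A hand computation of this t test, sketched in Python on hypothetical data; the critical value t_{0.025, 3} = 3.182 is taken from a t table.

```python
import math

# Hypothetical data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# Standard deviation of the errors and standard error of the slope
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(sse / (n - 2))
s_b1 = s_e / math.sqrt(sxx)

# H0: beta1 = 0 vs H1: beta1 != 0, d.f. = n - 2 = 3
t_calc = b1 / s_b1
reject = abs(t_calc) > 3.182   # t_{0.025, 3} from a t table
```

Here the slope is strongly significant, so H0: β1 = 0 is rejected.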
• Examples of not rejecting the hypothesis: [scatter plots omitted]

• Examples of rejecting the hypothesis: [scatter plots omitted]
6. RESIDUAL ANALYSIS
Three Important Assumptions
1. The errors are normally distributed.
2. The errors have constant variance (i.e., they are homoscedastic).
3. The errors are independent (i.e., they are non-autocorrelated).
[Figure: four residual plots]
a) Satisfactory residual plot: assumptions are OK.
b) Unsatisfactory residual plot: variance increases with the magnitude of x.
c) Unsatisfactory residual plot.
d) Unsatisfactory residual plot: nonlinear pattern.
LAB

1. Load the data from the R library:
   the_data = as.data.frame( EuStockMarkets )
2. Do a linear regression of FTSE on DAX:
   the_output = lm( the_data$FTSE ~ the_data$DAX )
3. Display the results:
   summary( the_output )

• Plot the residuals:
   plot( the_data$FTSE - the_output$fitted.values )
   or
   the_residual = resid( the_output ); plot( the_residual )

• Plot the estimated values versus DAX:
   plot( the_data$DAX, the_output$fitted.values ); abline( the_output, col = "red" )
   or
   estVal = the_output$coefficients[1] + the_output$coefficients[2] * the_data$DAX
   plot( estVal )
ASSIGNMENT
Given the following data:

1. Determine the regression of cholesterol level on age.
2. Plot the scatter diagram and the regression line.
3. Plot the residual error versus age.
4. Determine the goodness of fit by calculating r². Comment on the results.
5. Compute the standard deviation of the errors, ε.
6. Construct a 95% confidence interval for β1.
7. Test at the 5% significance level whether β1 is positive.
8. Using α = .025, can you conclude that the linear correlation coefficient is positive?
END OF LECTURE
THANK YOU…
ADVANCED:
CONFIDENCE AND PREDICTION INTERVALS FOR Y

How to Construct an Interval Estimate for Y

• Confidence interval for the conditional mean of Y at a given x:

    ŷ ± t_{α/2} · se · √( 1/n + (x − x̄)² / Σ(xi − x̄)² )

• Prediction interval for an individual value of Y at a given x:

    ŷ ± t_{α/2} · se · √( 1 + 1/n + (x − x̄)² / Σ(xi − x̄)² )

• Prediction intervals are wider than confidence intervals because individual Y values vary more than the mean of Y.
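The extra "1 +" term inside the prediction-interval radical is what makes it wider. A minimal sketch in Python on hypothetical data (x0 and the critical value t_{0.025, 3} = 3.182 are illustrative assumptions):

```python
import math

# Hypothetical data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
s_e = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2
                    for xi, yi in zip(x, y)) / (n - 2))

x0 = 3.5               # point at which to estimate Y
y_hat = b0 + b1 * x0
t_crit = 3.182         # t_{0.025, n-2} for n = 5

# Half-width of the CI for the conditional mean E(Y | x0)
hw_ci = t_crit * s_e * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)
# Half-width of the PI for an individual Y at x0 (note the extra 1 +)
hw_pi = t_crit * s_e * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)
```

Both intervals are narrowest at x0 = x̄ and widen as x0 moves toward the edge of the observed range of X.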
7. UNUSUAL OBSERVATIONS

Standardized Residuals
• One can use R to compute standardized residuals.
• If the absolute value of any standardized residual is at least 2, then it is classified as unusual.

Leverage and Influence
• A high leverage statistic indicates the observation is far from the mean of X.
• These observations are influential because they are at the "end of the lever."
• The leverage for observation i is denoted hi.
• A leverage that exceeds 3/n is unusual.
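In simple regression the leverage has a closed form, hi = 1/n + (xi − x̄)²/Σ(xi − x̄)², which makes the "far from the mean of X" intuition concrete. A small Python sketch on hypothetical data, using the 3/n rule of thumb from the slide:

```python
# Hypothetical x values; the last point is far from the mean of x
x = [1.0, 2.0, 3.0, 4.0, 10.0]
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

# Leverage of observation i: h_i = 1/n + (x_i - xbar)^2 / Sxx
leverage = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

# Flag observations whose leverage exceeds 3/n (slide's rule of thumb)
unusual = [h > 3 / n for h in leverage]
```

Only the point at x = 10 is flagged; note also that the leverages always sum to 2 in simple regression with an intercept.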
8. OTHER REGRESSION PROBLEMS

Outliers

Outliers may be caused by:
- an error in recording data
- impossible data
- an observation that has been influenced by an unspecified "lurking" variable that should have been controlled but wasn't.

To fix the problem:
- delete the observation(s)
- delete the data
- formulate a multiple regression model that includes the lurking variable.
Model Misspecification
• If a relevant predictor has been omitted, then the model is misspecified.
• Use multiple regression instead of bivariate regression.

Ill-Conditioned Data
• Well-conditioned data values are of the same general order of magnitude.
• Ill-conditioned data have unusually large or small data values and can cause loss
of regression accuracy or awkward estimates.
• Avoid mixing magnitudes by adjusting the magnitude of your data before running
the regression.
Spurious Correlation
• In a spurious correlation two variables appear related because of the way they are
defined.
• This problem is called the size effect or problem of totals.

Model Form and Variable Transforms


• Sometimes a nonlinear model is a better fit than a linear model.
• Variables may be transformed (e.g., logarithmic or exponential functions) in order
to provide a better fit.
• Log transformations reduce heteroscedasticity.
• Nonlinear models may be difficult to interpret.