Data Science 03 - Regression PDF
Some Terminology
• Use the Student’s t distribution instead of the normal distribution when the
population is normal but the population standard deviation σ is unknown and the
sample size is small.
Step 1: State the hypotheses.
H0: µ = 200
H1: µ ≠ 200 (why two-tailed?)
Step 2: Select the level of significance.
α = 0.01 as stated in the problem
Step 3: Select the test statistic.
Use the Z-distribution since σ is known.
Step 4: Formulate the decision rule.
Reject H0 if |Z| > Zα/2, where Z = (X̄ - µ) / (σ / √n)
Z = (203.5 - 200) / (16 / √50) = 1.55
Z.01/2 = Z.005 = 2.58
1.55 is not > 2.58
Step 5: Make a decision and interpret the result.
Ø Because 1.55 does not fall in the rejection region, H0 is not rejected.
Ø We conclude that there is not sufficient evidence that the mean number of
visitors differs from 200.
Can you repeat the test to determine whether there is an increase in the
number of visitors?
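The five steps of this example can be checked with a few lines of Python. This is a minimal sketch, assuming the values from the slide (sample mean 203.5, σ = 16, n = 50, α = 0.01); the function name `z_test_mean` and its `alternative` parameter are illustrative names, not part of the course material.

```python
import math
from statistics import NormalDist


def z_test_mean(xbar, mu0, sigma, n, alpha, alternative="two-sided"):
    """One-sample Z test for a mean when sigma is known.

    Returns (z, critical_value, reject_h0).
    alternative: "two-sided" (H1: mu != mu0) or "greater" (H1: mu > mu0).
    """
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    if alternative == "two-sided":
        # Split alpha between the two tails.
        crit = NormalDist().inv_cdf(1 - alpha / 2)
        reject = abs(z) > crit
    elif alternative == "greater":
        # All of alpha sits in the right tail.
        crit = NormalDist().inv_cdf(1 - alpha)
        reject = z > crit
    else:
        raise ValueError("alternative must be 'two-sided' or 'greater'")
    return z, crit, reject


z, crit, reject = z_test_mean(203.5, 200, 16, 50, 0.01)
print(f"z = {z:.2f}, critical value = {crit:.2f}, reject H0: {reject}")
# prints: z = 1.55, critical value = 2.58, reject H0: False
```

Passing `alternative="greater"` runs the right-tailed version of the same test, which is one way to approach the follow-up question on the slide.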
SIMPLE LINEAR REGRESSION
Outline
Visual Displays
Ø Begin the analysis of bivariate data with a scatter plot.
Ø A scatter plot indicates (visually) the strength of the
relationship between the two variables.
1 VISUALIZATION AND ANALYSIS OF
CORRELATION
Correlation Coefficient
The sample correlation coefficient (r) measures the degree of linearity
in the relationship between X and Y.
-1 ≤ r ≤ +1
r = 0 indicates no linear relationship.
*Note:
r is an estimate of the population correlation
coefficient ρ
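As an illustration, r can be computed from the sums of squares and cross-products, r = SSxy / √(SSxx · SSyy). The sketch below assumes this standard formula (the slide does not show it); the function name `sample_correlation` and the data are hypothetical, made up only to show the three cases r = +1, r = -1 and r = 0.

```python
import math


def sample_correlation(x, y):
    """Sample correlation coefficient r for paired observations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_yy = sum((yi - mean_y) ** 2 for yi in y)
    # Undefined (division by zero) if either variable is constant.
    return ss_xy / math.sqrt(ss_xx * ss_yy)


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # exact increasing line
y_down = [10, 8, 6, 4, 2]   # exact decreasing line
y_curve = [4, 1, 0, 1, 4]   # y = (x - 3)^2: related, but not linearly

r_up = sample_correlation(x, y_up)
r_down = sample_correlation(x, y_down)
r_curve = sample_correlation(x, y_curve)
print(r_up, r_down, r_curve)
```

The last case shows why r = 0 means only "no linear relationship": y depends exactly on x, yet the linear association is zero.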
Residual (error): the observed value minus the fitted value,
eᵢ = yᵢ - ŷᵢ
The sum of squares error is given by
SSE = Σ (yᵢ - ŷᵢ)²
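A small worked example may make these two quantities concrete. The sketch below assumes an ordinary least-squares fit ŷ = b0 + b1·x, where b0 and b1 denote the fitted intercept and slope. It takes each residual as the observed y minus the fitted ŷ, and SSE as the sum of the squared residuals. The helper names `fit_line`, `residuals` and `sum_squared_error` and the data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for y_hat = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = ss_xy / ss_xx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def residuals(x, y, b0, b1):
    """Residual e_i = observed y_i minus fitted y_hat_i."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def sum_squared_error(res):
    """SSE = sum of squared residuals."""
    return sum(e ** 2 for e in res)


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
res = residuals(x, y, b0, b1)
sse = sum_squared_error(res)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, SSE = {sse:.2f}")
# prints: b0 = 2.20, b1 = 0.60, SSE = 2.40
```

With a least-squares fit that includes an intercept, the residuals sum to zero (up to rounding), which is a quick sanity check on the computation.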
Hypothesis Tests
• If β1 = 0, then X cannot influence Y and the regression model
collapses to a constant β0 plus random error.
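The test of a zero slope is usually carried out with a t statistic: the fitted slope divided by its standard error, with n - 2 degrees of freedom. The sketch below assumes that standard form (the slide does not show the formula). The function name `slope_t_statistic`, the data, and the constant `T_CRIT` are hypothetical. The critical value 3.182 is the two-tailed t value for α = 0.05 and 3 degrees of freedom, read from a t table.

```python
import math


def slope_t_statistic(x, y):
    """Return (b1, se_b1, t, df) for the least-squares slope of y on x."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = ss_xy / ss_xx
    b0 = mean_y - b1 * mean_x
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    df = n - 2
    s_e = math.sqrt(sse / df)        # standard error of the regression
    se_b1 = s_e / math.sqrt(ss_xx)   # standard error of the slope
    return b1, se_b1, b1 / se_b1, df


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b1, se_b1, t, df = slope_t_statistic(x, y)

# Two-tailed critical value for alpha = 0.05 and df = 3 (from a t table).
T_CRIT = 3.182
reject = abs(t) > T_CRIT
print(f"b1 = {b1:.3f}, t = {t:.3f}, df = {df}, reject H0: {reject}")
```

Here t is about 2.12, which is below 3.182, so with these five made-up points a zero slope cannot be ruled out at the 5% level. A perfect fit (SSE = 0) makes the standard error zero, so the statistic is undefined in that case.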
Outliers
8. Other Regression Problems
Model Misspecification
• If a relevant predictor has been omitted, then the model is misspecified.
• Use multiple regression instead of bivariate regression.
Ill-Conditioned Data
• Well-conditioned data values are of the same general order of magnitude.
• Ill-conditioned data have unusually large or small data values and can cause loss
of regression accuracy or awkward estimates.
• Avoid mixing magnitudes by adjusting the magnitude of your data before running
the regression.
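One way to adjust magnitudes is to change units before fitting, for example dollars to millions of dollars, or a proportion to a percentage. The sketch below is only an illustration under that assumption: the variable names and the data are hypothetical, and `slope` is a plain least-squares slope. Rescaling does not change the fitted relationship, only the units in which the coefficients are expressed.

```python
import math


def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return ss_xy / ss_xx


# Hypothetical data with badly mixed magnitudes.
x_dollars = [1_200_000, 2_500_000, 3_100_000, 4_800_000]
y_fraction = [0.012, 0.019, 0.024, 0.035]

# Raw fit: the slope is a tiny number (change in fraction per dollar).
raw_slope = slope(x_dollars, y_fraction)

# Rescale so both variables are of the same general order of magnitude.
x_millions = [v / 1_000_000 for v in x_dollars]
y_percent = [100 * v for v in y_fraction]
scaled_slope = slope(x_millions, y_percent)

print(f"raw slope    = {raw_slope:.3e}")
print(f"scaled slope = {scaled_slope:.3f} (percentage points per million)")

# The two slopes differ only by the unit conversion factor 1e6 * 100.
assert math.isclose(scaled_slope, raw_slope * 1e8, rel_tol=1e-9)
```

The rescaled slope is easier to read and report, and keeping values at similar magnitudes reduces the risk of rounding problems in software that works with limited precision.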
Spurious Correlation
• In a spurious correlation, two variables appear related because of the way they are
defined.
• This problem is called the size effect or problem of totals.
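The size effect can be seen in a small constructed example. In the sketch below, two per-capita rates are multiplied by population to give totals. The totals are strongly positively correlated simply because both grow with population, even though the rates themselves are negatively correlated. All names and numbers are hypothetical, chosen only to make the effect visible. Comparing per-capita values instead of totals is a common remedy, stated here as an assumption and not as something the slide prescribes.

```python
import math


def sample_correlation(x, y):
    """Sample correlation coefficient r for paired observations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_yy = sum((yi - mean_y) ** 2 for yi in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


# Hypothetical regions: population (millions) and two per-capita rates.
population = [1, 2, 4, 8, 16, 32]
rate_a = [5, 3, 4, 5, 3, 4]
rate_b = [4, 5, 3, 3, 5, 4]

# Totals are defined as rate times population, so both scale with size.
total_a = [r * p for r, p in zip(rate_a, population)]
total_b = [r * p for r, p in zip(rate_b, population)]

r_totals = sample_correlation(total_a, total_b)
r_rates = sample_correlation(rate_a, rate_b)
print(f"r between totals = {r_totals:.2f}")
print(f"r between rates  = {r_rates:.2f}")
```

In this example the totals give r of about 0.95 while the rates give r = -0.75, so the apparent positive relationship comes entirely from region size.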