Lecture 04
INFLUENCE ANALYSIS
• High-leverage points and outliers are both substantially different from the majority of
sample observations, but each presents itself in a different way.
• Exhibit 1 shows a high-leverage point (triangle) in a sample of observations. Its
X value does not follow the trend of the other observations; rather, it has an
unusually high, possibly extreme, X value relative to the other observations.
• This observation should be investigated to determine whether it is an
influential high-leverage point.
• Also, note the two estimated regression lines: The dashed line includes the
high-leverage point in the regression sample; the solid line deletes it from the
sample.
• Exhibit 2 shows an outlier data point (triangle) in a sample of observations.
• Its Y value does not follow the trend of the other observations; rather, it has an
unusual, possibly extreme, Y value relative to its predicted value, Ŷ, resulting in
a large residual, (Y – Ŷ).
• This observation should be investigated to determine whether it is an
influential outlier.
• Also, note the two estimated regression lines: The dashed line includes the
outlier in the regression sample; the solid line deletes it from the sample.
• Outliers and high-leverage points are unusual but not necessarily a problem.
• For instance, a high-leverage point may deviate substantially from the other
observations (in terms of values of the independent variable), but it may still
lie close to the (solid) regression line.
• Problems arise if the high-leverage point or the outlier is distant from
the regression line. In these cases, the effect of the extreme value is to “tilt” the
estimated regression line toward it, affecting slope coefficients and goodness-
of-fit statistics.
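To make the “tilting” effect concrete, here is a minimal sketch on synthetic data: the point at X = 25 is a hypothetical influential high-leverage observation, and the two fitted slopes correspond to the dashed (point included) and solid (point deleted) lines in the exhibits.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(10, 1, 30)                    # X values clustered near 10
y = 2.0 + 0.5 * x + rng.normal(0, 0.5, 30)   # true slope of 0.5

# One hypothetical high-leverage point: extreme X, far from the line
x_full = np.append(x, 25.0)
y_full = np.append(y, 5.0)

slope_deleted = np.polyfit(x, y, 1)[0]             # "solid line": point deleted
slope_included = np.polyfit(x_full, y_full, 1)[0]  # "dashed line": point included
print(f"slope without the point: {slope_deleted:.3f}")
print(f"slope with the point:    {slope_included:.3f}")
```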
Detecting Influential Points
• A scatterplot is a straightforward way to identify outliers and high-leverage
points in simple linear regression.
• Leverage is a value between 0 and 1, and the higher the leverage, the more
distant the observation’s value is from the variable’s mean and, hence, the
more influence the ith observation can potentially exert on the estimated
regression.
• The sum of the individual leverages for all observations equals k + 1, where k
is the number of independent variables and 1 is added for the intercept.
• A large Cook’s distance (Di) indicates that the ith observation strongly influences
the regression’s estimated values.
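A hedged sketch of how these diagnostics might be computed with statsmodels on made-up data: `hat_matrix_diag` gives each observation’s leverage and `cooks_distance` gives Di; the 4/n cutoff is just one common rule of thumb, not the only one.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

rng = np.random.default_rng(0)
n, k = 40, 2                                  # n observations, k regressors
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(scale=0.4, size=n)

results = sm.OLS(y, sm.add_constant(X)).fit()
influence = OLSInfluence(results)

leverage = influence.hat_matrix_diag          # each h_ii lies between 0 and 1
cooks_d = influence.cooks_distance[0]         # Cook's distance D_i

print(leverage.sum())                         # equals k + 1 = 3
print(np.where(cooks_d > 4 / n)[0])           # observations flagged by a 4/n rule
```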
• Besides detecting influential data points, we must investigate why they occur
and determine a remedy.
• In some cases, an influential data point is simply due to data input errors or
inaccurate measurements. The remedy is either to correct the erroneous data or to
discard them, and then to re-estimate the regression using the cleansed sample.
• In other cases, the influential data points are valid, which may indicate that
important explanatory variables are omitted from the model or regression
assumptions are being violated.
DUMMY VARIABLES IN A MULTIPLE LINEAR REGRESSION
This single regression model, Yi = b0 + d0Di + b1Xi + εi, estimates two lines of best fit
corresponding to the value of the dummy variable: intercept b0 when D = 0 and intercept
b0 + d0 when D = 1.
• The slope dummy variable creates an interaction term between the X variable
and the condition represented by D = 1.
• The slope dummy is interpreted as a change in the slope between the categories
captured by the dummy variable:
• If D = 0, then Y = b0 + b1X + ε (base category).
• If D = 1, then Y = b0 + (b1 + d1)X + ε (category to which the changed slope applies).
• It is also possible for a regression to use dummies in both the slope and the
intercept. To do so, we combine the two previous models.
• Yi = b0 + d0Di + b1Xi + d1DiXi + εi. (5)
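A minimal sketch of estimating equation (5) on simulated data (all coefficient values here are made up): the design matrix stacks the intercept, D, X, and the interaction D·X.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
d = (rng.random(n) > 0.5).astype(float)  # dummy: 0 = base, 1 = other category

# Simulate from equation (5) with hypothetical b0, d0, b1, d1
y = 1.0 + 0.8 * d + 0.5 * x + 0.4 * d * x + rng.normal(scale=0.3, size=n)

design = sm.add_constant(np.column_stack([d, x, d * x]))
results = sm.OLS(y, design).fit()
print(results.params)  # estimates of [b0, d0, b1, d1]
# Fitted line for D = 0: b0 + b1*X; for D = 1: (b0 + d0) + (b1 + d1)*X
```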
The dummy coefficients—0.66 for BLEND and 2.50 for GROWTH—suggest that blend funds deliver average
annual returns exceeding those of the base (value) category by 0.66%, while growth funds deliver
2.50% more.
Moreover, the intercept coefficient suggests that an average annual return of –2.91% is unexplained by the
model’s independent variables.
Returnsi = b0 + b1EXPi + b2CASHi + b3AGEi + b4SIZEi + d1BLENDi + d2GROWTHi + d3AGE_GROWTHi + εi
The interaction term AGE_GROWTH is statistically significant, with a p-value = 0.01, implying for
growth funds an extra annual return with each year of age equal to the sum of the AGE and
AGE_GROWTH coefficients, or 0.085% (= 0.065% + 0.020%). So, the “slope” coefficient for
GROWTH (with respect to AGE) is the sum of those two coefficients. Finally, we can interpret the
overall result as suggesting that growth funds’ returns exceed those of value funds by 2.347%, or
2.262% (GROWTH) plus 0.085% (AGE + AGE_GROWTH), for each year of a fund’s age, since the
AGE_GROWTH interaction applies only to growth funds.
MULTIPLE LINEAR REGRESSION WITH QUALITATIVE
DEPENDENT VARIABLES
• If we were to try to estimate this using the qualitative dependent variable, such
as Y = 1 if bankrupt and 0 if not, in a linear model with three independent
variables, then we would be estimating a linear probability model:
Y = b0 + b1X1 + b2X2 + b3X3 + ε.
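A quick numeric sketch (hypothetical coefficients) of why the linear probability model is problematic: its fitted values are not confined to [0, 1], while the logistic transform always is.

```python
import numpy as np

b0, b1 = 0.1, 0.4                      # hypothetical coefficients
x = np.array([-3.0, 0.0, 3.0])

p_linear = b0 + b1 * x                 # LPM fitted values: unbounded
p_logistic = 1 / (1 + np.exp(-(b0 + b1 * x)))  # always strictly in (0, 1)

print(p_linear)    # [-1.1  0.1  1.3] -> invalid "probabilities" at the extremes
print(p_logistic)
```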
• The figure below shows the linear probability model (Panel A) and the logistic regression model
(Panel B), where the logit model’s non-linear function takes on a sigmoidal shape and is
approximately linear except when probability estimates are close to zero or one.
• Let P be the probability of bankruptcy or, generally, that a condition is fulfilled or an
event happens. The logistic transformation is ln[P/(1 − P)].
• This is a ratio of probabilities—the probability that the event of interest happens (P)
divided by the probability that it does not happen (1 - P), with the ratio representing
the odds of an event happening.
• For example, if the probability of a company going bankrupt is 0.75, then P/(1 − P) =
0.75/(1 − 0.75) = 3, so the odds of bankruptcy are 3 to 1.
• This implies that the probability of bankruptcy is three times as large as the
probability of the company not going bankrupt.
• The natural logarithm (ln) of the odds of an event happening is the log odds, which
is also known as the logit function.
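The same arithmetic in a few lines of Python, using the 0.75 bankruptcy probability from the example above:

```python
import math

p = 0.75                    # probability of the event (bankruptcy)
odds = p / (1 - p)          # 0.75 / 0.25 = 3, i.e., odds of 3 to 1
log_odds = math.log(odds)   # the logit: ln(P / (1 - P)) ~= 1.0986

# Recovering the probability from the odds: P = odds / (1 + odds)
print(odds, log_odds, odds / (1 + odds))
```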
• Logistic regression (logit) uses the logistic transformation of the event
probability (P) into the log odds as the dependent variable:
ln[P/(1 − P)] = b0 + b1X1 + b2X2 + b3X3 + ε.
• Logistic regression assumes a logistic distribution for the error term; the
distribution’s shape is similar to the normal distribution but with fatter tails.
• Logistic regression coefficients are typically estimated using the maximum likelihood
estimation (MLE) method rather than by least squares.
• The prediction is binary (0 or 1), and the logistic regression function produces the
probability (P) that the outcome is 1.
• Notably, a hypothesis test that a logit regression coefficient is significantly different from
zero is similar to the test in OLS regression.
• Logistic regression does not have an equivalent measure to R2 because it cannot be fitted
using least squares.
• Pseudo-R2 has been proposed to capture the explained variation in a logistic regression
and is generated in standard software output.
• The pseudo-R2 must be interpreted with care because it can only be used to compare
different specifications of the same model (not models based on different datasets).
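A hedged sketch of fitting a logit by MLE with statsmodels on simulated data (coefficients are made up): `prsquared` is McFadden’s pseudo-R2 reported in standard output, and `predict()` returns the fitted probabilities P(Y = 1).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 200
X = rng.normal(size=(n, 3))
true_logit = -0.5 + X @ np.array([1.0, -0.8, 0.5])   # made-up coefficients
y = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

results = sm.Logit(y, sm.add_constant(X)).fit(disp=0)  # MLE, not least squares
print(results.params)         # estimated b0..b3, with z-tests in .summary()
print(results.prsquared)      # McFadden's pseudo-R2
print(results.predict()[:5])  # fitted P(Y = 1) for the first five observations
```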
• Determining the marginal effect of a change in a variable in the logistic model is not as
straightforward as in a regression model.
• Unlike linear OLS regression, logistic regression is non-linear, so the interpretation of
a coefficient estimate depends not only on that estimate itself but also on the levels of
the other variables.
• Because of this non-linearity, each variable’s marginal contribution to the slope of the
probability curve varies along the curve.
• That is why the impact of a one-unit incremental change of an independent variable on the
probability of Y = 1 depends on the level of all the other independent variables.
• In a logistic regression, the change in the probability for a given change in a variable is the
slope of the probability curve.
• That slope involves both an exponential function and the derivative of that
exponential function.
• The derivative of an exponential function is the function itself multiplied by the
derivative of the exponent’s contents; for the logistic curve this works out to
P(1 − P) times the variable’s coefficient.
• Effectively, in a logistic model the value of the derivative changes depending on the
observation’s position along the probability curve.
• In the linear probability model, however, the derivative is a constant; thus, the slope is
constant and the marginal effect is a constant.
• In Exhibit 15, it is clear that the impact of a one-unit change in X1 on P(Y = 1) will depend
on the overall value of (b̂0 + b̂1X1 + b̂2X2 + b̂3X3).
• When (b̂0 + b̂1X1 + b̂2X2 + b̂3X3) is very small or very large for an observation,
the impact of a one-unit change in X1 on P(Y = 1) will be small because the slope
of P(Y = 1) is close to zero.
• However, when (b̂0 + b̂1X1 + b̂2X2 + b̂3X3) is near the inflection point, P(Y = 1) = 0.5, a
one-unit change in X1 will have a larger impact on P(Y = 1), since the slope of P(Y = 1)
is large.
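A small numeric sketch of this point with hypothetical coefficients: the logit marginal effect of X1 is b̂1·P(1 − P), which is largest at the inflection point P = 0.5 and shrinks toward zero in the flat tails of the curve.

```python
import numpy as np

b = np.array([0.0, 1.0, 0.5, -0.5])   # hypothetical b0..b3

def p_hat(x1, x2=0.0, x3=0.0):
    z = b[0] + b[1] * x1 + b[2] * x2 + b[3] * x3
    return 1 / (1 + np.exp(-z))

for x1 in (-4.0, 0.0, 4.0):           # vary X1, holding X2 = X3 = 0
    p = p_hat(x1)
    marginal = b[1] * p * (1 - p)     # slope of the probability curve at x1
    print(f"x1 = {x1:+.0f}  P = {p:.3f}  dP/dx1 = {marginal:.3f}")
```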
• The likelihood ratio (LR) test is a method to assess the fit of logistic
regression models and is based on the log-likelihood metric that describes the
fit to the data.
• Importantly, unlike adjusted R2, the log-likelihood metric for a
given model is not meaningful by itself but is useful when comparing
regression models that have the same dependent variable.
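A minimal sketch of an LR test on simulated data: fit nested logit models on the same dependent variable, compare log-likelihoods (`llf`), and refer 2·(difference) to a chi-square distribution with degrees of freedom equal to the number of restrictions.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(3)
n = 300
X = rng.normal(size=(n, 3))
z = -0.2 + X @ np.array([0.9, -0.6, 0.0])          # made-up coefficients
y = (rng.random(n) < 1 / (1 + np.exp(-z))).astype(int)

unrestricted = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
restricted = sm.Logit(y, sm.add_constant(X[:, :1])).fit(disp=0)  # drops X2, X3

lr = 2 * (unrestricted.llf - restricted.llf)       # LR statistic
p_value = stats.chi2.sf(lr, df=2)                  # 2 restrictions
print(f"LR = {lr:.2f}, p-value = {p_value:.4f}")
```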