Lecture 04

The document discusses the influence of outliers and high-leverage points on multiple regression results, emphasizing the importance of detecting and analyzing these observations to avoid biased outcomes. It introduces methods for identifying influential points, such as Cook's distance and leverage, and explains the use of dummy variables in regression analysis to represent qualitative data. Additionally, it covers logistic regression for qualitative dependent variables, highlighting the need for non-linear transformations to accurately model binary outcomes.


Extensions of Multiple Regression

INFLUENCE ANALYSIS

• Besides violations of regression assumptions, there is the issue that a small number of observations in a sample could potentially influence and bias regression results.
• An influential observation is an observation whose inclusion may significantly alter regression results.
• We discuss how to detect them and how to determine whether they do influence regression results.
• Two kinds of observations may potentially influence regression results:
• A high-leverage point, a data point having an extreme value of an independent variable.
• An outlier, a data point having an extreme value of the dependent variable.
• Both are substantially different from the majority of sample observations, but each presents itself in different ways.
• Exhibit 1 shows a high-leverage point (triangle) in a sample of observations. Its
X value does not follow the trend of the other observations; rather, it has an
unusually high, possibly extreme, X value relative to the other observations.
• This observation should be investigated to determine whether it is an
influential high-leverage point.
• Also, note the two estimated regression lines: The dashed line includes the
high-leverage point in the regression sample; the solid line deletes it from the
sample.
• Exhibit 2 shows an outlier data point (triangle) in a sample of observations.
• Its Y value does not follow the trend of the other observations; rather, it has an
unusual, possibly extreme, Y value relative to its predicted value, Ŷ, resulting in
a large residual, (Y – Ŷ).
• This observation should be investigated to determine whether it is an
influential outlier.
• Also, note the two estimated regression lines: The dashed line includes the
outlier in the regression sample; the solid line deletes it from the sample.
• Outliers and high-leverage points are unusual but not necessarily a problem.

• For instance, a high-leverage point may deviate substantially from the other observations (in terms of values of the independent variable), but it may still lie close to the (solid) regression line.

• Problems arise if the high-leverage point or the outlier point are distant from
the regression line. In these cases, the effect of the extreme value is to “tilt” the
estimated regression line toward it, affecting slope coefficients and goodness-
of-fit statistics.
Detecting Influential Points
• A scatterplot is a straightforward way to identify outliers and high-leverage
points in simple linear regression.

• However, multiple linear regression requires a quantitative way to measure extreme values in order to reliably identify influential observations.

• A high-leverage point can be identified using a measure called leverage (hii).

• For a particular independent variable, leverage measures the distance between the value of the ith observation of that independent variable and the mean value of that variable across all n observations.
• Leverage is a value between 0 and 1, and the higher the leverage, the more distant the observation's value is from the variable's mean and, hence, the more influence the ith observation can potentially exert on the estimated regression.
• The sum of the individual leverages for all observations equals k + 1, where k is the number of independent variables and 1 is added for the intercept.

• A useful rule of thumb for the leverage measure is that if an observation's leverage exceeds 3(k + 1)/n, then it is a potentially influential observation.
• Software packages can easily calculate the leverage measure.


• Given the broad themes of “health consciousness” and “aging population,” a
senior specialty retail analyst tasks you, a junior analyst, with initiating
coverage of nutritional supplement retailers.

• You begin by analyzing a cross-sectional dataset of 15 such specialty retailers to determine the impact of the number of their unique products (PROD)—such as vitamins, probiotics, antioxidants, and joint supplements—and the percentage of their online sales (ONLINE) on operating profit margins (OPM).
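A minimal Python sketch of the leverage screen for this kind of dataset. The file name, column names, and use of statsmodels are assumptions for illustration, not taken from the lecture:

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical dataset of 15 specialty retailers (file and column names are illustrative).
    df = pd.read_csv("retailers.csv")             # columns: OPM, PROD, ONLINE
    X = sm.add_constant(df[["PROD", "ONLINE"]])   # k = 2 independent variables plus intercept
    results = sm.OLS(df["OPM"], X).fit()

    # Leverage h_ii is the diagonal of the hat matrix.
    leverage = results.get_influence().hat_matrix_diag
    k, n = 2, len(df)
    cutoff = 3 * (k + 1) / n                      # rule of thumb: 3(k + 1)/n
    print(df[leverage > cutoff])                  # potentially influential high-leverage points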
• As for identifying outliers, observations with unusual dependent variable
values, the preferred method is to use studentized residuals.
• The logic and process behind this method are as follows:
• Estimate the initial regression model with n observations, then sequentially
delete observations, one at a time, and each time re-estimate the regression
model on the remaining (n – 1) observations.
• Compare the observed Y values (on n observations) with the predicted Y values
resulting from the models with the ith observation deleted—on (n – 1)
observations.
• For a given observation i, the difference (or residual) between the observed Yi and the predicted Y with the ith observation deleted is ei* = Yi − Ŷi*.
• Divide this residual by the estimated standard deviation or standard error of the residuals, se*, which produces the studentized deleted residual, ti* = ei*/se*.
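A minimal sketch of the same idea in Python: statsmodels reports these externally (deleted) studentized residuals directly, so the leave-one-out loop does not have to be coded by hand. The fitted model `results` and the cutoff of 3 are assumptions carried over from the sketch above:

    from statsmodels.stats.outliers_influence import OLSInfluence

    influence = OLSInfluence(results)              # results: a fitted sm.OLS model
    t_star = influence.resid_studentized_external  # studentized deleted residuals t_i*

    # A common screen flags observations with |t_i*| above a critical t-value as outliers.
    print(t_star[abs(t_star) > 3])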
• Outliers and high-leverage points are not necessarily influential.

• An observation is considered influential if its exclusion from the sample causes substantial changes in the estimated regression function.

• Cook’s distance, or Cook’s D (Di), is a metric for identifying influential data points. It measures how much the estimates of the regression change if observation i is deleted and the model is re-estimated.
• It is expressed as Di = [ei² / (k × s²)] × [hii / (1 − hii)²], where ei is the residual for observation i, k is the number of independent variables, s² is the mean squared error of the regression, and hii is the leverage of observation i.
• The following are some key points to note about Cook’s D:

• It depends on both residuals and leverages (dependent and independent variable information plays a role), so it is a composite measure for detecting extreme values of both types of variables.
• It summarizes in a single measure how much all of the regression’s estimated values change when the ith observation is deleted from the sample.

• A large Di indicates that the ith observation strongly influences the regression’s
estimated values.
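A minimal sketch of Cook's D in Python, reusing the fitted model from the earlier sketches. The screening thresholds in the comments are common conventions stated as assumptions, not thresholds given in the lecture:

    influence = results.get_influence()
    cooks_d, _ = influence.cooks_distance          # Cook's D for each observation

    n = int(results.nobs)
    # One common (assumed) heuristic: investigate observations with D_i > 4/n,
    # and treat D_i well above 1.0 as strongly influential.
    flagged = cooks_d > 4 / n
    print(cooks_d[flagged])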
• Besides detecting influential data points, we must investigate why they occur
and determine a remedy.

• In some cases, an influential data point is simply due to data input errors or
inaccurate measurements. The remedy is to either correct the erroneous data or
discard them and then to re-estimate the regression using the cleansed sample.

• Alternatively, the dataset can also be winsorized to mitigate the impact of the outliers found in it (see the sketch after this list).

• In other cases, the influential data points are valid, which may indicate that
important explanatory variables are omitted from the model or regression
assumptions are being violated.
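For the winsorizing remedy mentioned above, a minimal sketch using scipy; the 5% limits are an illustrative choice, and df, X, and sm are reused from the earlier sketches:

    import numpy as np
    from scipy.stats.mstats import winsorize

    # Cap the bottom and top 5% of OPM values at the 5th and 95th percentiles.
    df["OPM_wins"] = np.asarray(winsorize(df["OPM"].to_numpy(), limits=[0.05, 0.05]))

    # Re-estimate the regression on the winsorized dependent variable.
    results_wins = sm.OLS(df["OPM_wins"], X).fit()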
DUMMY VARIABLES IN A MULTIPLE LINEAR REGRESSION

• Analysts often must use qualitative variables as independent variables in a regression.
• One such type of variable is a dummy variable (or indicator variable).
• A dummy variable takes on a value of 1 if a particular condition is true and 0 if that condition is false.
• A key purpose of using dummy variables is to distinguish between “groups” or “categories” of data.
• A dummy variable may arise in several ways, including the following:
• It may reflect an inherent property of the data (e.g., industry membership).
• It may be a characteristic of the data represented by a condition that is either true or false (e.g., a date before or after a key market event).
• It may be constructed from some characteristic of the data, where the dummy variable reflects a condition that is either true or false (e.g., firm sales less than or greater than some value).
• We must be careful when choosing the number of dummy variables in a
regression to represent a specific condition.
• If we want to distinguish among n categories, we need n - 1 dummy variables.
• So, if we use dummy variables to denote companies belonging to one of five industry sectors, we use four dummies, as shown in Exhibit 10.
• The analysis still applies to five categories, but the category not assigned
becomes the “base” or “control” group and the slope of each dummy variable
is interpreted relative to the base.
• In this case, the base group is Food & Beverage.
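A minimal sketch of building n − 1 sector dummies in Python with pandas; the column name "sector" and the sector labels are assumptions for illustration:

    import pandas as pd

    firms = pd.DataFrame({"sector": ["Food & Beverage", "Retail", "Energy",
                                     "Technology", "Utilities"]})

    # Create one dummy per sector, then drop the base category explicitly
    # so the remaining four dummies are interpreted relative to it.
    dummies = pd.get_dummies(firms["sector"], prefix="sector")
    dummies = dummies.drop(columns=["sector_Food & Beverage"])
    print(dummies)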
• A common type of dummy variable is the intercept dummy. Consider a regression model for the dependent variable Y that involves one continuous independent variable, X, and one intercept dummy variable, D:

Yi = b0 + d0Di + b1Xi + εi.

This single regression model estimates two lines of best fit corresponding to the value of the dummy variable:

If D = 0, then the equation becomes Y = b0 + b1X + ε (base category).

If D = 1, then the equation becomes Y = (b0 + d0) + b1X + ε (category to which the changed intercept applies).
• A different scenario uses a dummy that allows for slope differences, a slope
dummy, which can be explained using a simple model with one continuous
variable (X) and one slope dummy variable (D).

• Yi = b0 + b1Xi + d1DiXi + εi.

• The slope dummy variable creates an interaction term between the X variable
and the condition represented by D = 1.
• The slope dummy is interpreted as a change in the slope between the categories
captured by the dummy variable:
• If D = 0, then Y = b0 + b1X + ε (base category).
• If D = 1, then Y = b0 + (b1 + d1)X + ε (category to which the changed slope applies).
• It is also possible for a regression to use dummies in both the slope and the
intercept. To do so, we combine the two previous models.
• Yi = b0 + d0Di + b1Xi + d1DiXi + εi.

• If D = 0, then Y = b0 + b1X + ε (base category).

• If D = 1, then Y = (b0 + d0) + (b1 + d1)X + ε (category to which both the changed intercept and changed slope apply).
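A minimal sketch of fitting the combined intercept-and-slope-dummy model with the statsmodels formula API, on a small synthetic dataset (all names and coefficient values here are illustrative assumptions):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    df_toy = pd.DataFrame({"x": rng.normal(size=100),
                           "d": rng.integers(0, 2, size=100)})
    df_toy["y"] = (1.0 + 0.5 * df_toy["d"] + 2.0 * df_toy["x"]
                   + 0.8 * df_toy["d"] * df_toy["x"]
                   + rng.normal(scale=0.1, size=100))

    # 'x * d' expands to x + d + x:d, i.e., an intercept dummy (d0) plus a slope dummy (d1).
    model = smf.ols("y ~ x * d", data=df_toy).fit()
    print(model.params)   # Intercept ≈ b0, d ≈ d0, x ≈ b1, x:d ≈ d1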
• Exhibit 12 illustrates dummy variables in a regression using a cross-section of mutual fund data.
An analyst has been tasked with analyzing how mutual fund characteristics affect fund returns.
She uses a large database of mutual funds that includes several styles: blend, growth, and value.

• The dependent variable is fund five-year average annual return.

• The independent variables are


■ fund expense ratio (EXP),
■ fund portfolio cash ratio (CASH),
■ fund age (AGE), and
■ the natural logarithm of fund size (SIZE).

Given three possible style categories, she uses n - 1 = 2 dummy variables:


■ BLEND, which takes a value of 1 if the fund is a blend fund and 0 otherwise;
■ GROWTH, which takes a value of 1 if the fund is a growth fund and 0 otherwise; and
■ VALUE, the base category without a dummy.
Returns i = b0 + b1EXPi + b2CASHi + b3AGEi + b4SIZEi + d1BLENDi + d2GROWTHi + εi.

The dummy coefficients—0.66 for BLEND and 2.50 for GROWTH—suggest blend funds deliver average
annual returns exceeding those of the value category by 0.66% while growth funds deliver 2.50% more
than the base value category.

Moreover, the intercept coefficient suggests that an average annual return of –2.91% is unexplained by the
model’s independent variables.
Returns i = b0 + b1EXPi + b2CASHi + b3AGEi + b4SIZEi + d1BLENDi + d2GROWTHi + d3AGE_BLENDi + d4AGE_GROWTHi + εi.

The interaction term AGE_GROWTH is statistically significant, with a p-value of 0.01, implying that for growth funds each additional year of age adds an extra annual return equal to the sum of the AGE and AGE_GROWTH coefficients, or 0.085% (= 0.065% + 0.020%). So, the “slope” coefficient for GROWTH (with respect to AGE) is the sum of those two coefficients. Finally, we can interpret the overall result as suggesting that growth funds’ returns exceed those of value funds by 2.347%, or 2.262% (GROWTH) plus 0.085% (AGE + AGE_GROWTH) for each year of a fund’s age.
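A minimal sketch of how this fund regression, including the AGE interaction terms, could be specified with the statsmodels formula API. The data file is hypothetical; its column names are assumed to match the variable names above:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical fund dataset; file and column names mirror the variables above.
    funds = pd.read_csv("funds.csv")
    spec = ("Returns ~ EXP + CASH + AGE + SIZE + BLEND + GROWTH"
            " + AGE:BLEND + AGE:GROWTH")
    fund_model = smf.ols(spec, data=funds).fit()
    print(fund_model.params)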
MULTIPLE LINEAR REGRESSION WITH QUALITATIVE
DEPENDENT VARIABLES

• A qualitative dependent variable (categorical dependent variable) is an outcome variable describing data that fit into categories.

• For example, to predict whether a company will go bankrupt or not, we need a qualitative dependent variable (bankrupt or not bankrupt) and company financial performance data (e.g., return on equity, debt-to-equity ratio, or debt rating) as independent variables.

• In this example, the bankrupt or not bankrupt qualitative dependent variable is binary, but a dependent variable that falls into more than two categories is also possible.
• In contrast to a linear regression, the dependent variable here is not continuous
but discrete (binary).

• Estimating such a model using linear regression is not appropriate.

• If we were to try to estimate this using the qualitative dependent variable, such
as Y = 1 if bankrupt and 0 if not, in a linear model with three independent
variables, then we would be estimating a linear probability model:

Yi = b0 + b1X1i + b2X2i + b3X3i + εi.


• The problem with this form is that the predicted value of the
dependent variable could be greater than 1 or less than 0,
depending on the estimated coefficients bi and the value of
observed independent variables.

• Generating predicted values above 1.0 or below 0 would be invalid, because the probability of bankruptcy (or of anything) cannot be greater than 1.0 or less than 0.

• Moreover, linear regression assumes the relationship between the probability of bankruptcy and each financial variable is linear over the range of the financial variable, which might be unrealistic.

• For example, one can reasonably expect that the probability of bankruptcy and the debt-to-equity ratio are not linearly related.
• To address these issues, we apply a non-linear transformation to the probability of
bankruptcy and relate the transformed probabilities linearly to the independent variables.

• The most commonly used transformation is the logistic transformation.

• The figure below shows the linear probability model (Panel A) and the logistic regression model (Panel B); the logit model’s non-linear function takes on a sigmoidal shape and is approximately linear except when probability estimates are close to zero or one.
• Let P be the probability of bankruptcy or, generally, that a condition is fulfilled or an event happens. The logistic transformation is ln[P/(1 − P)], the natural log of the ratio P/(1 − P).

• This is a ratio of probabilities—the probability that the event of interest happens (P)
divided by the probability that it does not happen (1 - P), with the ratio representing
the odds of an event happening.

• For example, if the probability of a company going bankrupt is 0.75, then P/(1 − P) = 0.75/(1 − 0.75) = 3, and the odds of bankruptcy are 3 to 1.

• This implies that the probability of bankruptcy is three times as large as the
probability of the company not going bankrupt.

• The natural logarithm (ln) of the odds of an event happening is the log odds, which
is also known as the logit function.
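The odds and log-odds arithmetic from the bankruptcy example, as a minimal Python check:

    import math

    p = 0.75                      # probability of bankruptcy
    odds = p / (1 - p)            # 0.75 / 0.25 = 3.0, i.e., odds of 3 to 1
    log_odds = math.log(odds)     # ln(3) ≈ 1.0986, the logit of p
    print(odds, log_odds)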
• Logistic regression (logit) uses the logistic transformation of the event probability (P) into the log odds as the dependent variable:

ln[P/(1 − P)] = b0 + b1X1 + b2X2 + … + bkXk + ε.

Logistic regression assumes a logistic distribution for the error term; the
distribution’s shape is similar to the normal distribution but with fatter tails.
• Logistic regression coefficients are typically estimated using the maximum likelihood
estimation (MLE) method rather than by least squares.
• The prediction is binary (0 or 1), and the logistic regression function produces the
probability (P) that the outcome is 1.

• Notably, a hypothesis test that a logit regression coefficient is significantly different from
zero is similar to the test in OLS regression.

• Logistic regression does not have an equivalent measure to R2 because it cannot be fitted
using least squares.
• Pseudo-R2 has been proposed to capture the explained variation in a logistic regression
and is generated in standard software output.

• The pseudo-R2 must be interpreted with care because it can only be used to compare
different specifications of the same model (not models based on different datasets).
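A minimal sketch of estimating a logit model by MLE in statsmodels, on a small synthetic bankruptcy-style dataset (the data, coefficient values, and variable names are illustrative assumptions):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))                          # e.g., ROE, D/E, rating score (synthetic)
    true_logit = 0.5 + X @ np.array([-1.0, 1.5, -0.5])
    y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))     # 1 = bankrupt, 0 = not bankrupt

    logit_res = sm.Logit(y, sm.add_constant(X)).fit()      # maximum likelihood estimation
    print(logit_res.params)        # estimated coefficients
    print(logit_res.prsquared)     # McFadden's pseudo-R2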
• Determining the marginal effect of a change in a variable in the logistic model is not as
straightforward as in a regression model.

• Unlike linear OLS regression, logistic regression is non-linear, so the interpretation of an estimated coefficient depends not only on that coefficient but also on the levels of the other variables.

• Because of this non-linearity, each variable’s marginal contribution to the slope of the probability curve changes along the curve.

• That is why the impact of a one-unit incremental change in an independent variable on the probability of Y = 1 depends on the levels of all the other independent variables.
• In a logistic regression, the change in the probability for a given change in a variable is the
slope of the probability curve.

• The slope can be expressed as the derivative of probability.

• That probability contains both an exponential function and the derivative of that
exponential function.

• The derivative is the exponential function itself multiplied by the derivative of the
contents of the exponential function.

• Effectively, in a logistic model the value of the derivative changes depending on where the observation sits on the probability curve.
• In the linear probability model, however, the derivative is a constant; thus, the slope is
constant and the marginal effect is a constant.
• In Exhibit 15, it is clear that the impact of a one-unit change in X1 on P(Y = 1) will depend on the overall value of (b̂0 + b̂1X1 + b̂2X2 + b̂3X3).

• When (b̂0 + b̂1X1 + b̂2X2 + b̂3X3) is very small or very large for an observation, the impact of a one-unit change in X1 on P(Y = 1) will be small because the slope of P(Y = 1) is close to zero.

• However, when (b̂0 + b̂1X1 + b̂2X2 + b̂3X3) is near the inflection point, P(Y = 1) = 0.5, a one-unit change in X1 will have a larger impact on P(Y = 1), since the slope of P(Y = 1) is large.
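A minimal sketch of this marginal-effect point in Python: for a logit model, dP/dX1 = b1 × P × (1 − P), so the effect of a one-unit change in X1 is largest near P = 0.5 and shrinks toward zero in the tails. It reuses logit_res, X, sm, and np from the earlier synthetic sketch:

    # Average marginal effects computed by statsmodels.
    print(logit_res.get_margeff().summary())

    # The same idea by hand for one observation: dP/dX1 = b1 * P * (1 - P).
    b = logit_res.params
    x_obs = sm.add_constant(X)[0]                 # first observation, including the constant
    p_hat = 1 / (1 + np.exp(-(x_obs @ b)))
    marginal_x1 = b[1] * p_hat * (1 - p_hat)      # largest when p_hat is near 0.5
    print(marginal_x1)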
• The likelihood ratio (LR) test is a method to assess the fit of logistic
regression models and is based on the log-likelihood metric that describes the
fit to the data.

• The LR test statistic is LR = −2 × (Log-likelihood of the restricted model − Log-likelihood of the unrestricted model).

• The LR test statistic is distributed as chi-squared with q degrees of freedom (i.e., the number of restrictions).
• Note the log-likelihood metric is always negative, so higher values (closer to 0)
indicate a better-fitting model.

• Importantly, unlike adjusted R2 (or adj. R2), the log-likelihood metric for a
given model is not meaningful by itself but is useful when comparing
regression models that have the same dependent variable.

• The LR test performs best when applied to large samples.
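A minimal sketch of the LR test in Python, comparing the unrestricted logit from the earlier synthetic sketch with a restricted model that drops one variable (so q = 1); the chi-squared p-value comes from scipy:

    from scipy.stats import chi2

    # Restricted model: drop the third regressor (one restriction, q = 1).
    restricted = sm.Logit(y, sm.add_constant(X[:, :2])).fit()

    LR = -2 * (restricted.llf - logit_res.llf)     # log-likelihoods are negative; closer to 0 is better
    p_value = chi2.sf(LR, df=1)
    print(LR, p_value)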
