1 Regression
In the file Auto.csv you can find data of 397 different cars. Read the csv file with
the read_csv function of the Pandas package. In some rows we have question marks
indicating missing data. Mark them as NaN by using the following command to read
the data frame:
>>> auto_df = pd.read_csv('../Data/Auto.csv', na_values='?')
Then remove all rows that are not complete with the dropna() function.
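The loading step can be tried out without the real file. The following is only a minimal sketch: the three inline rows and the name sample_csv are made up for illustration and are not taken from Auto.csv.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for Auto.csv (the real file has more rows and columns).
sample_csv = io.StringIO(
    "mpg,horsepower,name\n"
    "18.0,130,car a\n"
    "25.0,?,car b\n"
    "24.0,95,car c\n"
)

# na_values='?' turns the question marks into NaN while parsing.
auto_df = pd.read_csv(sample_csv, na_values="?")

# dropna() keeps only the rows that have no missing value.
auto_df = auto_df.dropna()
print(auto_df)
```

Because the question mark is parsed as NaN, the horsepower column is read as a numeric column instead of a column of strings.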
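a) Perform a simple linear regression with 'mpg' as the response y and 'horsepower'
as the predictor X with the following functions included in the statsmodels.api
package. Note that sm.OLS does not add an intercept by itself, so a constant
column has to be added to X first:
>>> X = sm.add_constant(X)
>>> model = sm.OLS(y, X)
>>> estimate = model.fit()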
Print the results and comment on the output. What is the predicted mpg
associated with a horsepower of 98?
b) Plot the data and the estimate. You can obtain the estimate values with the
following command:
>>> fitted_values = estimate.fittedvalues
c) Produce diagnostic plots of the least squares regression fit. You can obtain the
residuals, the studentized residuals and the leverages with the following functions:
>>> residuals = estimate.resid.values
>>> studentized_residuals = OLSInfluence(estimate).resid_studentized_internal
>>> leverages = OLSInfluence(estimate).hat_matrix_diag
Comment on any problem you see with the fit.
MLDA Exercise 1 – Linear Regression SS 2023
a) Produce a scatterplot matrix which includes all the variables in the dataset. To
do this, use the scatter_matrix command included in the Pandas package.
b) Compute the matrix of correlations between the variables by using the following
command:
>>> auto_df.corr(numeric_only=True)
Remark: auto_df is the Auto dataframe. The variable 'name' is qualitative and is
therefore excluded from the correlation matrix. Pandas 2.0 and later no longer
drop it automatically, hence the argument numeric_only=True.
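A version-independent alternative is to drop the qualitative column explicitly before computing the correlations. The sketch below uses a made-up four-row frame (toy_df is a hypothetical name, and the values are not from Auto.csv).

```python
import pandas as pd

# Hypothetical toy frame with two numeric columns and one qualitative column.
toy_df = pd.DataFrame(
    {
        "mpg": [18.0, 25.0, 24.0, 30.0],
        "horsepower": [130.0, 95.0, 97.0, 70.0],
        "name": ["a", "b", "c", "d"],
    }
)

# Remove the qualitative column, then compute pairwise Pearson correlations.
corr_matrix = toy_df.drop(columns="name").corr()
print(corr_matrix)
```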
c) Perform a multiple linear regression with mpg as the response y and all other
variables (except 'name') as the predictors X. Print the results and comment on
the output.
d) Compute the variance inflation factors with the following command:
>>> VIFs = [(predictor, variance_inflation_factor(X.values, i))
for i, predictor in enumerate(list(X))]
Comment on the output.
e) Produce diagnostic plots of the linear regression fit. Comment on any problems
you see with the fit. Does the residual plot suggest any unusually large outliers?
Does the leverage plot identify any observations with unusually high leverage?
f) Fit linear regression models with interaction effects. Do any interactions appear
to be statistically significant? Use the example:
y = 2 + 2x_1 + 0.3x_2 + ϵ
Pandas
Matplotlib
Numpy
Statsmodels
OLSInfluence function