Linear Regression
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm
import scipy as sp
import statsmodels.tsa.api as smt
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
Variable Description
• CRIM: Per capita crime rate by town
• ZN: Proportion of residential land zoned for lots over 25,000 sq.ft.
• INDUS: Proportion of non-retail business acres per town.
• NOX: Nitric oxides concentration (parts per 10 million)
• RM: Average number of rooms per dwelling
• AGE: Proportion of owner-occupied units built prior to 1940
• DIS: Weighted distances to five Boston employment centers
• RAD: Index of accessibility to radial highways
• TAX: Full-value property-tax rate per 10,000 dollars
• PTRATIO: Pupil-teacher ratio by town
• BLACK: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
• LSTAT: Percentage of lower status of the population
• MEDV: Median value of owner-occupied homes in 1,000's of dollars
The target variable is MEDV, the median house price.
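The cell that loads the dataset into data is not included in this excerpt; a minimal sketch, assuming the Boston housing data is available as a CSV file (the filename housing.csv is a hypothetical placeholder):

import pandas as pd

# Hypothetical filename; the original notebook loads the Boston housing data
# into a DataFrame called `data` with the columns listed above.
data = pd.read_csv("housing.csv")
data.head()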
[3]: data.shape
print("Columns:",data.shape[1])
print("Rows:",data.shape[0])
Columns: 13
Rows: 506
2 Splitting data into training & testing sets (Validation set approach)
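The cell that defines the feature matrix x and the target y for the Boston data is not shown; a minimal sketch, assuming the target column is named medv as in the later plots:

x = data.drop(columns=['medv'])   # 12 predictor columns
y = data['medv']                  # target: median house price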
[6]: # test_size=0.2: 20% of the data is held out for testing, 80% is used for training
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)
3 Creating a linear regression model object
[7]: model=LinearRegression()
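The cells that fit the model and generate predictions are not shown; a minimal sketch of the steps implied by the output below and by the y_pred used later:

model.fit(x_train, y_train)      # returns the fitted LinearRegression() shown below
y_pred = model.predict(x_test)   # predictions on the held-out test set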
[8]: LinearRegression()
[10]: 38.42706211257645
[11]: 76.90953567794605
9 Mean Squared Error for the testing data
[13]: MSE = mean_squared_error(y_test, y_pred)   # mean_squared_error(y_true, y_pred)
MSE
[13]: 34.21225254753325
[14]: 5.84912408378667
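The value above is presumably the root mean squared error of the test predictions, i.e. the square root of the MSE computed above:

RMSE = np.sqrt(MSE)   # ≈ 5.849
RMSE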
[16]: # define x, y — from here the notebook uses the advertising dataset (Sales, TV, Radio, Newspaper)
x = data.drop(columns=['Sales'])  # independent variables
y = data.Sales                    # target variable
data.shape
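The cell that fits the ordinary least squares model is not shown; a minimal sketch with statsmodels that produces a summary of the form shown below:

import statsmodels.api as sm

X = sm.add_constant(x)            # adds the intercept term ("const" in the summary)
ols_model = sm.OLS(y, X).fit()
print(ols_model.summary())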
==============================================================================
Dep. Variable: Sales R-squared: 0.903
Model: OLS Adj. R-squared: 0.901
Method: Least Squares F-statistic: 605.4
Date: Sun, 21 Jan 2024 Prob (F-statistic): 8.13e-99
Time: 15:17:22 Log-Likelihood: -383.34
No. Observations: 200 AIC: 774.7
Df Residuals: 196 BIC: 787.9
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 4.6251 0.308 15.041 0.000 4.019 5.232
TV 0.0544 0.001 39.592 0.000 0.052 0.057
Radio 0.1070 0.008 12.604 0.000 0.090 0.124
Newspaper 0.0003 0.006 0.058 0.954 -0.011 0.012
==============================================================================
Omnibus: 16.081 Durbin-Watson: 2.251
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.655
Skew: -0.431 Prob(JB): 9.88e-07
Kurtosis: 4.605 Cond. No. 454.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[17]: (200, 4)
12.1 1. Linearity
The assumption checks below return to the Boston housing model (note the lowercase column names).
[23]: fig, (ax1, ax2, ax3) = plt.subplots(nrows=3, figsize=(5, 12))
ax1.scatter(data['nox'], data['medv'])
ax1.set_title('nox - Nitric oxides concentration (parts per 10 million)')
ax2.scatter(data['rm'], data['medv'])
ax2.set_title('rm - Average number of rooms per dwelling')
ax3.scatter(data['age'], data['medv'])
ax3.set_title('age - Proportion of owner-occupied units built prior to 1940')
plt.show()
12.2 2. Homoscedasticity
[24]: # Residuals on the test set
residuals = y_test - y_pred
# Standardize the residuals before plotting them against the fitted values
standardized_residuals = residuals / np.std(residuals)
plt.scatter(y_pred, standardized_residuals)
plt.xlabel("Predicted values")
plt.ylabel("Standardized Residuals")
plt.title("Residuals vs Fitted Values")
plt.axhline(y=0, color='r', linestyle='--')  # horizontal reference line at y=0
plt.show()
12.3 3. Multivariate Normality
Normality of Residuals
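The plotting cells for this check are not shown in this excerpt; a minimal sketch of the usual checks, a histogram of the residuals and a normal Q-Q plot:

from scipy import stats

# Histogram of the residuals with a kernel density estimate
sns.histplot(residuals, kde=True)
plt.title("Distribution of residuals")
plt.show()

# Q-Q plot of the residuals against the normal distribution
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()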
[28]: # Homework: apply the Kolmogorov–Smirnov and Shapiro–Wilk tests to check multivariate normality.
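A minimal sketch of how these tests could be applied to the residuals with scipy.stats (one possible answer to the homework, not the original author's):

from scipy import stats

# Shapiro–Wilk: the null hypothesis is that the residuals come from a normal distribution
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Kolmogorov–Smirnov against a normal distribution with the sample mean and std
ks_stat, ks_p = stats.kstest(residuals, 'norm', args=(residuals.mean(), residuals.std()))

print("Shapiro-Wilk p-value:", shapiro_p)
print("Kolmogorov-Smirnov p-value:", ks_p)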
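Before the Durbin–Watson test, the notebook draws an autocorrelation (ACF) plot, presumably of the residuals; the code is not shown, so here is a minimal sketch using statsmodels (the exact function used originally is an assumption):

from statsmodels.graphics.tsaplots import plot_acf

# Autocorrelation plot of the residuals; bars inside the shaded confidence band
# indicate no significant autocorrelation at that lag.
plot_acf(residuals, lags=20)
plt.show()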
[30]: #perform Durbin-Watson test
durbin_watson(residuals)
[30]: 2.0260473760330022
A Durbin–Watson statistic close to 2 indicates little or no autocorrelation in the residuals.
[31]: vif = []
for i in range(x_train.shape[1]):
    vif.append(variance_inflation_factor(x_train.values, i))
pd.DataFrame({'Vif': vif}, index=data.columns[0:12])
[31]: Vif
crim 2.095894
zn 2.928062
indus 13.829768
nox 80.580602
rm 80.295207
age 22.821527
dis 14.784871
rad 14.694806
tax 57.284635
ptratio 87.191073
black 21.647351
lstat 11.319795
We can quantify the degree of multicollinearity with the Variance Inflation Factor (VIF). It can be interpreted as:
• 1 = not correlated
• 1–5 = moderately correlated
• greater than 5 = highly correlated
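A heatmap of the correlation matrix is a common way to visualize the pairwise relationships behind these VIF values; a minimal sketch (not necessarily the plot drawn in the original notebook):

plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()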
[33]: # rad and tax both show high VIF values; inspect their relationship directly
plt.scatter(data["rad"], data["tax"])
plt.xlabel("rad")
plt.ylabel("tax")
plt.show()
[34]: # Homework: use other regression methods to build predictive models for medv (house price).
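A minimal sketch of one such alternative, ridge regression from scikit-learn, reusing the existing train/test split (one possible answer, not the original author's):

from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

ridge = Ridge(alpha=1.0)          # alpha = regularization strength (assumed value)
ridge.fit(x_train, y_train)
ridge_mse = mean_squared_error(y_test, ridge.predict(x_test))
ridge_mse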