Linear Regression and SVR
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Data ingestion
In [3]: df
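The ingestion cell itself is cut off in the export; for this dataset (the UCI individual household electric power consumption file) it would typically look like the sketch below, with the file name and read options assumed:

# Assumed ingestion; the raw UCI file is ';'-separated and uses '?' for
# missing values (hence the later type conversions and NaN handling).
df = pd.read_csv('household_power_consumption.txt', sep=';',
                 na_values='?', low_memory=False)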
EDA
In [4]: df.shape
Out[4]: (2075259, 9)
df1= df.sample(30000)
In [6]: df1.info()
<class 'pandas.core.frame.DataFrame'>
Feature Information
Global_active_power: household global minute-averaged active power (in kilowatt)
Global_reactive_power: household global minute-averaged reactive power (in kilowatt)
Voltage: minute-averaged voltage (in volt)
Global_intensity: household global minute-averaged current intensity (in ampere)
Date: date in format dd/mm/yyyy
Time: time in format hh:mm:ss
Sub_metering_1, Sub_metering_2, and Sub_metering_3 are the energy sub-meter readings
In [15]: df1.info()
<class 'pandas.core.frame.DataFrame'>
df1['Global_reactive_power'] = df1['Global_reactive_power'].astype(float)
df1['Voltage'] = df1['Voltage'].astype(float)
df1['Global_intensity'] = df1['Global_intensity'].astype(float)
df1['Sub_metering_1'] = df1['Sub_metering_1'].astype(float)
df1['Sub_metering_2'] = df1['Sub_metering_2'].astype(float)
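The repeated casts can be collapsed into one loop; pd.to_numeric with errors='coerce' also turns stray non-numeric entries (such as the dataset's '?' markers) into NaN. A sketch:

# Equivalent, more compact conversion:
cols = ['Global_reactive_power', 'Voltage', 'Global_intensity',
        'Sub_metering_1', 'Sub_metering_2']
for col in cols:
    df1[col] = pd.to_numeric(df1[col], errors='coerce')  # non-numeric -> NaN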
In [18]: df1.info()
<class 'pandas.core.frame.DataFrame'>
In [20]: dff.info()
<class 'pandas.core.frame.DataFrame'>
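dff appears here without its construction cell. Given the columns used later (date, month, year, hour, Minutes, total_metering), it was presumably derived from df1 roughly as follows; this is a sketch, and every step below is inferred rather than shown in the source:

# Assumed derivation of dff from df1 (the defining cell is cut off):
dff = df1.copy()
dt = pd.to_datetime(dff['Date'] + ' ' + dff['Time'],
                    format='%d/%m/%Y %H:%M:%S')
dff['date'] = dt.dt.day        # day of month
dff['month'] = dt.dt.month
dff['year'] = dt.dt.year
dff['hour'] = dt.dt.hour
dff['Minutes'] = dt.dt.minute
dff['total_metering'] = (dff['Sub_metering_1'] + dff['Sub_metering_2']
                         + dff['Sub_metering_3'])
dff = dff.drop(columns=['Date', 'Time'])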
In [21]: dff.describe()
Primary Observation:
There are null values in Global_active_power, Global_reactive_power, Voltage, Global_intensity and total_metering.
Global_active_power, Global_reactive_power, Global_intensity and total_metering also appear to have outliers, and hence the data is skewed.
Global_active_power: observed mean power of 1.087975 and standard deviation 1.052817, minimum power 0.078000, range of 25th percentile to 75th percentile [0.308000 to 1.530000], but maximum power 10.670000. This clearly shows the data is skewed and contains outliers.
Global_reactive_power: observed mean power of 0.123332 and standard deviation 0.113058, minimum power 0, range of 25th percentile to 75th percentile [0.048000 to 0.194000], but maximum power 1.186000. This clearly shows the data is skewed and contains outliers.
Voltage: observed mean voltage of 240.853321 and standard deviation 3.241289, minimum voltage 225.140000, range of 25th percentile to 75th percentile [238.990000 to 242.910000], and maximum voltage 253.420000.
Global_intensity: observed mean current intensity of 4.612513 and standard deviation 4.426749, minimum 0.200000, range of 25th percentile to 75th percentile [1.400000 to 6.400000], but maximum value 46.400000. This clearly shows the data is skewed and contains outliers.
total_metering: observed mean total metering of 8.815839 and standard deviation 12.710086, minimum reading 0, range of 25th percentile to 75th percentile [0 to 18], but maximum reading 126. This is skewed and has outliers.
Out[22]:
1032508    False
1794736    False
590021     False
1946737    False
691314     False
           ...
340664     False
2063841    False
1437739    False
1027465    False
1532620    False
In [23]: dff.skew()
Out[23]:
Global_active_power      1.805078
Global_reactive_power    1.272971
Voltage                 -0.329061
Global_intensity         1.868954
date                    -0.000066
month                   -0.000640
year                    -0.015933
hour                     0.002056
Minutes                  0.003394
total_metering           2.244892
dtype: float64
Handling null values {replacing with the mean for features without outliers, and with the median for features with outliers}
dff['Global_intensity'] = dff['Global_intensity'].fillna(dff['Global_intensity'].median())
dff['total_metering'] = dff['total_metering'].fillna(dff['total_metering'].median())
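Only two fillna calls survive in the export; by the rule stated above, the remaining affected columns would be imputed the same way. A sketch:

# Median for the skewed features, mean for Voltage (no outliers):
dff['Global_active_power'] = dff['Global_active_power'].fillna(dff['Global_active_power'].median())
dff['Global_reactive_power'] = dff['Global_reactive_power'].fillna(dff['Global_reactive_power'].median())
dff['Voltage'] = dff['Voltage'].fillna(dff['Voltage'].mean())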
In [25]: dff.head()
plt.figure(figsize=(15, 15))   # figure call and loop header reconstructed; cut off in the export
for i in range(len(dff.columns)):
    plt.subplot(5, 3, i + 1)
    sns.kdeplot(x=dff[dff.columns[i]], shade=True, color='b')
    plt.xlabel(dff.columns[i])
plt.tight_layout()
Out[28]: <AxesSubplot:xlabel='hour', ylabel='total_metering'>
Observation:
The total reading increases in the morning around 7 am, dips around 3 or 4 in the afternoon, rises again in the evening, and dips after 8 or 9 at night.
Out[29]: <AxesSubplot:xlabel='month', ylabel='total_metering'>
Observation:
In July there is a dip in power consumption.
Out[30]: <AxesSubplot:xlabel='year', ylabel='total_metering'>
Checking correlation
In [31]: dff.corr()
In [32]: plt.figure(figsize=(15,15))
sns.heatmap(data=dff.corr(), annot=True);
Handling multicollinearity
vif_data
            VIF                feature
0   1317.390296    Global_active_power
1      2.937605  Global_reactive_power
2   7550.255484                Voltage
3   1337.011908       Global_intensity
4      4.179946                   date
5      4.530054                  month
6   7638.125596                   year
7      4.222324                   hour
8      3.889487                Minutes
9      5.372782         total_metering
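The cell that builds vif_data is not visible in the export; a minimal sketch using statsmodels' variance_inflation_factor (all names below assumed):

# Assumed construction of vif_data with statsmodels:
from statsmodels.stats.outliers_influence import variance_inflation_factor

numeric = dff.select_dtypes('number')   # VIF is defined for numeric columns only
vif_data = pd.DataFrame({
    'VIF': [variance_inflation_factor(numeric.values, i)
            for i in range(numeric.shape[1])],
    'feature': numeric.columns,
})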
In [36]: dff.head()
Checking outliers
In [37]: plt.figure(figsize=(15, 15))
for i in range(len(dff.columns)):   # loop header reconstructed; cut off in the export
    plt.subplot(5, 3, i + 1)
    sns.boxplot(x=dff[dff.columns[i]])
plt.tight_layout()
In [40]: from feature_engine.outliers import Winsorizer   # import assumed; not visible in the export

winsorizer = Winsorizer(capping_method='iqr',   # cap using the IQR rule
                        tail='both',            # cap left, right or both tails
                        variables=['Global_reactive_power'])
dff['Global_reactive_power'] = winsorizer.fit_transform(dff[['Global_reactive_power']])
In [41]: winsorizer = Winsorizer(capping_method='iqr',   # cap using the IQR rule
                        tail='both',
                        variables=['Voltage'])
dff['Voltage'] = winsorizer.fit_transform(dff[['Voltage']])

In [42]: winsorizer = Winsorizer(capping_method='iqr',
                        tail='both',
                        variables=['Global_intensity'])
dff['Global_intensity'] = winsorizer.fit_transform(dff[['Global_intensity']])

In [43]: winsorizer = Winsorizer(capping_method='iqr',
                        tail='both',
                        variables=['total_metering'])
dff['total_metering'] = winsorizer.fit_transform(dff[['total_metering']])
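The four cells above repeat the same transformer; since Winsorizer accepts a list of variables, they could equivalently be collapsed into one pass. A sketch:

# Equivalent single-pass form:
skewed_cols = ['Global_reactive_power', 'Voltage', 'Global_intensity', 'total_metering']
winsorizer = Winsorizer(capping_method='iqr', tail='both', variables=skewed_cols)
dff[skewed_cols] = winsorizer.fit_transform(dff[skewed_cols])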
In [44]: plt.figure(figsize=(15, 15))
for i in range(len(dff.columns)):   # loop header reconstructed; cut off in the export
    plt.subplot(5, 3, i + 1)
    sns.boxplot(x=dff[dff.columns[i]])
plt.tight_layout()
In [45]: dff.to_csv("power_consumption_data.csv")
import pymongo   # import assumed; the connection string below is truncated in the export
client = pymongo.MongoClient("mongodb://raje:mongodb@ac-tl0bnvo-shard-00-00.bkfbasy
db = client.test
In [50]: ## first we have to convert this dataframe into dict or json format, as this is the format insert_many() accepts
dff_dict = dff.to_dict('records')   # conversion line reconstructed; cut off in the export
#database = client['power_consumption']
#collection = database["power_consumption_data"]
collection.insert_many(dff_dict)
In [54]: ## fetching data from the collection in mongodb {using find() will return all the occurrences}
data = pd.DataFrame(list(collection.find()))
In [55]: data
data.drop(columns=['_id'], inplace=True)
In [57]: data.head()
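The cell defining X is not shown in the export; given that y is total_metering, X is presumably everything but the target. A sketch:

# Assumed feature matrix (the defining cell is cut off):
X = data.drop(columns=['total_metering'])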
In [59]: X
In [60]: y = data['total_metering']
In [61]: y.head()
Out[61]:
0    18.0
1     1.0
2     0.0
3    17.0
4    18.0
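The train/test split cell is cut off in the export; a standard sketch, with test_size and random_state assumed:

from sklearn.model_selection import train_test_split

# Assumed split; the notebook's actual test_size/random_state are not visible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)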
In [63]: X_train
In [64]: X_test
In [65]: y_train
Out[65]:
13707     0.0
10403     1.0
6673      0.0
28904    11.0
2987      0.0
         ...
28017     0.0
17728     0.0
29199    19.0
7293     19.0
17673     2.0
Standardising data
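The cell that fits the scaler is not visible in the export; since `scaler` is pickled below and reused later, it was presumably a scikit-learn StandardScaler along these lines (a sketch):

# Assumed scaling step (fit on the training split only, then transform both):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)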
In [70]: X_train
Out[70]:
array([[ 1.29044859, -0.26217727, -0.26946362, ..., -0.44053182,
         1.08034131, -0.37407043],
       [-1.14867616,  1.48471233, -0.90523815, ..., -1.30850514,
        -1.08413797, -0.72104116],
       ...,
       [-0.63417328,  0.28372573,  1.05506665, ...,  0.4274415 ,
         1.51323716, -0.43189889],
       [-0.15778173, -0.2975004 , -0.79927572, ...,  0.4274415 ,
        ...]])
In [72]: X_test
Out[72]:
array([[-0.59606196, -0.05023846,  0.73717939, ..., -0.44053182,
        -0.65124211,  1.59209705],
       [ 0.14710886,  0.57915559,  1.58487876, ..., -1.59782958,
        -1.08413797, -1.12584035],
       ...,
       [-1.14867616,  0.69154738, -0.90523815, ..., -1.30850514,
         0.50314684,  0.89815559],
       [-0.50078365, -0.2461213 ,  0.41929213, ..., -0.44053182,
         1.22463993, -0.83669807]])
Pickling
import pickle

with open('scaler.pkl', 'wb') as f:   # open() call reconstructed; file name assumed
    pickle.dump(scaler, f)
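Loading it back later would then be symmetric (a sketch, same assumed file name):

with open('scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)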
Linear Regression
In [74]: from sklearn.linear_model import LinearRegression

linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
Out[76]: LinearRegression()

y_pred = linear_reg.predict(X_test)
y_pred
Out[78]: array([..., -0.32246875, 11.67767303])
Cost functions
In [79]: from sklearn.metrics import mean_squared_error
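The cell that actually computes the error is cut off in the export; standard usage would be:

# Sketch of the elided evaluation step:
mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse, "RMSE:", np.sqrt(mse))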
Test truth data and predicted data should follow a linear relationship {this is an indication of a good model}.
plt.scatter(y_test, y_pred)   # scatter call reconstructed; the top of this cell is cut off
plt.xlabel("Truth data")
plt.ylabel("Predicted data")
Residual distribution
residual_linear = y_test - y_pred   # definition reconstructed; cut off in the export
residual_linear.head()
Out[82]:
20412     1.317430
1296      2.798263
3906    -24.577479
20454     1.194867
5200     -3.652491

sns.displot(residual_linear, kind='kde')   # plotting call reconstructed; displot returns a FacetGrid
Out[83]: <seaborn.axisgrid.FacetGrid at 0x15bb818faf0>
Uniform distribution
Residuals vs predictions should follow a uniform distribution.
plt.scatter(y_pred, residual_linear)   # scatter call reconstructed; the top of this cell is cut off
plt.xlabel('Predictions')
plt.ylabel('Residuals')
Accuracy of the model with train data and with test data
In [85]: linear_reg.score(X_train, y_train)
Out[85]: 0.6849425651586472

In [86]: linear_reg.score(X_test, y_test)
Out[86]: 0.690139882731237
Performance Metrics
R Square and Adjusted R Square values. Adjusted R square = 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of test samples and p the number of features.
from sklearn.metrics import r2_score   # import and score line reconstructed; cut off in the export
r2_score_lr = r2_score(y_test, y_pred)
adjr2_score_lr = 1 - ((1 - r2_score_lr) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1))
Ridge Regression
In [89]: from sklearn.linear_model import Ridge

ridge_reg = Ridge()
ridge_reg.fit(X_train, y_train)
Out[91]: Ridge()

y_pred_r = ridge_reg.predict(X_test)
y_pred_r
Out[93]: array([..., -0.32214097, 11.67760058])
In [95]: ridge_reg.score(X_train, y_train)
Out[95]: 0.6849425631543848

In [96]: ridge_reg.score(X_test, y_test)
Out[96]: 0.6901402931343383
ridge_r2_score = r2_score(y_test, y_pred_r)   # line reconstructed; cut off in the export
ridge_adjr2_score = 1 - ((1 - ridge_r2_score) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1))
print("Adjusted R square accuracy is {} percent".format(round(ridge_adjr2_score * 100, 2)))
Lasso Regression
In [98]: from sklearn.linear_model import Lasso

lasso_reg = Lasso()
lasso_reg
Out[99]: Lasso()

lasso_reg.fit(X_train, y_train)
Out[100]: Lasso()

lasso_reg.coef_   # coefficient printout (input line cut off); Lasso shrinks some coefficients to zero
array([..., -0. ])

y_pred_lasso = lasso_reg.predict(X_test)
y_pred_lasso
Out[102]: array([..., 0.93010027, 11.86774416])
In [104]: lasso_reg.score(X_train, y_train)
Out[104]: 0.6693582767991895

In [105]: lasso_reg.score(X_test, y_test)
Out[105]: 0.6743540592319746
lasso_r2_score = r2_score(y_test, y_pred_lasso)   # line reconstructed; cut off in the export
lasso_adjr2_score = 1 - ((1 - lasso_r2_score) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1))
print("Adjusted R square accuracy is {} percent".format(round(lasso_adjr2_score * 100, 2)))
Elastic-Net Regression
In [107]: from sklearn.linear_model import ElasticNet

elastic_reg = ElasticNet()
elastic_reg
Out[108]: ElasticNet()

elastic_reg.fit(X_train, y_train)
Out[109]: ElasticNet()

elastic_reg.coef_   # coefficient printout (input line cut off)
array([..., -0. ])

In [111]: elastic_y_pred = elastic_reg.predict(X_test)
elastic_y_pred
Out[111]: array([..., 2.82884024, 10.9403686 ])
In [113]: elastic_reg.score(X_train, y_train)
Out[113]: 0.5870637265888694

In [114]: elastic_reg.score(X_test, y_test)
Out[114]: 0.5931641182844434
elastic_reg_r2_score = r2_score(y_test, elastic_y_pred)   # line reconstructed; cut off in the export
elastic_reg_adj_r2_score = 1 - ((1 - elastic_reg_r2_score) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1))
print("Adjusted R square accuracy is {} percent".format(round(elastic_reg_adj_r2_score * 100, 2)))
SVR
In [116]: from sklearn.svm import SVR

In [117]: svr = SVR()
svr
Out[117]: SVR()

In [118]: svr.fit(X_train, y_train)
Out[118]: SVR()

In [119]: svr_y_pred = svr.predict(X_test)
svr_y_pred
Out[119]: array([..., 0.21280761, 12.93475299])

In [121]: svr.score(X_train, y_train)   # input lines reconstructed from the pattern above
Out[121]: 0.7336778695440571

In [122]: svr.score(X_test, y_test)
Out[122]: 0.7310080566997736
svr_r2_score = r2_score(y_test, svr_y_pred)   # line reconstructed; cut off in the export
svr_adj_r2_score = 1 - ((1 - svr_r2_score) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1))
print("Adjusted R square accuracy is {} percent".format(round(svr_adj_r2_score * 100, 2)))
In [125]: model_params = {
    'Ridge Regression': {
        'model': Ridge(),
        'params': {
            'alpha': [1, 5, 10, 20]
        }
    },
    'Lasso Regression': {
        'model': Lasso(),
        'params': {
            'alpha': [1, 5, 10, 20]
        }
    },
    'Elastic-Net Regression': {
        'model': ElasticNet(),
        'params': {
            'alpha': [1, 5, 10, 20],
            'l1_ratio': [0.1, 0.5, 0.7, 1]   # l1_ratio must lie in [0, 1] for ElasticNet
        }
    },
    'SVR': {
        'model': SVR(),
        'params': {
            'C': [1, 5, 10, 20]
        }
    }
}
In [126]: model_params.items()
In [127]: ## scaling the independent features before fitting them inside the grid object
X1 = scaler.fit_transform(X)
In [128]: from sklearn.model_selection import GridSearchCV   # import assumed

scores = []
for model_name, mp in model_params.items():   # loop header reconstructed; cv value assumed
    clf = GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(X1, y)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })

df = pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])
df
Conclusion:
The SVR model with the 'rbf' kernel is the best model for this household power consumption data.