Multiple Linear Regression Using Python Machine Learning: Kaleab Woldemariam, June 2017
(i) Normality
(ii) Homogeneity of Variance
Note that the land use-land cover (LULC) data were categorical and
needed to be converted to dummy variables (0/1 values). I used the
Pandas function pd.get_dummies to encode the nominal LULC data so
that it could be included in predicting NPP.
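As a minimal sketch of the dummy-coding step, the snippet below applies pd.get_dummies to a small made-up table. The class labels ("forest", "cropland", "urban"), the "rainfall" column, and the choice of drop_first=True are assumptions for illustration only; they are not taken from the original dataset.

```python
import pandas as pd

# Hypothetical sample data: the LULC classes and the 'rainfall' column are illustrative.
df = pd.DataFrame({
    "LULC": ["forest", "cropland", "urban", "forest"],
    "rainfall": [1200.0, 800.0, 650.0, 1100.0],
})

# One 0/1 indicator column per LULC class. drop_first=True drops the first
# class (alphabetically, 'cropland') so the dummies are not perfectly
# collinear with the regression intercept.
dummies = pd.get_dummies(df["LULC"], prefix="LULC", drop_first=True, dtype=int)

# Replace the nominal column with its indicator columns.
X = pd.concat([df.drop(columns="LULC"), dummies], axis=1)
print(X)
```

Dropping one class makes it the reference category: each remaining dummy coefficient is then read as the difference in predicted NPP relative to that class.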
Train/Test
The model is trained on data with known outputs, then evaluated on
held-out test data to check how well it generalizes to data it was
not trained on. The test data is used to assess the predictive
ability (accuracy) of the model. The training data (X_train, y_train)
is used to fit the regression model (that is, to build the linear
model). The fitted model is then used to predict NPP2001 from the
independent variables.
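The train/test workflow described above can be sketched as follows. This is an illustration on synthetic data, not the original analysis: the random predictors, the coefficients, the 80/20 split via train_test_split, and the random_state values are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictors and NPP2001 (assumed, not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows for testing; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training data only, then predict the held-out rows.
lm = LinearRegression()
model = lm.fit(X_train, y_train)
predictions = model.predict(X_test)

# R^2 on the test set measures how well the model generalizes.
test_r2 = model.score(X_test, y_test)
print("Test R^2:", round(test_r2, 3))
```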
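import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing, svm
from sklearn.preprocessing import StandardScaler
from sklearn import model_selection, metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold
# Ordinary least-squares regression; n_jobs=-1 uses all available CPU cores
lm = LinearRegression(n_jobs=-1)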
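# Residual plot: residuals (actual minus predicted) against actual NPP2001
plt.scatter(y_test, y_test - predictions, label="Residual")
plt.title("Homogeneity of Variance")
plt.xlabel("Actual NPP2001")
plt.ylabel("Residual")
plt.legend(loc=4)  # lower right; called after plotting so the labeled points appear
plt.show()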
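# Perform 10-fold cross-validation (KFold)
scores = cross_val_score(model, X, y, cv=10)
print("Cross Validated Scores", scores)
kf = KFold(n_splits=10, random_state=None, shuffle=True)
for train_index, test_index in kf.split(X):
    pass  # per-fold body is not shown in the source
# R^2 of the cross-validated predictions (predictions2) against the observed values
accuracy = metrics.r2_score(y, predictions2)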
The result (R² = 0.702) indicates that the predictors account for
70.2% of the variance in Net Primary Productivity (NPP) for the year
2001.
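To show how a cross-validated R² of this kind can be obtained and read as "share of variance explained", here is a minimal sketch on synthetic data. The use of cross_val_predict to produce predictions2 is an assumption (the source does not show how predictions2 was computed), as are the data, coefficients, and random_state.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in data (assumed, not the NPP dataset).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = 3.0 * X[:, 0] + X[:, 1] + rng.normal(scale=1.0, size=300)

# 10 shuffled folds; each row is predicted by a model that was not trained on it.
kf = KFold(n_splits=10, shuffle=True, random_state=1)
predictions2 = cross_val_predict(LinearRegression(), X, y, cv=kf)

# R^2 from scikit-learn ...
accuracy = r2_score(y, predictions2)

# ... and the same quantity by hand: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y - predictions2) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot

print("Cross-validated R^2:", round(accuracy, 3))
```

An R² of 0.702 computed this way would mean the residual sum of squares is 29.8% of the total sum of squares around the mean.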
Reference
https://www.medium.com/towards-data-science/train-test-split-and-cross-validation-in-python-80b61beca4b6
Retrieved June 28, 2017.