Multiple Linear Regression Using Python Machine Learning: Kaleab Woldemariam, June 2017

1) The document describes using multiple linear regression in Python to predict net primary productivity (NPP) in Ethiopia based on climate and land use data. 2) Independent variables included precipitation, land cover, fraction of photosynthetically active radiation, elevation, temperature, vapor pressure and water stress index. 3) The model was trained on 75% of the data and tested on 25% to predict NPP values and assess accuracy. Cross-validation was also used, yielding a prediction R2 value of 70.2%, meaning the predictors explained 70.2% of the variance in NPP.


Multiple Linear Regression using Python Machine Learning

Objective: The objective of this exercise is to predict Net
Primary Productivity (NPP), a major indicator of ecosystem health,
from climate and land use data for the Upper Blue Nile Basin,
Ethiopia, East Africa (Figure 1). NPP is derived from Gross Primary
Productivity (GPP), an ecosystem-level parameter that refers to
the rate at which green plants produce organic matter by
assimilating carbon dioxide using solar energy through
photosynthesis (Liang et al., 2012). Net Primary Productivity is
the difference between GPP and plant autotrophic respiration.
Approximately 50% of the organic matter generated by gross primary
production is released into the atmosphere through plant
respiration. The other half, which constitutes NPP, is the biomass
produced in a given time (Liang et al., 2012). The following
variables were used:
• The NPP dataset (dependent variable) for the years 2001 to 2010
  was downloaded from NASA's Reverb/ECHO website. Data from 2001
  was taken for regression analysis.
• Precipitation: GPCC (Global Precipitation Climatology Centre)
  raster image.
• Land use/land cover classification images for 2001 and 2010
  were acquired from MODIS Land Cover (MCD12Q1) via Reverb/ECHO.
• Fraction of Photosynthetically Active Radiation (fAPAR): SPOT
  satellite, AfSIS raster image (ftp://africagrids.org).
• Digital Elevation Model (DEM): ftp://srtm.csi.cgiar.org.
• Minimum Temperature, Vapor Pressure and WSI (Water Stress Index),
  derived from Potential Evapotranspiration and Actual
  Evapotranspiration of the CRU 3.22 Time-Series data (Climate
  Research Unit, University of East Anglia).


Figure 1: Location of the Study Area.

In this exercise, a total of 2,377 random sample points were
collected from the raster data using ArcGIS 10.3. I used the Pandas
module to load the comma-delimited (csv) file, the NumPy module to
convert the data into arrays, scikit-learn to compute the multiple
linear regression and the Matplotlib module to plot the results.

Certain assumptions about the dataset must be met before
conducting multiple linear regression. In ecological studies,
both the statistical and the spatial context must be considered in
modeling. For simplicity, the statistical assumptions are taken as
met here. Multiple linear regression assumes:

(i) Normality
(ii) Homogeneity of Variance
(iii) Fixed X (X represents explanatory variables)
(iv) Independence
(v) Correct model specification (Zuur et al., 2007).

Note that the land use-land cover (LULC) data were categorical and
needed to be converted to dummies (0/1 values). I used a Pandas
function, pd.get_dummies, to encode the nominal LULC data so that
it could be included in predicting NPP.

To separate the numerical and categorical data, I used one
pandas DataFrame, data1, for the numerical independent variables
(Precipitation, fAPAR, Minimum Temperature, WSI, Vapor Pressure
and Elevation) and another, dummies, for the categorical LULC
classes. The two were then joined and converted to a NumPy array
"X". The dependent variable NPP2001 was also converted to an
array "y" using NumPy.
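
The encoding and joining steps can be sketched on a toy table. This is only an illustration, assuming a hypothetical layout in which land cover is stored as a single nominal column named LULC; the column names and values below are invented and are not taken from the author's CSV file.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two numerical features and one nominal land-cover column.
df = pd.DataFrame({
    "precip": [1200.0, 950.0, 1430.0, 1010.0],
    "elevation": [1850.0, 2300.0, 1600.0, 2100.0],
    "LULC": ["Forest", "Croplands", "Forest", "Grasslands"],
})

# Numerical features are kept as they are.
data1 = df[["precip", "elevation"]]

# One 0/1 column per land-cover class.
dummies = pd.get_dummies(df["LULC"], dtype=int)

# Join on the shared row index, then convert to a NumPy array.
joined = data1.join(dummies)
X = np.array(joined)

print(list(joined.columns))
print(X.shape)
```

Each row has exactly one dummy set to 1, so the class information is preserved without implying any ordering between the classes.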

Train/Test
The model is trained on data with known outputs and then tested on
held-out test data, to estimate how well it generalizes to data it
was not trained on. The training data (X_train, y_train) is used
to fit the regression model (make a linear model). The test data
(X_test, y_test) is used to check the prediction ability
(accuracy) of the model. The fitted model is then used to predict
NPP2001 from the independent variables.

'''Regression for predicting NPP using features (independent variables) in Machine Learning'''

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, KFold, cross_val_score, cross_val_predict

style.use('ggplot')
raw_data='mydata_2001_2377_BlueNile.csv'
df = pd.read_csv(raw_data)
# Create a DataFrame for numerical features
data1 = pd.DataFrame(df,
columns=['b1_PG2001','SPTFPR2001','b1_Tmn','X2001WSI','b1_Vap','Elevation'])
print(data1.shape)

# Create a DataFrame for categorical features

cols_to_transform = pd.DataFrame(df, columns=['Forest', 'Closed_Shrublands', 'Open_Shrublands',
                                              'Woody_Savannas', 'Savannas', 'Grasslands', 'Croplands'])
dummies = pd.get_dummies(cols_to_transform)
# Join data1 and dummies (pandas) and convert the result to a NumPy array
X = np.array(data1.join(dummies))

# Specify the dependent variable as array

y = np.array(df['NPP2001'])

lm = LinearRegression(n_jobs=-1)

# To check the accuracy/confidence level of the prediction,
# 25% of the data is held out for testing, while 75% is used for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
# First we fit a model
model = lm.fit(X_train, y_train)
# Print the coefficients
print("The linear coefficients", model.coef_)
# Predict y (NPP_Predict) from the test-data features (independent variables, X_test)
predictions = lm.predict(X_test)
# Score of the prediction: R2 on the test set
confidence = lm.score(X_test, y_test)
print("This is predicted NPP2001 Values", predictions)
print("This is the prediction accuracy (R2)", confidence)
# Scatter plot of actual against predicted values
plt.title("Actual NPP2001 vs. NPP2001_Predict", size=10)
plt.scatter(y_test, predictions, color='c', marker='.')
plt.xlabel("Actual NPP2001", size=10)
plt.ylabel("NPP2001_Predict", size=10)
plt.show()

# Residual plot to inspect homogeneity of variance
plt.title("Homogeneity of Variance")
plt.scatter(y_test, y_test - predictions)
plt.xlabel("Actual NPP2001")
plt.ylabel("Residual")
plt.show()
# Perform 10-fold cross-validation (KFold)
scores = cross_val_score(model, X, y, cv=10)
print("Cross Validated Scores", scores)
kf = KFold(n_splits=10, random_state=None, shuffle=True)
# Show which sample indices fall in the training and test part of each fold
for train_index, test_index in kf.split(X):
    print("TRAIN", train_index, "TEST", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# Make cross-validated predictions
predictions2 = cross_val_predict(model, X, y, cv=10)
# Check the R2: the proportion of variance in the dependent variable explained by the predictors
accuracy = metrics.r2_score(y, predictions2)
print("This is R2", accuracy)
plt.scatter(y, predictions2, color='c', marker='.')
plt.xlabel("Actual NPP2001", size=10)
plt.ylabel("NPP2001_Predict", size=10)
plt.title("Actual and Predicted NPP2001 Values using 10 Fold Cross Validation", size=10)
plt.show()

The steps used so far are:

• Load the data.
• Convert categorical variables to dummies and join them to the
  numerical variables.
• Split the sample (2,377 points) into training and test sets.
• Use the training data to fit a regression model.
• Make predictions based on the X_test data.
• Compute the accuracy of the prediction (score).

A single train/test split is not enough to guarantee that the
samples are random and representative. If they are not, the
result may be overfitting. Overfitting means the model is "too
well trained": it fits the training set very closely, typically
because it uses too many predictors, but it fails on new data it
was not trained on. In that case we cannot make reliable
inferences from the model.
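
The effect of too many predictors can be illustrated with a small synthetic experiment. This is a sketch on made-up data, not the NPP dataset: one informative predictor is combined with a chosen number of pure-noise predictors, an ordinary least squares model is fitted on 30 training points, and R2 is compared on the training data and on 200 unseen points. The function name and the sample sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def r2(y_true, y_pred):
    # R2 = 1 - residual sum of squares / total sum of squares
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_and_score(n_noise, n_train=30, n_test=200):
    """Fit OLS with one real predictor plus n_noise noise predictors."""
    n = n_train + n_test
    x_real = rng.normal(size=(n, 1))
    noise_cols = rng.normal(size=(n, n_noise))
    y = 2.0 * x_real[:, 0] + rng.normal(size=n)
    # Design matrix with an intercept column.
    X = np.hstack([np.ones((n, 1)), x_real, noise_cols])
    X_tr, X_te = X[:n_train], X[n_train:]
    y_tr, y_te = y[:n_train], y[n_train:]
    beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return r2(y_tr, X_tr @ beta), r2(y_te, X_te @ beta)

train_small, test_small = fit_and_score(n_noise=0)
train_big, test_big = fit_and_score(n_noise=25)

print(f"1 predictor:   train R2 = {train_small:.2f}, test R2 = {test_small:.2f}")
print(f"26 predictors: train R2 = {train_big:.2f}, test R2 = {test_big:.2f}")
```

With 26 predictors and only 30 training points, the training score should look excellent while the test score collapses, which is the gap that a held-out set or cross-validation is meant to expose.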

A cross-validation method called K-Folds Cross-Validation is used
to split the sample into k different subsets (or folds). In each
round, k-1 subsets are used to train the model and the remaining
subset is held out as test data; this is repeated until every
fold has served once as the test set. The scores from the folds
are then averaged to assess the model. Cross-validated predictions
are made by supplying the cross_val_predict function with the
model, X (the full set of independent variables, not only X_test),
y (the dependent variable) and cv (the number of cross-validation
folds). Because each sample is predicted once, when it falls in
the held-out fold, the plot contains one point per sample.
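
How the folds partition the sample can be checked on synthetic data. The sketch below assumes 50 made-up points and 3 made-up features, not the NPP dataset. It counts how often each point lands in a test fold and compares the number of cross-validated predictions with the number of samples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

kf = KFold(n_splits=10, shuffle=True, random_state=0)

# Count how many times each sample is used as test data.
times_tested = np.zeros(len(y), dtype=int)
for train_index, test_index in kf.split(X):
    times_tested[test_index] += 1

# One out-of-fold prediction per sample.
predictions = cross_val_predict(LinearRegression(), X, y, cv=kf)

print(times_tested.min(), times_tested.max())
print(predictions.shape, y.shape)
```

Every point is tested exactly once, so the array of cross-validated predictions has the same length as y.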
#Perform 10 fold Cross Validation (KFold)
scores=cross_val_score(model,X,y,cv=10)
print ("Cross Validated Scores",scores)

Cross Validated Scores [ 0.34638801 0.56139146 0.61525375 0.7076254 0.70162425
 0.49563864 0.61883974 0.52543957 0.33933734 0.10156286]
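
The ten fold scores reported above can be summarized with a few NumPy calls. The snippet simply re-enters the printed values. Note that the mean of the per-fold scores need not equal an R2 computed once over all pooled cross-validated predictions.

```python
import numpy as np

# Fold scores as printed in the output above.
scores = np.array([0.34638801, 0.56139146, 0.61525375, 0.7076254, 0.70162425,
                   0.49563864, 0.61883974, 0.52543957, 0.33933734, 0.10156286])

mean_score = scores.mean()
std_score = scores.std()

print(f"mean = {mean_score:.3f}, std = {std_score:.3f}")
print(f"min = {scores.min():.3f}, max = {scores.max():.3f}")
```

The spread between the weakest and strongest fold shows how much a single split can differ from another.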

# Make Cross Validated predictions

predictions2=cross_val_predict(model,X,y,cv=10)

Finally, the R2 (the proportion of variance explained by the
predictors) is given by:

accuracy=metrics.r2_score(y,predictions2)

The result indicates that the predictors account for 70.2% of the
variance in Net Primary Productivity for the year 2001.
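
For reference, R2 is commonly defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the observations from their mean. A minimal NumPy sketch with made-up numbers (not NPP values) shows the two reference points: a perfect prediction scores 1, and always predicting the mean scores 0.

```python
import numpy as np

def r2(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r2(y_true, y_true)                  # every point predicted exactly
baseline = r2(y_true, [6.0, 6.0, 6.0, 6.0])   # always predict the mean
close = r2(y_true, [3.5, 4.5, 7.5, 8.5])      # each point off by 0.5

print(perfect, baseline, close)
```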

The linear equation:

1.274E-04*(b1_PG2001) + 2.314E-03*(SPTFPR2001) - 1.147E-01*(b1_Tmn)
+ 8.877E-01*(X2001WSI) + 1.326E-01*(b1_Vap) - 3.43E-05*(Elevation)
- 1.0E-01*(Forest) - 1.27*(Closed_Shrublands) - 9.79*(Open_Shrublands)
- 1.019E-01*(Woody_Savannas) - 9.549E-02*(Savannas) - 1.0422*(Grasslands)
- 1.22E-02*(Croplands)
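
To show how the terms combine, the reported coefficients can be placed in a dictionary and summed for one sample. This is only a sketch: no intercept is reported in the equation, so the function returns the sum of the coefficient terms rather than a full NPP prediction, and the sample values below are invented. The function name linear_terms is hypothetical.

```python
# Coefficients as reported in the equation above (no intercept is given there).
coefficients = {
    "b1_PG2001": 1.274e-04,
    "SPTFPR2001": 2.314e-03,
    "b1_Tmn": -1.147e-01,
    "X2001WSI": 8.877e-01,
    "b1_Vap": 1.326e-01,
    "Elevation": -3.43e-05,
    "Forest": -1.0e-01,
    "Closed_Shrublands": -1.27,
    "Open_Shrublands": -9.79,
    "Woody_Savannas": -1.019e-01,
    "Savannas": -9.549e-02,
    "Grasslands": -1.0422,
    "Croplands": -1.22e-02,
}

def linear_terms(sample):
    """Sum coefficient * value over all features; missing features count as 0."""
    return sum(coef * sample.get(name, 0.0) for name, coef in coefficients.items())

# Hypothetical cropland sample; the feature values are invented.
sample = {"b1_PG2001": 1000.0, "Elevation": 2000.0, "Croplands": 1.0}

print(linear_terms(sample))
```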


References

https://www.medium.com/towards-data-science/train-test-split-and-cross-validation-in-python-80b61beca4b6
retrieved on June 28, 2017.

Liang, S., Li, X., Wang, J., 2012. Advanced Remote Sensing:
Terrestrial Information Extraction and Applications, Academic
Press, pp. 800.

Zuur, A. K., Ieno, E. N., Smith, G. M., 2007. Statistics for
Biology and Health: Analyzing Ecological Data, Springer Science +
Business Media, LLC.
