Lab4 MultipleLinearRegression
Singapore Polytechnic
EM0442- Artificial Intelligence and Data Analytics for Aerospace
Lab 4: Multiple Linear Regression
1 Learning Objectives
When you have completed this lab, you should be able to:
1. Understand how to perform multiple linear regression (MLR) using least squares
2. Understand how to display the results of MLR.
3. Apply the method of p-values to determine whether the independent variables included in
the regression are significant.
4. Examine the correlation matrix for our data.
import numpy as np

Temp = [14.2,16.4,11.9,15.2,18.5,22.1,19.4,25.1,23.4,18.1,22.6,17.2]
Temp = np.array(Temp)
Income = [2680,4030,2170,4030,4900,6270,4850,7270,6500,4940,5460,5100]
Income = np.array(Income)
Sales = [215,325,185,332,406,522,412,614,544,421,445,408]
Sales = np.array(Sales)
Here the Temperature data is entered as a list, alongside the Sales vector and, new this time,
an Income vector; all three need to be converted to NumPy ndarrays.
If needed, a scatterplot can be drawn at this point, but in this lab it is not shown until the
end of the program.
2.2 Computation
In a similar fashion to our Simple Linear Regression example, we use a least squares method to
obtain the plane of best fit that explains the given data. As before, we create a matrix that
contains the independent variables. The leftmost column of this matrix must be a column of ones,
and this time we stack on two sets of variables. The same approach extends to more variables.
def MLRegress(Temp, Income, Sales):
    Z = np.ones(Temp.shape)     # start with a row of ones
    Z = np.vstack((Z, Temp))    # join vectors by vertical stacking
    Z = np.vstack((Z, Income))  # by vertical stacking again
    Z = Z.T                     # array of (number of data points) x (number of variables)
    b, resid, rank, sgl = np.linalg.lstsq(Z, Sales, rcond=None)
    return b
This time, the b vector is returned such that b[0] + b[1] * x1 + b[2] * x2 is the
three-dimensional (3D) plane that gives the best fit, in the least-squares-error (LSE) sense,
to the data.
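As a quick sanity check, the coefficients returned by MLRegress can be compared against the
normal-equations solution, which must agree with the least-squares fit. The sketch below is
self-contained, reusing the lab's data and function:

```python
import numpy as np

Temp = np.array([14.2,16.4,11.9,15.2,18.5,22.1,19.4,25.1,23.4,18.1,22.6,17.2])
Income = np.array([2680,4030,2170,4030,4900,6270,4850,7270,6500,4940,5460,5100])
Sales = np.array([215,325,185,332,406,522,412,614,544,421,445,408])

def MLRegress(Temp, Income, Sales):
    Z = np.ones(Temp.shape)     # column of ones for the intercept
    Z = np.vstack((Z, Temp))    # stack on the first independent variable
    Z = np.vstack((Z, Income))  # stack on the second independent variable
    Z = Z.T                     # (number of data points) x (number of variables)
    b, resid, rank, sgl = np.linalg.lstsq(Z, Sales, rcond=None)
    return b

b = MLRegress(Temp, Income, Sales)
print(b)  # [intercept, Temp coefficient, Income coefficient]
```

The predicted Sales for any (Temp, Income) pair is then b[0] + b[1]*Temp + b[2]*Income.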
Also note that we can only visualize the results in 3D space for two independent variables and
one dependent variable.
In order to show some meaningful output, we need to plot the predicted values obtained by the
least squares procedure in a 2D representation of a 3D plot, an isometric-style view.
3D surface (plane)
In this case, we cannot just supply two 1D sets of data and expect the surface to be plotted
for us. We are plotting a surface, and the coordinates of all the points on that surface need
to be generated.
1. The values of all x and y coordinates are generated first using the arange() function.
These values are purely for plotting, so they need not come from the original data set. These
data vectors are xmesh and ymesh.
2. These vectors are passed to the meshgrid() function, which generates a grid of coordinates.
In other words, the xmesh (Temp) vector is duplicated by the number of elements in the
ymesh (Income) vector, and similarly the other way round.
3. All the z-axis data is then generated in zmesh, so that zmesh contains the predicted Sales
value at every grid point and a surface can be plotted. The quality of the fit can then be
checked by comparing each computed value (zsb = b[0] + b[1]*xs[i] + b[2]*ys[i]) with the
corresponding actual value (zs[i]).
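The steps above can be sketched as follows. The mesh ranges and the coefficient values here
are illustrative placeholders, not taken from the lab sheet; in your program, b comes from
MLRegress and the ranges should cover your data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; use the default backend in the lab
import matplotlib.pyplot as plt

# Placeholder coefficients for illustration; use the b returned by MLRegress().
b = np.array([-100.0, 10.0, 0.05])

# Step 1: coordinate vectors, purely for plotting
xmesh = np.arange(11, 26, 1)        # Temp range
ymesh = np.arange(2000, 7500, 250)  # Income range

# Step 2: grid of (Temp, Income) coordinates
Xg, Yg = np.meshgrid(xmesh, ymesh)

# Step 3: predicted Sales at every grid point
Zg = b[0] + b[1] * Xg + b[2] * Yg

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(Xg, Yg, Zg, alpha=0.5)
ax.set_xlabel("Temp")
ax.set_ylabel("Income")
ax.set_zlabel("Sales")
fig.savefig("mlr_plane.png")
```

Scattering the original (Temp, Income, Sales) points on the same axes then shows how far each
actual value lies from the fitted plane.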
3 Backward elimination
We will now use the p-value to identify the significant independent variables. We will use the
statsmodels.api library, but only a few of its functions.
Modify the previous program by importing the library: add the line
import statsmodels.api as sm. You may either modify the code in the MLRegress function or copy
it to another function MLRegressOLS to easily compare results. Then add in the lines in BOLD
below, which will give you the summary statistics.
    b, resid, rank, sgl = np.linalg.lstsq(Z, Sales, rcond=None)
    est = sm.OLS(Sales, Z)
    est2 = est.fit()
    print(est2.summary())
    return b
Run the program again and look at the output of the OLS function.
Compare your result with that in the lecture slides. Note that the correlation coefficients
are all above 0.95, and yet one of the independent variables is not useful.
A delivery company is trying to predict the travel time for its drivers. To conduct an
analysis, 10 random samples collected from past trips are listed below.
44 1 3.57 4.8
77 3 3.57 6.4
80 3 3.03 7
66 2 3.51 5.6
109 5 3.54 7.3
76 3 3.25 6.4
i) Check the correlation between the dependent variable and each independent variable
individually.
ii) Investigate the independent variables: is there collinearity?
iii) Drop the variable with weak correlation and conduct the MLR analysis. What are your
conclusions?
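As a starting point for part (i), the full correlation matrix can be computed with
np.corrcoef. The sketch below uses only the six sample rows shown above, and the column
labels x1, x2, x3, y are generic placeholders, since the lab sheet does not name the columns:

```python
import numpy as np

# Six of the sample rows from the lab sheet; columns labelled generically as x1, x2, x3, y.
data = np.array([
    [44,  1, 3.57, 4.8],
    [77,  3, 3.57, 6.4],
    [80,  3, 3.03, 7.0],
    [66,  2, 3.51, 5.6],
    [109, 5, 3.54, 7.3],
    [76,  3, 3.25, 6.4],
])

# rowvar=False: each column is a variable, each row is an observation
R = np.corrcoef(data, rowvar=False)
print(np.round(R, 3))
```

The last row (or column) of R gives the correlation of each independent variable with the
dependent variable; the off-diagonal entries among the first three columns reveal any
collinearity between the independent variables.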