Regression Analysis Tutorial Excel Matlab
INTRODUCTION
Regression analysis can be used to identify the line or curve which provides the best fit
through a set of data points. This curve can be useful to identify a trend in the data, whether it is
linear, parabolic, or of some other form. Regression analysis can be performed using different
methods; this tutorial will explore the use of Excel and MATLAB for regression analysis. In
addition to fitting a curve to given data, regression analysis can be used in combination with
statistical techniques to determine the validity of data points within a data set. For example, the
standard deviation for a data set can easily be determined, and any data points existing outside of
the 3σ range can be reviewed to determine if they are valid points.
Exercise A-1
In Excel, generate a plot of the seven points given in Table 1. If you are unfamiliar with
Excel, detailed instructions on how to do this are given in Appendix A.
Exercise A-2
Using all data points in the set, use Excel tools to perform a linear regression on the data. To
do this, select the graph containing the data set, then select:
Chart
Add Trendline
Type
Trend/Regression type → Linear
Options
Select Display equation on chart
Select Display R-squared value on chart
OK
The graph will resemble Fig. 1. This plot shows the original data points along with the line
providing the best fit through the points. The equation for the line is also given.
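The trendline that Excel reports can be cross-checked in MATLAB. The lines below are a minimal sketch, assuming the seven points of Table 1 are the ones listed in the Appendix B script; the variable names `slope`, `intercept` and `r2` are chosen here for illustration only.

```matlab
% Data points (assumed to be the seven points of Table 1).
x = [2;4;8;11;14;18;21];
y = [3;5;7;7.5;8;9;12];

% First-order least squares fit; polyfit returns [slope, intercept].
p = polyfit(x, y, 1);
slope = p(1);
intercept = p(2);

% For a straight-line fit, R^2 equals the squared correlation coefficient.
R = corrcoef(x, y);
r2 = R(1,2)^2;

% Prints: y = 0.3929x + 2.9788, R^2 = 0.9286
fprintf('y = %.4fx + %.4f, R^2 = %.4f\n', slope, intercept, r2);
```

If the equation shown on the Excel chart differs from these values, the most likely cause is a data entry error in one of the two columns.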
Exercise A-3
Now create the same plot as in Exercise A-2, but remove the first and last data
points, and observe the changes to the regression line, its equation, and the R² value.
Exercise A-4
It is apparent from the fit of the line to the original data set that a linear regression may not be
the most accurate description of the trend existing in the data. The same Excel tools can be used
to perform regressions of higher order. For this exercise, a second order regression will be
performed over the full data set. To perform a second order regression, select:
Chart
Add Trendline
Type tab
Trend/Regression type → Polynomial, Order 2
Options tab
Select Display equation on chart
Select Display R-squared value on chart
OK
Exercise A-5
Now perform a second-order curve fit, but without including the first point of the data set.
Note how this compares to the original second-order curve.
Try another second-order curve fit, but without the last point of the data set. Note the
significant effect this has on the shape of the curve.
When performing a curve fit, especially with a small number of data points, note
that a single point can have an enormous effect on the result obtained.
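The same comparison can be sketched in MATLAB by fitting the second-order polynomial three times: once with all points, once without the first point, and once without the last. This is an illustrative sketch, not part of the Excel exercise; the data are assumed to be the Table 1 points as listed in Appendix B, and the names `pAll`, `pNoFirst` and `pNoLast` are hypothetical.

```matlab
% Data points (assumed to be the seven points of Table 1).
x = [2;4;8;11;14;18;21];
y = [3;5;7;7.5;8;9;12];

% Second-order fits over three different subsets of the data.
pAll     = polyfit(x, y, 2);                    % all seven points
pNoFirst = polyfit(x(2:end), y(2:end), 2);      % first point removed
pNoLast  = polyfit(x(1:end-1), y(1:end-1), 2);  % last point removed

% Evaluate each fit on a fine grid and overlay the curves on the data.
xs = linspace(0, 22, 200);
plot(x, y, 'k+', ...
     xs, polyval(pAll, xs), '-', ...
     xs, polyval(pNoFirst, xs), '--', ...
     xs, polyval(pNoLast, xs), ':')
legend('Data', 'All points', 'Without first point', 'Without last point', ...
       'Location', 'northwest')
```

Plotting the three curves on one set of axes makes the influence of each end point easy to see.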
Exercise A-6
Next, perform a third-order regression. To do this, follow the same sequence of commands as
given in Exercise A-4, but select Polynomial, Order 3 as the Trend/Regression type. Note the
shape of the curve, the equation of the line, and the goodness of fit.
Exercise A-7
From previous exercises, it has been seen that, as the order of the regression increases, the R²
value approaches 1. Now, continue to increase the order of the polynomial, as done
in Exercises A-4 and A-6. At what point does the R² value seem to reach 1?
Note that the R² value displayed on the chart may appear to be 1 only because the
number is rounded when it is displayed. Increasing the number of displayed decimal
places shows whether this is the case. To increase the number of decimal places,
right click on the region containing the equation and the R² value, then select
Format data labels
Number
Category → Number
Decimal places → Enter the desired number of decimal places
Even if the R² value equals 1, it must also be considered whether the curve fit makes physical
sense. For example, an object in free fall should have a position plot which is parabolic.
Therefore a second-order fit is desired, even though a higher-order curve might fit the points
more closely.
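The trend of R² with polynomial order can also be tabulated in MATLAB, which avoids the display rounding issue. The loop below is a sketch under two assumptions: the data are the Table 1 points as listed in Appendix B, and R² is computed as 1 - J/S, as in the appendix on the built-in regression tools. The centering and scaling output of `polyfit` is used only to keep the higher orders well conditioned; it does not change the fitted curve.

```matlab
% Data points (assumed to be the seven points of Table 1).
x = [2;4;8;11;14;18;21];
y = [3;5;7;7.5;8;9;12];

S = sum((y - mean(y)).^2);   % total sum of squares about the mean
r2 = zeros(6, 1);

for order = 1:6
    % The third output mu = [mean(x); std(x)] centers and scales x.
    [p, ~, mu] = polyfit(x, y, order);
    f = polyval(p, x, [], mu);            % fitted values at the data points
    r2(order) = 1 - sum((f - y).^2) / S;  % R^2 for this order
    fprintf('order %d: R^2 = %.10f\n', order, r2(order));
end
```

With seven points, a sixth-order polynomial passes through every point, so R² reaches 1 (to within round-off) at order 6 regardless of whether that curve makes physical sense.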
Exercise B-1
Plot the data set identified in Exercise A-1 in MATLAB. An example of how to do this is
given in Appendix B.
Least squares regression is used to determine the line of best fit through the data points. The
mathematical procedure for this method will now be reviewed.
Any curve which can be fit over a data set can be written as a function y, where
$$ y = f(x, a_j), \quad j = 1, 2, \ldots, m, \tag{1} $$
with m representing the number of coefficients required to create the curve of the specified order.
For example, the 3rd-order equation can be expressed in the general form
$$ y_i = a_1 + a_2 x_i + a_3 x_i^2 + a_4 x_i^3. \tag{2} $$
In equation (2), i = 1, 2, …, n indexes the n points to which this curve will be fit
(for this exercise n = 7), and $a_1$ through $a_4$ are the unknown coefficients $a_j$. These coefficients can
be found using the least squares regression method and matrix algebra.
The general formula for least squares regression is
$$ \sum_{i=1}^{n} \left( y_i - f(x_i, a_1 \ldots a_m) \right) \frac{\partial}{\partial a_j} f(x_i, a_1 \ldots a_m) = 0. \tag{3} $$
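For context, condition (3) is what results from minimizing the sum of squared differences between the data and the curve. The symbol E for that sum is introduced here for illustration and does not appear elsewhere in this tutorial.

```latex
E(a_1 \ldots a_m) = \sum_{i=1}^{n} \left( y_i - f(x_i, a_1 \ldots a_m) \right)^2
\qquad\Longrightarrow\qquad
\frac{\partial E}{\partial a_j}
  = -2 \sum_{i=1}^{n} \left( y_i - f(x_i, a_1 \ldots a_m) \right)
    \frac{\partial}{\partial a_j} f(x_i, a_1 \ldots a_m) = 0 .
```

Dividing by -2 gives (3), with one such equation for each coefficient j = 1, 2, …, m.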
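The second factor in (3) can be simplified by taking the partial derivative term by term.
Because f is a linear combination of the functions $g_1, \ldots, g_m$, only the term containing
$a_j$ survives, producing
$$ \frac{\partial f}{\partial a_j}(x_i, a_1 \ldots a_m) = \frac{\partial}{\partial a_j} \left[ a_1 g_1 + a_2 g_2 + \cdots + a_m g_m \right] = g_j(x_i). \tag{4} $$
After this partial differentiation, the general equation for least squares regression becomes
$$ \sum_{i=1}^{n} \left[ y_i - \sum_{k=1}^{m} a_k g_k(x_i) \right] g_j(x_i) = 0, \quad j = 1, 2, \ldots, m. \tag{5} $$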
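From the general equation in (5), the general form matrix can be built,
$$
\begin{bmatrix}
\sum_{i=1}^{n} g_1(x_i) g_1(x_i) & \cdots & \sum_{i=1}^{n} g_1(x_i) g_m(x_i) \\
\vdots & \ddots & \vdots \\
\sum_{i=1}^{n} g_m(x_i) g_1(x_i) & \cdots & \sum_{i=1}^{n} g_m(x_i) g_m(x_i)
\end{bmatrix}
\begin{bmatrix} a_1 \\ \vdots \\ a_m \end{bmatrix}
=
\begin{bmatrix}
\sum_{i=1}^{n} y_i g_1(x_i) \\ \vdots \\ \sum_{i=1}^{n} y_i g_m(x_i)
\end{bmatrix}.
\tag{6}
$$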
Finally, by taking the 3rd-order equation identified in (2) and defining the functions $g_j(x)$ as
shown in (7), the general form of the matrix can be populated and solved using linear algebra, so
that
$$ f(x_i, a_1 \ldots a_4) = a_1 g_1(x_i) + a_2 g_2(x_i) + a_3 g_3(x_i) + a_4 g_4(x_i), \tag{7} $$
where
$$ g_1(x) = 1, \quad g_2(x) = x, \quad g_3(x) = x^2, \quad \text{and} \quad g_4(x) = x^3. $$
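For the third-order regression being performed in this case, the matrix equation to be solved
is
$$
\begin{bmatrix}
n & \sum x_i & \sum x_i^2 & \sum x_i^3 \\
\sum x_i & \sum x_i^2 & \sum x_i^3 & \sum x_i^4 \\
\sum x_i^2 & \sum x_i^3 & \sum x_i^4 & \sum x_i^5 \\
\sum x_i^3 & \sum x_i^4 & \sum x_i^5 & \sum x_i^6
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix}
=
\begin{bmatrix}
\sum y_i \\ \sum y_i x_i \\ \sum y_i x_i^2 \\ \sum y_i x_i^3
\end{bmatrix},
\tag{8}
$$
where every sum runs over i = 1, 2, …, n.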
Use MATLAB to solve for the unknown coefficients in (8), and compare the resulting
coefficient values to those found using Excel. Also, plot
the fitted curve over the previously plotted data set in MATLAB. An example of a
program which can be used to do this is given in Appendix C. The resulting third-order
regression is shown in Fig. 2.
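As a compact alternative to writing out each sum by hand, the matrix in (6) can be assembled directly from the functions in (7). The sketch below assumes the Table 1 points are those listed in Appendix B; the names `G`, `A`, `B` and `a` are chosen here for illustration, and the result is checked against `polyfit` only as a sanity test.

```matlab
% Data points (assumed to be the seven points of Table 1).
x = [2;4;8;11;14;18;21];
y = [3;5;7;7.5;8;9;12];

% Column j of G holds g_j evaluated at every x_i: [1, x, x^2, x^3].
G = [ones(size(x)), x, x.^2, x.^3];

% Entry (j,k) of A is sum_i g_j(x_i)*g_k(x_i);
% entry j of B is sum_i y_i*g_j(x_i).
A = G' * G;
B = G' * y;

% Solve A*a = B for the coefficients a1..a4 (constant term first).
a = A \ B;

% polyfit lists the highest order first, so reverse it before comparing.
aPolyfit = flipud(polyfit(x, y, 3)');
disp([a, aPolyfit])
```

The top left entry of `A` equals the number of data points, 7, because $g_1(x) = 1$ for every point.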
[Plot: data set with the regression line y = 0.3929x + 2.9788 (R² = 0.9286) and the lower and upper bounds.]
Fig. 4. Use of standard deviation to determine validity of regression line or data points.
A detailed explanation of how this plot was produced can be found in Appendix E.
Note that this method assumes that the regression line is the “correct
answer” and finds the distribution of points around the line. Therefore, if a data point lies
outside this 3σ range, it could mean one of two things. First, the regression line could be valid,
in which case there is roughly a 99% chance that the data point itself is invalid. Second, the
regression line itself could be incorrect, and the data point is fine.
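The bounds can be sketched in MATLAB as follows. This is an assumption-laden illustration rather than the spreadsheet method used for Fig. 4: it assumes σ is the sample standard deviation of the residuals about the linear regression line, that the bounds are the line shifted by ±3σ, and that the data are the Table 1 points as listed in Appendix B. The names `sigma`, `lower`, `upper` and `outside` are hypothetical.

```matlab
% Data points (assumed to be the seven points of Table 1).
x = [2;4;8;11;14;18;21];
y = [3;5;7;7.5;8;9;12];

% Linear regression line evaluated at the data points.
p = polyfit(x, y, 1);
f = polyval(p, x);

% ASSUMPTION: sigma is the sample standard deviation of the residuals.
sigma = std(y - f);
lower = f - 3*sigma;   % lower bound of the 3-sigma band
upper = f + 3*sigma;   % upper bound of the 3-sigma band

% Flag any data point that falls outside the band.
outside = (y < lower) | (y > upper);
fprintf('%d point(s) outside the 3-sigma band\n', nnz(outside));

plot(x, y, 'k+', x, f, '-', x, lower, '--', x, upper, '--')
legend('Data set', 'Regression line', 'Lower bound', 'Upper bound', ...
       'Location', 'northwest')
```

Under these assumptions no point in this data set is flagged. With only seven points, a single residual cannot exceed about 2.27 sample standard deviations, so the 3σ test becomes informative only for larger data sets.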
This Appendix details how to plot the data points, as required by Exercise A-1.
First, enter the data points, given in Table 1, in two consecutive columns as shown in Fig. A-1.
% Enter the data points as column vectors.
x = [2;4;8;11;14;18;21];
y = [3;5;7;7.5;8;9;12];
% Plot the data points as black plus signs.
plot(x,y,'k+')
% Set the x and y axes limits so that all of the data points
% can be clearly seen.
xlim([0 22])
ylim([0 13])
% Display a title.
title('Data points')
The MATLAB program below can be used to perform a third-order regression on a set of 7 data
points. This program implements the least squares regression method, without using any of the
MATLAB built-in regression tools. This is not the most straightforward way to perform
regression in MATLAB, but it is helpful in better understanding the theory behind the technique.
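% Define the data points (the seven points of Table 1).
x = [2;4;8;11;14;18;21];
y = [3;5;7;7.5;8;9;12];
% Compute the sums required by the matrix equation in (8). These
% definitions are inferred from the variable names used in A and B below.
sumx = sum(x); sumx2 = sum(x.^2); sumx3 = sum(x.^3);
sumx4 = sum(x.^4); sumx5 = sum(x.^5); sumx6 = sum(x.^6);
sumy = sum(y); sumyx = sum(y.*x);
sumyx2 = sum(y.*x.^2); sumyx3 = sum(y.*x.^3);
% Define A and B.
% Note that the term in the top left corner of the A matrix is equal to the
% number of data points being used, 7 in this case.
A = [7,sumx,sumx2,sumx3;sumx,sumx2,sumx3,sumx4;sumx2,sumx3,sumx4,sumx5;...
sumx3,sumx4,sumx5,sumx6];
B = [sumy; sumyx; sumyx2; sumyx3];
% Solve A*Coeff = B for the coefficients a1 through a4.
Coeff = A\B;
% Plug the found values for the coefficients into the form for the fitted
% curve (a cubic equation, in this case):
curvex = linspace(0,25,26);
curvey = Coeff(1)+Coeff(2)*curvex+Coeff(3)*curvex.^2+Coeff(4)*...
curvex.^3;
% Create a string variable of the equation, to be used as the title for the
% plot:
equation = sprintf(...
'Equation of regression line: y = %f + %fx + %fx^2 + %fx^3',...
Coeff(1),Coeff(2),Coeff(3),Coeff(4));
% Plot the original data points along with the fitted curve:
plot(x,y,'k+',curvex,curvey)
title(equation)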
This Appendix provides instruction on performing regression in MATLAB, using the built-in
regression tools. The method for determining the R² value will also be covered.
The output of polyfit, p in this case, is a vector containing the coefficients of the
polynomial, starting with the highest order term. For example, the vector
p = [5 12 3 1]
represents the polynomial
5x^3 + 12x^2 + 3x + 1
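A short sketch of the coefficient ordering, assuming a third-order fit to the Table 1 points as listed in Appendix B. The variable `a` is introduced here only to show how the `polyfit` output maps onto the coefficients a1 through a4 of equation (2), which are listed with the constant term first.

```matlab
% Data points (assumed to be the seven points of Table 1).
x = [2;4;8;11;14;18;21];
y = [3;5;7;7.5;8;9;12];

% Third-order fit: p(1) multiplies x^3 and p(4) is the constant term.
p = polyfit(x, y, 3);

% Reverse the order to obtain a1..a4 (constant term first).
a = fliplr(p);

% The example vector from the text, evaluated at x = 2:
% 5*2^3 + 12*2^2 + 3*2 + 1 = 95
v = polyval([5 12 3 1], 2);
```

Keeping the two orderings straight matters when comparing the `polyfit` result with the coefficients shown on the Excel chart or found from the matrix equation.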
The command polyval can be used to evaluate the resulting polynomial so that it can be
plotted. The syntax for polyval is
f = polyval(p,x)
where p is the array containing the polynomial's coefficients, and x is the original vector of x-
values. The polynomial is therefore being evaluated at these x-values, and the result (f) is a
vector of the y-values. To plot the original data points along with the regression line, simply
enter
plot(x,y,'o',x,f,'-')
This will plot the original data points with small circles, and the polynomial curve fit as a
line.
To calculate the R² value, the mean, the J value (the sum of squared residuals), and the S value (the total sum of squares about the mean) must first be found. The mean is
simply found using
mu = mean(y)
The J value is
J = sum((f-y).^2)
Recall that f is the polynomial evaluated at the x-values, and y contains the original y-values
of the data points. The S value is
S = sum((y-mu).^2)
Using these, the R² value may then be calculated using
r2 = 1-J/S
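The steps above can be collected into a small helper and applied to fits of different order. This is a minimal sketch, assuming the Table 1 points are those listed in Appendix B; the helper name `rsq` and the variables `r2_linear` and `r2_cubic` are chosen here for illustration.

```matlab
% Data points (assumed to be the seven points of Table 1).
x = [2;4;8;11;14;18;21];
y = [3;5;7;7.5;8;9;12];

% Hypothetical helper: R^2 = 1 - J/S for data y and fitted values f.
rsq = @(y, f) 1 - sum((f - y).^2) / sum((y - mean(y)).^2);

% First-order fit; r2_linear is about 0.9286.
p1 = polyfit(x, y, 1);
r2_linear = rsq(y, polyval(p1, x));

% Third-order fit; its R^2 cannot be lower than that of the linear fit.
p3 = polyfit(x, y, 3);
r2_cubic = rsq(y, polyval(p3, x));

fprintf('linear: %.4f, cubic: %.4f\n', r2_linear, r2_cubic);
```

These values can be compared with the R² values that Excel displays on the chart for the corresponding trendlines.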
This Appendix details how the plot in Fig. 4 was produced. The Excel spreadsheet used is
shown in Fig. B-1.