
Lecture 11 Regression


Computational Methods in

Chemical Engineering

LECTURE 11
LINEAR AND NONLINEAR REGRESSION
CURVE FITTING

ÖZGE KÜRKÇÜOĞLU-LEVİTAS
Linear Least-Squares Regression

 Derive an approximating function that fits the shape or general trend of the data without necessarily matching the individual points.
 One approach is the least-squares method.
 The simplest case is fitting a straight line:

$$y = a_0 + a_1 x + e$$
 $a_0$: intercept
 $a_1$: slope
 $e$: error or residual, $e = y - a_0 - a_1 x$, the discrepancy between the true value of $y$ and the approximate value $a_0 + a_1 x$ predicted by the linear equation
 The best line through the data is the one that minimizes the sum of the squares of the residuals (errors):

$$S_r = \sum_{i=1}^{n} \left(y_i - a_0 - a_1 x_i\right)^2$$

n: total number of points

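 To determine $a_0$ and $a_1$, minimize $S_r$ by setting its partial derivatives to zero:

$$\frac{\partial S_r}{\partial a_0} = -2\sum_{i=1}^{n}\left(y_i - a_0 - a_1 x_i\right) = 0$$
$$\frac{\partial S_r}{\partial a_1} = -2\sum_{i=1}^{n}\left[\left(y_i - a_0 - a_1 x_i\right)x_i\right] = 0$$
 Here, $\sum a_0 = n a_0$.

 Then we obtain 2 equations with 2 unknowns, called the normal equations:
$$n a_0 + \left(\sum x_i\right) a_1 = \sum y_i$$
$$\left(\sum x_i\right) a_0 + \left(\sum x_i^2\right) a_1 = \sum x_i y_i$$
Solving them gives
$$a_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad a_0 = \bar{y} - a_1 \bar{x}$$
where $\bar{x}$ and $\bar{y}$ are the mean values of $x$ and $y$.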


Example: fit a straight line to data

$$y = a_0 + a_1 x$$
Calculate the means $\bar{x}$ and $\bar{y}$, then the slope and intercept from the normal equations:

$$y = -234.3 + 19.5x$$
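The straight-line example can be reproduced by assembling and solving the normal equations directly. The sketch below assumes the data are the eight (x, y) points used in the MATLAB sessions later in this lecture; the variable names `N`, `rhs` and `a0_check` are illustrative only.

```matlab
% Data assumed to be the eight points from the MATLAB examples in this lecture
x = [10 20 30 40 50 60 70 80];
y = [25 70 380 550 610 1220 830 1450];
n = length(x);

% Sums that appear in the normal equations
sx  = sum(x);     sy  = sum(y);
sx2 = sum(x.^2);  sxy = sum(x.*y);

% Normal equations in matrix form: [n sx; sx sx2]*[a0; a1] = [sy; sxy]
N   = [n sx; sx sx2];
rhs = [sy; sxy];
a   = N\rhs;               % a(1) = a0 (intercept), a(2) = a1 (slope)
a0  = a(1);
a1  = a(2);

% Equivalent closed form for the intercept using the means
a0_check = mean(y) - a1*mean(x);

fprintf('a0 = %.4f, a1 = %.4f\n', a0, a1);
```

Under these assumptions the result should agree with the line quoted above, y = −234.3 + 19.5x.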
Quantification of Error of Linear Regression
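 Sum of the squares of the residuals around the regression line:
$$S_r = \sum_{i=1}^{n} \left(y_i - a_0 - a_1 x_i\right)^2$$
 Discrepancy between the data and the mean:
$$S_t = \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2$$

Standard deviation for the regression line (standard error of the estimate):
$$s_{y/x} = \sqrt{\frac{S_r}{n-2}}$$

Standard deviation of the data (spread of the data around the mean):
$$s_y = \sqrt{\frac{S_t}{n-1}}$$

If $s_{y/x} < s_y$, the linear regression describes the data better than the mean alone.

[Figure: two linear fits, one with small residual error and one with large residual error]

Coefficient of determination:
$$r^2 = \frac{S_t - S_r}{S_t}$$
Correlation coefficient: $r = \sqrt{r^2}$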

A perfect fit:
$S_r = 0$ and $r^2 = 1$,
 the line explains 100% of the variability of the data.
 IMPORTANT!
 Just because $r^2$ is close to 1 does not mean that the fit is necessarily good. It is possible to obtain a high $r^2$ even when the relationship between x and y is not linear.

[Figure: Anscombe's four data sets along with the best-fit line, y = 3 + 0.5x]
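The error measures in this section can be evaluated with a few lines of MATLAB. This is a minimal sketch that assumes the same eight data points used in the MATLAB sessions of this lecture and uses polyfit for the straight-line fit; the variable names are illustrative.

```matlab
% Data assumed to be the eight points from the MATLAB examples in this lecture
x = [10 20 30 40 50 60 70 80];
y = [25 70 380 550 610 1220 830 1450];
n = length(x);

p  = polyfit(x, y, 1);                 % p(1) = slope a1, p(2) = intercept a0
Sr = sum((y - polyval(p, x)).^2);      % sum of squares around the regression line
St = sum((y - mean(y)).^2);            % sum of squares around the mean

syx = sqrt(Sr/(n-2));                  % standard error of the estimate
sy  = sqrt(St/(n-1));                  % standard deviation of the data
r2  = (St - Sr)/St;                    % coefficient of determination
r   = sqrt(r2);                        % correlation coefficient (positive slope here)

fprintf('syx = %.2f, sy = %.2f, r2 = %.4f, r = %.4f\n', syx, sy, r2, r);
```

Here the standard error of the estimate comes out smaller than the standard deviation of the data, so the line is an improvement over the mean, even though roughly 12% of the variability remains unexplained.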
Linearization of Nonlinear Relationships

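 The relationship between x and y is not always linear.
 The 1st step in any regression analysis is to plot and visually inspect the data.
 Some nonlinear models can be linearized by a transformation:

 exponential model: $y = a_1 e^{b_1 x}$, linearized as $\ln y = \ln a_1 + b_1 x$
 power model: $y = a_2 x^{b_2}$, linearized as $\log y = \log a_2 + b_2 \log x$
 saturation-growth-rate model: $y = a_3 \dfrac{x}{b_3 + x}$, linearized as $\dfrac{1}{y} = \dfrac{1}{a_3} + \dfrac{b_3}{a_3}\dfrac{1}{x}$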
Example:

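 Fit the power equation $y = a_2 x^{b_2}$ to the data by making a logarithmic transformation (linearization):
$$\log y = \log a_2 + b_2 \log x$$

 Compute the mean values of $\log x$ and $\log y$ and fit a straight line to the transformed data:
$$\log y = -0.5620 + 1.9842 \log x$$

The power equation is then
$$y = 10^{-0.5620}\, x^{1.9842} = 0.2741\, x^{1.9842}$$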

How to solve in MATLAB?

>> x = [10 20 30 40 50 60 70 80];
>> y = [25 70 380 550 610 1220 830 1450];
>> [r,m,b] = regression(x,y)
r =
    0.9383    % also calculate r^2 = 0.8805
m =
   19.4702    % a1
b =
 -234.2857    % a0

>> plotregression(x,y)

[Figure: regression plot of the data with the line y = −234.3 + 19.5x]

 OR, fit a linear logarithmic equation by,
>> [r,m,b] = regression(log10(x),log10(y))
r =
    0.9737
m =
    1.9842
b =
   -0.5620
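The `regression` and `plotregression` functions belong to an add-on toolbox, so they may not be available in every installation. As a sketch of an alternative that uses only base MATLAB, the same log-log fit can be obtained with polyfit and converted back to the power-model coefficients; the names `a2` and `b2` follow the power model $y = a_2 x^{b_2}$.

```matlab
x = [10 20 30 40 50 60 70 80];
y = [25 70 380 550 610 1220 830 1450];

% Straight-line fit of log10(y) against log10(x)
p = polyfit(log10(x), log10(y), 1);

b2 = p(1);        % slope of the transformed line = exponent of the power model
a2 = 10^p(2);     % intercept is log10(a2), so undo the logarithm

fprintf('log10(y) = %.4f + %.4f*log10(x)\n', p(2), b2);
fprintf('y = %.4f * x^%.4f\n', a2, b2);
```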
function [a, r2] = linregr(x,y)
% linregr: linear regression curve fitting
% [a, r2] = linregr(x,y): Least squares fit of straight
% line to data by solving the normal equations
% input:
% x = independent variable
% y = dependent variable
% output:
% a = vector of slope, a(1), and intercept, a(2)
% r2 = coefficient of determination
n = length(x);
if length(y)~=n, error('x and y must be same length'); end
x = x(:); y = y(:); % convert to column vectors
sx = sum(x); sy = sum(y);
sx2 = sum(x.*x); sxy = sum(x.*y); sy2 = sum(y.*y);
a(1) = (n*sxy-sx*sy)/(n*sx2-sx^2);
a(2) = sy/n-a(1)*sx/n;
r2 = ((n*sxy-sx*sy)/sqrt(n*sx2-sx^2)/sqrt(n*sy2-sy^2))^2;
% create plot of data and best fit line
xp = linspace(min(x),max(x),2);
yp = a(1)*xp+a(2);
plot(x,y,'o',xp,yp)
grid on

(Source: Chapra, 3rd ed.)
 The built-in function polyfit fits a least-squares nth-order polynomial to data:
>> p = polyfit(x, y, n)
For our example n = 1, since a straight line is a 1st-order polynomial.

>> x = [10 20 30 40 50 60 70 80];
>> y = [25 70 380 550 610 1220 830 1450];
>> a = polyfit(x,y,1)
a =
   19.4702 -234.2857    % y = −234.3 + 19.5x

>> z = polyval(a,45)    % evaluate y at x = 45 using the coefficients in a
z =
  641.8750
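Function:
 power: $y = b x^m$
 exponential: $y = b e^{mx}$ or $y = b \cdot 10^{mx}$
 logarithmic: $y = m \ln(x) + b$ or $y = m \log(x) + b$
 reciprocal: $y = 1/(mx + b)$

First rewrite the functions in a form that can be fitted with a linear polynomial (order 1), that is, a straight line $Y = mX + b$ in transformed variables:

 power: $\ln(y) = m \ln(x) + \ln(b)$
 exponential: $\ln(y) = mx + \ln(b)$ or $\log(y) = mx + \log(b)$
 logarithmic: already linear in $\ln(x)$ or $\log(x)$
 reciprocal: $1/y = mx + b$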
For a given data set it is often possible to anticipate which of the functions has the potential to provide a good fit. This is done by plotting the data using different combinations of linear and logarithmic axes and looking for the combination in which the points fall on a straight line.

x-axis         y-axis               Function
linear         linear               $y = mx + b$
logarithmic    logarithmic          $y = b x^m$
linear         logarithmic          $y = b e^{mx}$ OR $y = b \cdot 10^{mx}$
logarithmic    linear               $y = m \ln(x) + b$ OR $y = m \log(x) + b$
linear         linear (plot 1/y)    $y = 1/(mx + b)$


Function                               polyfit call
power         $y = b x^m$              polyfit(log(x), log(y), 1)
exponential   $y = b e^{mx}$           polyfit(x, log(y), 1)
              $y = b \cdot 10^{mx}$    polyfit(x, log10(y), 1)
logarithmic   $y = m \ln(x) + b$       polyfit(log(x), y, 1)
              $y = m \log(x) + b$      polyfit(log10(x), y, 1)
reciprocal    $y = 1/(mx + b)$         polyfit(x, 1./y, 1)
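As a quick check of how the transformed fit returns the model coefficients, the reciprocal case can be tried on made-up data. The values m = 2 and b = 3 below are hypothetical and chosen only so that the recovered coefficients are easy to verify.

```matlab
% Hypothetical data generated exactly from y = 1/(m*x + b) with m = 2, b = 3
x = 1:5;
y = 1./(2*x + 3);

% Fit a straight line to 1/y versus x, since 1/y = m*x + b
p = polyfit(x, 1./y, 1);
m = p(1);     % slope of the transformed line
b = p(2);     % intercept of the transformed line

fprintf('m = %.4f, b = %.4f\n', m, b);
```

For the power and exponential rows of the table the intercept returned by polyfit is ln(b) or log(b), so it has to be passed through exp or 10^ to get b itself.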
Other considerations

 Exponential functions cannot pass through the origin


 Exponential functions can only fit data with all positive y’s
or all negative y’s
 Logarithmic functions cannot model x=0, or negative
values of x
 For the power function y=0 when x=0
 The reciprocal equation cannot model y=0
Example:

t    0.0   0.5   1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   5.0
w    6.00  4.83  3.70  3.15  2.41  1.83  1.49  1.21  0.96  0.73  0.64

• The data are first plotted with linear scales on both axes.

[Figure: w versus t, linear-linear axes]

• Power function?
• Logarithmic function?
• Reciprocal or exponential?
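Candidate models and the corresponding polyfit calls:

power
$w = b t^m$: polyfit(log(t), log(w), 1)

logarithmic
$w = m \ln(t) + b$: polyfit(log(t), w, 1)
$w = m \log(t) + b$: polyfit(log10(t), w, 1)

reciprocal
$w = 1/(mt + b)$: polyfit(t, 1./w, 1), since $1/w = mt + b$

exponential
$w = b e^{mt}$: polyfit(t, log(w), 1), since $\ln(w) = mt + \ln(b)$

The data include t = 0 with w = 6.00, so the power model (w = 0 at t = 0) and the logarithmic model (undefined at t = 0) can be ruled out. That leaves the reciprocal and exponential models.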
[Figure: w versus t with a linear x-axis and a logarithmic y-axis. The points are more or less linear, which suggests the exponential model.]

[Figure: 1/w versus t with linear axes. The points are not linear.]
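>> t = 0:0.5:5;
>> w = [6.00 4.83 3.70 3.15 2.41 1.83 1.49 1.21 0.96 0.73 0.64];
>> p = polyfit(t,log(w),1);   % exponential form: ln(w) = m*t + ln(b)
>> m = p(1);
>> b = exp(p(2));
>> tc = 0:0.1:5;
>> wc = b*exp(m*tc);
>> plot(t,w,'o',tc,wc)

[Figure: data points and the fitted exponential curve, w versus t. Fits well.]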
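The visual judgement ("more or less linear" versus "not linear") can be backed up with a number. The sketch below, which is one possible check rather than part of the original example, computes r² of the straight-line fit in each transformed coordinate system; the helper `r2fun` is a hypothetical name introduced here.

```matlab
t = 0:0.5:5;
w = [6.00 4.83 3.70 3.15 2.41 1.83 1.49 1.21 0.96 0.73 0.64];

% Hypothetical helper: r^2 of a straight-line fit of Y against X
r2fun = @(X, Y) 1 - sum((Y - polyval(polyfit(X, Y, 1), X)).^2) ...
                  / sum((Y - mean(Y)).^2);

r2_exp = r2fun(t, log(w));    % exponential model: ln(w) versus t
r2_rec = r2fun(t, 1./w);      % reciprocal model: 1/w versus t

% Coefficients of the exponential model w = b*exp(m*t)
p = polyfit(t, log(w), 1);
m = p(1);
b = exp(p(2));

fprintf('r2 (ln w vs t) = %.4f, r2 (1/w vs t) = %.4f\n', r2_exp, r2_rec);
fprintf('w = %.4f*exp(%.4f*t)\n', b, m);
```

Note that these r² values describe the fit in the transformed variables, not in the original w scale.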
General Linear Least-Squares and
Nonlinear Regression

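1. Exponential Model:
$$y = a e^{bx}$$

For this case, the sum of the squares of the residuals is
$$S_r = \sum_{i=1}^{n} \left(y_i - a e^{b x_i}\right)^2$$

Differentiate $S_r$ with respect to a and b and set the derivatives to zero:
$$\frac{\partial S_r}{\partial a} = \sum_{i=1}^{n} 2\left(y_i - a e^{b x_i}\right)\left(-e^{b x_i}\right) = 0$$
$$\frac{\partial S_r}{\partial b} = \sum_{i=1}^{n} 2\left(y_i - a e^{b x_i}\right)\left(-a x_i e^{b x_i}\right) = 0$$

 Rearrange to obtain:
$$-\sum_{i=1}^{n} y_i e^{b x_i} + a \sum_{i=1}^{n} e^{2 b x_i} = 0$$
$$\sum_{i=1}^{n} y_i x_i e^{b x_i} - a \sum_{i=1}^{n} x_i e^{2 b x_i} = 0$$

From the 1st equation:
$$a = \frac{\sum_{i=1}^{n} y_i e^{b x_i}}{\sum_{i=1}^{n} e^{2 b x_i}}$$

Substituting into the 2nd equation leaves a single nonlinear equation in b:
$$\sum_{i=1}^{n} y_i x_i e^{b x_i} - \frac{\sum_{i=1}^{n} y_i e^{b x_i}}{\sum_{i=1}^{n} e^{2 b x_i}} \sum_{i=1}^{n} x_i e^{2 b x_i} = 0$$
b can be found by numerical methods (such as bisection).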
Example:

Many patients get concerned when a test involves injection of a radioactive material.
For example, for scanning a gallbladder, a few drops of Technetium-99m isotope are used. Half of the Technetium-99m would be gone in about 6 hours. It, however, takes about 24 hours for the radiation levels to reach what we are exposed to in day-to-day activities.
Below is given the relative intensity of radiation as a function of time.

t (h)   0      1      3      5      7      9
γ       1.000  0.891  0.708  0.562  0.447  0.355

Use the exponential model
$$\gamma = A e^{\lambda t}$$
Find: the values of the regression constants A and λ
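 Plot data first:
>> t=[0 1 3 5 7 9];
>> gamma=[1. 0.891 0.708 0.562 0.447 0.355 ];
>> plot(t,gamma,'o')

[Figure: relative intensity γ versus time t]

For $\gamma = A e^{\lambda t}$, λ is found by solving the nonlinear equation
$$f(\lambda) = \sum_{i=1}^{n} \gamma_i t_i e^{\lambda t_i} - \frac{\sum_{i=1}^{n} \gamma_i e^{\lambda t_i}}{\sum_{i=1}^{n} e^{2 \lambda t_i}} \sum_{i=1}^{n} t_i e^{2 \lambda t_i} = 0$$

A is found from
$$A = \frac{\sum_{i=1}^{n} \gamma_i e^{\lambda t_i}}{\sum_{i=1}^{n} e^{2 \lambda t_i}}$$

λ = −0.1151, A = 0.9998

$$\gamma = 0.9998\, e^{-0.1151 t}$$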
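The nonlinear equation for λ can be solved with a few lines of bisection. This is a minimal sketch, not necessarily how the values above were obtained: the bracket [−0.2, −0.05] and the fixed 60 iterations are assumptions made here, and the data vector is named `gam` rather than `gamma` to avoid shadowing MATLAB's built-in gamma function.

```matlab
t   = [0 1 3 5 7 9];
gam = [1.0 0.891 0.708 0.562 0.447 0.355];

% A as a function of lambda (from the first normal equation)
Afun = @(lam) sum(gam.*exp(lam*t)) / sum(exp(2*lam*t));

% Nonlinear equation f(lambda) = 0 (second normal equation with A substituted)
f = @(lam) sum(gam.*t.*exp(lam*t)) - Afun(lam)*sum(t.*exp(2*lam*t));

% Bisection on an assumed bracket where f changes sign
lo = -0.2;
hi = -0.05;
for k = 1:60
    mid = (lo + hi)/2;
    if f(lo)*f(mid) <= 0
        hi = mid;      % root lies in [lo, mid]
    else
        lo = mid;      % root lies in [mid, hi]
    end
end
lambda = (lo + hi)/2;
A = Afun(lambda);

fprintf('lambda = %.4f, A = %.4f\n', lambda, A);
```

Before trusting a bracket like this one it is worth evaluating f at both ends to confirm the sign change.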
>> t=[0 1 3 5 7 9];
>> gamma=[1. 0.891 0.708 0.562 0.447 0.355 ];
>> plot(t,gamma,'o')
>> hold on
>> x=0:24;
>> plot(x,0.9998*exp(-0.1151*x))

[Figure: data points and the fitted curve γ = 0.9998 e^(−0.1151 t) from 0 to 24 h]

At t = 24 h,
$\gamma = 6.31 \times 10^{-2}$

After 24 hours, only about 6.3% of the radioactive material is left.
2. Polynomial Regression:

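Suppose that we fit a 2nd-order polynomial (quadratic):
$$y = a_0 + a_1 x + a_2 x^2 + e$$

For this case, the sum of the squares of the residuals is
$$S_r = \sum_{i=1}^{n} \left(y_i - a_0 - a_1 x_i - a_2 x_i^2\right)^2$$
 To generate the least-squares fit, we take the derivative of $S_r$ with respect to each unknown coefficient of the polynomial:
$$\frac{\partial S_r}{\partial a_0} = -2\sum \left(y_i - a_0 - a_1 x_i - a_2 x_i^2\right)$$
$$\frac{\partial S_r}{\partial a_1} = -2\sum x_i \left(y_i - a_0 - a_1 x_i - a_2 x_i^2\right)$$
$$\frac{\partial S_r}{\partial a_2} = -2\sum x_i^2 \left(y_i - a_0 - a_1 x_i - a_2 x_i^2\right)$$

 Then, these equations can be set to zero and rearranged:
$$n a_0 + \left(\sum x_i\right) a_1 + \left(\sum x_i^2\right) a_2 = \sum y_i$$
$$\left(\sum x_i\right) a_0 + \left(\sum x_i^2\right) a_1 + \left(\sum x_i^3\right) a_2 = \sum x_i y_i$$
$$\left(\sum x_i^2\right) a_0 + \left(\sum x_i^3\right) a_1 + \left(\sum x_i^4\right) a_2 = \sum x_i^2 y_i$$

3 linear equations with 3 unknowns; solve for $a_0$, $a_1$ and $a_2$.
 For an mth-order polynomial,
$$y = a_0 + a_1 x + a_2 x^2 + \dots + a_m x^m + e$$
the same procedure gives m + 1 linear equations in m + 1 unknowns.

 Standard error,
$$s_{y/x} = \sqrt{\frac{S_r}{n - (m+1)}}$$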
Example:

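 Fit a second-order polynomial to the data

x    0     1     2      3      4      5
y    2.1   7.7   13.6   27.2   40.9   61.1

 Simultaneous linear equations:
$$\begin{bmatrix} 6 & 15 & 55 \\ 15 & 55 & 225 \\ 55 & 225 & 979 \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \end{Bmatrix} = \begin{Bmatrix} 152.6 \\ 585.6 \\ 2488.8 \end{Bmatrix}$$

 Use MATLAB
>> N = [6 15 55;15 55 225;55 225 979];
>> r = [152.6; 585.6; 2488.8];    % right-hand side must be a column vector
>> a = N\r
a =
    2.4786
    2.3593
    1.8607

$$y = 2.4786 + 2.3593x + 1.8607x^2$$

Standard error of the estimate
$$s_{y/x} = \sqrt{\frac{3.7466}{6 - 3}} = 1.1175$$

Coefficient of determination
$$r^2 = \frac{S_t - S_r}{S_t} = \frac{2513.39 - 3.7466}{2513.39} = 0.9985$$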
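3. General Linear Least Squares
$$y = a_0 z_0 + a_1 z_1 + a_2 z_2 + a_3 z_3 + \dots + a_m z_m + e \qquad (*)$$

$z_0, z_1, \dots, z_m$: m + 1 basis functions

For simple linear regression: $z_0 = 1$, $z_1 = x$
For polynomial regression: $z_0 = 1$, $z_1 = x$, $z_2 = x^2$, …, $z_m = x^m$

Equation (*) in matrix notation:
$$\{y\} = [Z]\{a\} + \{e\}$$
[Z] is a matrix of the calculated values of the basis functions at the measured values of the independent variables:
$$[Z] = \begin{bmatrix} z_{01} & z_{11} & \cdots & z_{m1} \\ z_{02} & z_{12} & \cdots & z_{m2} \\ \vdots & \vdots & & \vdots \\ z_{0n} & z_{1n} & \cdots & z_{mn} \end{bmatrix}$$

m: number of variables
n: number of data points
$n \geq m + 1$: [Z] may not be a square matrix

$\{y\}^T = [y_1 \; y_2 \; \cdots \; y_n]$: observed values of the dependent variable
$\{a\}^T = [a_0 \; a_1 \; \cdots \; a_m]$: unknown coefficients
$\{e\}^T = [e_1 \; e_2 \; \cdots \; e_n]$: residuals

Sum of the squares of the residuals:
$$S_r = \sum_{i=1}^{n} \left(y_i - \sum_{j=0}^{m} a_j z_{ji}\right)^2$$

Minimize $S_r$ by taking its partial derivative with respect to each of the coefficients and setting the resulting equations equal to zero.

Normal equations:
$$[Z]^T [Z] \{a\} = [Z]^T \{y\}$$

Coefficient of determination:
$$r^2 = 1 - \frac{S_r}{S_t}, \qquad S_t = \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2$$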
Example:

 Fit a second order polynomial to the data

$$y = a_0 + a_1 x + a_2 x^2$$
Use MATLAB
>> x = [0 1 2 3 4 5]';
>> y = [2.1 7.7 13.6 27.2 40.9 61.1]';

% Create the Z matrix with columns 1, x, x^2
>> Z = [ones(size(x)) x x.^2]
Z =
     1     0     0
     1     1     1
     1     2     4
     1     3     9
     1     4    16
     1     5    25
% [Z]'[Z] is the coefficient matrix of the normal equations
>> Z'*Z
ans =
     6    15    55
    15    55   225
    55   225   979
% solve for the least-squares coefficients
>> a = (Z'*Z)\(Z'*y)
a =
    2.4786    % a0
    2.3593    % a1
    1.8607    % a2

$$y = 2.4786 + 2.3593x + 1.8607x^2$$

% compute the sum of the squares of the residuals
>> Sr = sum((y-Z*a).^2)
Sr =
    3.7466

% r^2 can be computed
>> r2 = 1-Sr/sum((y-mean(y)).^2)
r2 =
    0.9985
% the standard error s_y/x can be computed
>> syx = sqrt(Sr/(length(x)-length(a)))
syx =
    1.1175
 OR
>> x = [0 1 2 3 4 5]';
>> y = [2.1 7.7 13.6 27.2 40.9 61.1]';
>> a = polyfit(x,y,2)    % coefficients in descending powers of x
a =
    1.8607    2.3593    2.4786

 OR
>> x = [0 1 2 3 4 5]';
>> y = [2.1 7.7 13.6 27.2 40.9 61.1]';
>> Z = [ones(size(x)) x x.^2];
>> a = Z\y               % backslash returns the least-squares solution
a =
    2.4786
    2.3593
    1.8607

$$y = 2.4786 + 2.3593x + 1.8607x^2$$
4. Nonlinear Regression:
There are many cases in engineering and science where nonlinear models must be fit to data. These models have a nonlinear dependence on their parameters, such as,
$$y = a_0\left(1 - e^{-a_1 x}\right) + e$$

This equation cannot be manipulated into a linear form.

However, the coefficients may be found using optimization techniques to directly determine the least-squares fit.
 An objective function to compute the sum of squares:

$$f(a_0, a_1) = \sum_{i=1}^{n} \left[y_i - a_0\left(1 - e^{-a_1 x_i}\right)\right]^2$$

An optimization routine can be used to determine the $a_0$ and $a_1$ that minimize the function.
MATLAB's built-in function fminsearch can do that optimization.
[x, fval] = fminsearch(fun,x0,options,p1,p2,...)

x = a vector of the values of the parameters that minimize the function fun
fval = the value of the function at the minimum
x0 = a vector of the initial guesses for the parameters
options = a structure containing values of the optimization parameters as created with the optimset function
p1, p2, etc. = additional arguments that are passed on to fun
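A minimal sketch of this procedure for the model $y = a_0(1 - e^{-a_1 x})$ is given below. The data are hypothetical: they are generated without noise from $a_0 = 5$ and $a_1 = 0.5$ purely so that the result can be checked. The data are passed to the objective through an anonymous function instead of the extra arguments p1, p2, which is an alternative way of supplying them.

```matlab
% Hypothetical noise-free data generated from a0 = 5, a1 = 0.5 (illustration only)
x = 0:0.5:5;
a_true = [5 0.5];
y = a_true(1)*(1 - exp(-a_true(2)*x));

% Objective: sum of the squares of the residuals for y = a0*(1 - exp(-a1*x))
% The anonymous function captures x and y, so no extra arguments are needed
fSSR = @(a) sum((y - a(1)*(1 - exp(-a(2)*x))).^2);

% Minimize starting from an initial guess of [1 1]
[a_fit, fval] = fminsearch(fSSR, [1 1]);

fprintf('a0 = %.4f, a1 = %.4f, Sr = %.2e\n', a_fit(1), a_fit(2), fval);
```

With real, noisy data the minimum of the objective is not zero, and the result can depend on the initial guess, so it is worth trying more than one starting point.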
Example:

 Fit the power model to data

Previously, we fit the power model to the data by linearization using logarithms, and found:
$$y = 0.2741\, x^{1.9842}$$

Now use nonlinear regression.


 Let's create an M-file function to compute the sum of the squares. Call it fSSR.m,
function f = fSSR(a,xm,ym)
yp = a(1)*xm.^a(2);
f = sum((ym-yp).^2);
 In the command line,
>> x = [10 20 30 40 50 60 70 80];
>> y = [25 70 380 550 610 1220 830 1450];

The minimization of the function is then implemented by
>> fminsearch(@fSSR, [1, 1], [], x, y)
ans =
    2.5384    1.4359

so the nonlinear fit is
$$y = 2.5384\, x^{1.4359}$$

[Figure: data with the linearized and nonlinear power-model fits, y versus x]

From the plot alone it is difficult to tell which model describes the data best.
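One way to put a number on the comparison is to evaluate the sum of the squares of the residuals of both power-model fits on the original (untransformed) data. The sketch below uses the rounded coefficients quoted in this lecture; `Sr`, `Sr_lin` and `Sr_nl` are names introduced here for illustration.

```matlab
x = [10 20 30 40 50 60 70 80];
y = [25 70 380 550 610 1220 830 1450];

% Sum of the squares of the residuals for the power model y = a(1)*x^a(2)
Sr = @(a) sum((y - a(1)*x.^a(2)).^2);

Sr_lin = Sr([0.2741 1.9842]);   % coefficients from the log-log linearization
Sr_nl  = Sr([2.5384 1.4359]);   % coefficients from fminsearch

fprintf('Sr (linearized) = %.0f\n', Sr_lin);
fprintf('Sr (nonlinear)  = %.0f\n', Sr_nl);
```

The nonlinear fit gives the smaller value because it minimizes the residuals of y directly, whereas the linearized fit minimizes the residuals of log y, which weights the small-y points more heavily. Which criterion is more appropriate depends on how the measurement error behaves.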
