Supervised Learning - Regression - Annotated
Introductory
Artificial Intelligence
College of Engineering
Supervised Learning:
Regression
Classical Machine Learning
Supervised
Learning
What is Supervised Learning?
Definition
Regression
Classification
Linear Regression:
• When we believe the function is a straight line
• The basic idea in linear regression is to add up the effects of each of the
feature variables to produce the predicted value.
Nonlinear Regression:
• When we believe the function is not a line
• For example, the function is curved like a polynomial
Linear
Regression
What is Linear Regression - Simplified
Thinking of linear regression as a shopping bill
y = A + Bx (the familiar slope-intercept form y = mx + b, with A as the intercept and B as the slope)
Blood Pressure = A + B(dosage ~ X)
The coefficient A would represent the expected blood pressure when the dosage is zero.
The coefficient B would represent the average change in blood pressure when the dosage is increased by
one unit.
If B is negative, it would mean that an increase in dosage is associated with a decrease in blood
pressure.
If B is close to zero, it would mean that an increase in dosage is associated with no change in blood
pressure.
If B is positive, it would mean that an increase in dosage is associated with an increase in blood
pressure.
Depending on the value of B, researchers may decide to change the dosage given to a patient.
Linear Regression
Example #3
Crop yield = A + B(amount of fertilizer ~ X1) + C(amount of water ~ X2)
The coefficient A would represent the expected crop yield with no fertilizer or water.
The coefficient B would represent the average change in crop yield when fertilizer is increased by one
unit, assuming the amount of water remains unchanged.
The coefficient C would represent the average change in crop yield when water is increased by one
unit, assuming the amount of fertilizer remains unchanged.
Depending on the values of B and C, the scientists may change the amount of fertilizer and water used
to maximize the crop yield.
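The crop-yield model above can be sketched directly in code. A minimal Python sketch; the coefficient values and the function name `predict_yield` are made up for illustration, and real coefficients would come from fitting the model to field data:

```python
# Illustrative multivariate linear model: yield = A + B*x1 + C*x2.
# Coefficient values are invented for demonstration only.
A, B, C = 10.0, 0.5, 0.2

def predict_yield(fertilizer, water):
    """Predicted crop yield from fertilizer (x1) and water (x2) amounts."""
    return A + B * fertilizer + C * water

# With no fertilizer and no water, the prediction is just the intercept A.
print(predict_yield(0, 0))   # 10.0
print(predict_yield(10, 5))  # 10.0 + 5.0 + 1.0 = 16.0
```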
Linear Regression
Example #4
Points Scored = A + B(yoga sessions~ X1) + C(weightlifting sessions~X2)
The coefficient A would represent the expected points scored for a player who participates in zero
yoga sessions and zero weightlifting sessions.
The coefficient B would represent the average change in points scored when weekly yoga sessions is
increased by one, assuming the number of weekly weightlifting sessions remains unchanged.
The coefficient C would represent the average change in points scored when weekly weightlifting
sessions is increased by one, assuming the number of weekly yoga sessions remains unchanged.
Depending on the values of B and C, the data scientists may recommend that a player participate in
more or fewer weekly yoga and weightlifting sessions in order to maximize their points scored.
Evaluation
Regression
Loss Functions
In Machine Learning, our main goal is to minimize the error which is defined by the
Loss Function.
Loss Functions
• Sum of Errors (SE)
• Sum of Absolute Errors (SAE)
• Sum of Squared Errors (SSE)
• Mean Squared Error (MSE)
• Root Mean Squared Error (RMSE)
Linear Regression
Sum of Errors (SE)
The error is the difference between the predicted value and the actual value.
Example: for a data point (X = 5, Y_actual = 7), the actual value comes from the data table (the golden truth),
while the predicted value comes from the line produced by the AI: Y_predicted = 2(5) + 10 = 20.
The error is 20 - 7 = 13.
SE = (40 - 45) + (50 - 50) + (65 - 60)
SE = -5 + 0 + 5 = 0 (Loss)
A loss of zero here is misleading: the positive and negative errors cancel out, misleading me to believe my AI is great when it is not.
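The cancellation problem is easy to demonstrate: summing signed errors can report zero loss for a bad model, while summing absolute errors cannot. A quick Python sketch using the three errors from the slide:

```python
# Errors (predicted - actual) for the three points on the slide.
errors = [40 - 45, 50 - 50, 65 - 60]

se = sum(errors)                    # Sum of Errors: signs cancel out
sae = sum(abs(e) for e in errors)   # Sum of Absolute Errors: they cannot

print(se)   # 0  -> looks like a perfect model, but it is not
print(sae)  # 10 -> reveals the real total deviation
```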
Linear Regression
Sum of Absolute Errors (SAE)
Take the absolute values of the errors for all iterations.
y = A + Bx
Linear Regression - Optimization
Linear Regression
Example

# | X (Age) | Y (Cats) | XY | X² | Y²
1 | 25 | 2 | 50 | 625 | 4
2 | 30 | 2 | 60 | 900 | 4
3 | 19 | 1 | 19 | 361 | 1
4 | 5 | 1 | 5 | 25 | 1
5 | 80 | 5 | 400 | 6400 | 25
6 | 70 | 6 | 420 | 4900 | 36
7 | 65 | 4 | 260 | 4225 | 16
8 | 28 | 2 | 56 | 784 | 4
9 | 42 | 3 | 126 | 1764 | 9
10 | 39 | 3 | 117 | 1521 | 9
11 | 12 | 2 | 24 | 144 | 4
12 | 55 | 4 | 220 | 3025 | 16
13 | 13 | 1 | 13 | 169 | 1
14 | 45 | 2 | 90 | 2025 | 4
15 | 22 | 1 | 22 | 484 | 1
Sum | 550 | 39 | 1882 | 27352 | 135

A = 0.29344962
B = 0.0629059
Linear Regression
Example

A = 0.29344962
B = 0.0629059

y = A + Bx
y = 0.293 + 0.0629x

(Fitted to the same Age vs. Cats table, with Sum X = 550, Sum Y = 39, Sum XY = 1882, Sum X² = 27352, Sum Y² = 135.)
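The coefficients A and B above come from the ordinary least-squares summation formulas, B = (n·ΣXY − ΣX·ΣY)/(n·ΣX² − (ΣX)²) and A = (ΣY − B·ΣX)/n. A minimal Python sketch that reproduces them from the Age vs. Cats data; `fit_line` is a helper name of our choosing, not from the slides:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = A + B*x via the summation formulas."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

ages = [25, 30, 19, 5, 80, 70, 65, 28, 42, 39, 12, 55, 13, 45, 22]
cats = [2, 2, 1, 1, 5, 6, 4, 2, 3, 3, 2, 4, 1, 2, 1]

a, b = fit_line(ages, cats)
print(round(a, 5), round(b, 5))  # 0.29345 0.06291
```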
Goodness of Fit:
Overfitting,
Underfitting, and
Generalization
Linear Regression – 3. Evaluation
Linear Regression

y = A + Bx
y = 0.293 + 0.0629x

# | X (Age) | Y (Cats, Actual) | y (Predicted) | (Y − y)²
1 | 25 | 2 | 1.8655 | 0.018090
2 | 30 | 2 | 2.1800 | 0.032400
3 | 19 | 1 | 1.4881 | 0.238242
4 | 5 | 1 | 0.6075 | 0.154056
5 | 80 | 5 | 5.3250 | 0.105625
6 | 70 | 6 | 4.6960 | 1.700416
7 | 65 | 4 | 4.3815 | 0.145542
8 | 28 | 2 | 2.0542 | 0.002938
9 | 42 | 3 | 2.9348 | 0.004251
10 | 39 | 3 | 2.7461 | 0.064465
11 | 12 | 2 | 1.0478 | 0.906685
12 | 55 | 4 | 3.7525 | 0.061256
13 | 13 | 1 | 1.1107 | 0.012254
14 | 45 | 2 | 3.1235 | 1.262252
15 | 22 | 1 | 1.6768 | 0.458058

MSE = mean of (Y − y)² = 0.344435
RMSE = sqrt(0.344435) = 0.586885849 ≈ 0.59
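The MSE and RMSE above can be reproduced in a few lines of Python. The sketch below uses the rounded coefficients from the slide (0.293 and 0.0629), since the predicted values in the table were computed from those:

```python
ages = [25, 30, 19, 5, 80, 70, 65, 28, 42, 39, 12, 55, 13, 45, 22]
cats = [2, 2, 1, 1, 5, 6, 4, 2, 3, 3, 2, 4, 1, 2, 1]
a, b = 0.293, 0.0629  # rounded coefficients, as used in the slide's table

preds = [a + b * x for x in ages]
mse = sum((y - p) ** 2 for y, p in zip(cats, preds)) / len(cats)
rmse = mse ** 0.5

print(round(mse, 6), round(rmse, 6))  # 0.344435 0.586886
```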
Overfitting
Which one is the best?
[Figure: three "Age Versus Cat Ownership" scatter plots, Number of Cats (0–7) vs. Age (0–90 yrs), each showing a different candidate fit]
Is it the one with the best fit to the data?
Good Fit Underfitting Overfitting
[Figure: the same three scatter plots, labeled Good Fit, Underfitting, and Overfitting]
Goodness of Fit
How well is it going to predict future data?
[Figure: the three scatter plots again; the question is how well each fit will predict future data]
Generalization
What is Generalization
• For any real-world problem with input and output data, we can
map the inputs to the output using a function
• The goal of a supervised machine learning model is to produce a
model that
• understands the function between input and output for the
training data, but also
• generalizes this function so that it works on new,
unseen data with good accuracy
Generalization
Example
What is Model Prediction Error?
• High Model Prediction Error means the model has created a function that fails to
understand the relationship between input and output data
• Low Model Prediction Error means the model has created a function that has
understood the relationship between input and output data
| At home (Training) | At the exam (Testing) | Diagnosis
Good Fit (Our Target) | I solve problems. Many times I get the right answers. Sometimes I make some errors. | I go to the exam. From my own assessment at home I believed I could score 80, and I get 78 in the test. | Low Model Prediction Error, Low Performance Variance
Overfitting — Dangerous because it is undetectable (misleading; I discover it only after the product is out) | I memorized 10 problems and their solutions. I make 0 mistakes on any of these problems. | I go to the exam. I thought I would get 100%. I got 60 because I saw problems I did not memorize. | Low Model Prediction Error (in training), High Performance Variance
Underfitting — Not a big deal, since it is detectable (I can discover and fix it before the product is out) | I seem to have studied and developed some picture, but I still make a lot of mistakes. I think I will get 60%. | I go to the exam and perform the same way I do at home (same mistakes). I get 58%. | High Model Prediction Error, Low Performance Variance
Overfitting & Underfitting
How to detect them? How to handle them?
• Overfitting introduces the problem of high variance, and underfitting results in high
model prediction error; either way, the result is a bad model.
• We can identify these models during the training and testing phases:
• If a model shows high accuracy during the training phase but fails to
show similar accuracy during the testing phase, it indicates overfitting.
• If a model fails to show satisfactory accuracy during the training phase
itself, the model is underfitting.

Handle Overfitting:
• Increase training data
• Reduce model complexity
• Remove noise from data

Handle Underfitting:
• Increase training data
• Increase complexity of model
Overfitting vs. Underfitting
Example: A class consisting of 3 students & a Professor
Overfitting vs. Underfitting
Detection through Visual Inspection
Cross Validation
for Regression
Cross Validation for Regression
Techniques
• Test Set Method
• K-Fold Cross Validation
• Leave One Out Cross Validation (LOOCV)
Test Set Method
Cross Validation Methods
Test Set Method

1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.
4. Estimate your future performance with the test set.

Training set: rows 1–11 of the Age vs. Cats table; Test set: rows 12–15.
Cross Validation Methods
Test Set Method

Fit on the training set (rows 1–11):

# | X (Age) | Y (Cats) | XY | X² | Y²
1 | 25 | 2 | 50 | 625 | 4
2 | 30 | 2 | 60 | 900 | 4
3 | 19 | 1 | 19 | 361 | 1
4 | 5 | 1 | 5 | 25 | 1
5 | 80 | 5 | 400 | 6400 | 25
6 | 70 | 6 | 420 | 4900 | 36
7 | 65 | 4 | 260 | 4225 | 16
8 | 28 | 2 | 56 | 784 | 4
9 | 42 | 3 | 126 | 1764 | 9
10 | 39 | 3 | 117 | 1521 | 9
11 | 12 | 2 | 24 | 144 | 4

A = 0.504657584
B = 0.061322329

y = A + Bx
y = 0.505 + 0.0613x

Test set: rows 12–15.
Cross Validation Methods
Test Set Method

A = 0.504657584
B = 0.061322329

y = A + Bx
y = 0.505 + 0.0613x

Evaluate on the test set (rows 12–15):

# | X (Age) | Y (Cats) | y (Predicted) | (Y − y)²
12 | 55 | 4 | 3.877386 | 0.015034
13 | 13 | 1 | 1.301848 | 0.091112
14 | 45 | 2 | 3.264162 | 1.598107
15 | 22 | 1 | 1.853749 | 0.728887

Mean Squared Error (MSE) = 0.608285
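The whole test-set workflow — fit on rows 1–11, evaluate on rows 12–15 — can be sketched in Python. `fit_line` is a helper name of our choosing that implements the least-squares summation formulas:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = A + B*x via the summation formulas."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - b * sx) / n, b

ages = [25, 30, 19, 5, 80, 70, 65, 28, 42, 39, 12, 55, 13, 45, 22]
cats = [2, 2, 1, 1, 5, 6, 4, 2, 3, 3, 2, 4, 1, 2, 1]

# Rows 1-11 train, rows 12-15 test (the split used on the slide).
a, b = fit_line(ages[:11], cats[:11])
sq_errs = [(y - (a + b * x)) ** 2 for x, y in zip(ages[11:], cats[11:])]
mse = sum(sq_errs) / len(sq_errs)

print(round(a, 6), round(b, 6))  # 0.504658 0.061322
print(round(mse, 6))             # 0.608285
```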
Cross Validation Methods
Test Set Method – the Right Way

1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.
4. Estimate your future performance with the test set.

Training set (randomly chosen): rows 2, 9, 4, 7, 6, 15, 5, 8, 13, 11, 3
Test set (randomly chosen): rows 1, 12, 10, 14
Test Set Method
Pros & Cons

Pros:
• Very simple
• Can then simply choose the method with the best test-set score
Cross Validation Methods
5-Fold Cross Validation

1. Split the data into 5 groups.
2. For each unique group:
1. Take the group as a hold-out or test data set.
2. Take the remaining groups as a training data set.
3. Perform your regression on the training set and evaluate it on the test set.

Fold #1 – Test set: rows 1–3; Training set: rows 4–15.
Fold #1
Cross Validation Methods
5-Fold Cross Validation

A = 0.39759
B = 0.061405

y = A + Bx
y = 0.39759 + 0.061405x

# | X (Age) | Y (Cats) | y (Predicted) | (Y − y)²
1 | 25 | 2 | 1.932722 | 0.004526
2 | 30 | 2 | 2.239749 | 0.057479
3 | 19 | 1 | 1.564291 | 0.318424

Mean Squared Error (MSE) = 0.126809
Fold #2
Cross Validation Methods
5-Fold Cross Validation

A = 0.41913
B = 0.055621

y = A + Bx
y = 0.41913 + 0.055621x

# | X (Age) | Y (Cats) | y (Predicted) | (Y − y)²
4 | 5 | 1 | 0.697237 | 0.091666
5 | 80 | 5 | 4.868839 | 0.017203
6 | 70 | 6 | 4.312626 | 2.847232

Mean Squared Error (MSE) = 0.985367
Cross Validation Methods – Fold #2
5-Fold Cross Validation

1. Split the data into 5 groups.
2. For each unique group:
1. Take the group as a hold-out or test data set.
2. Take the remaining groups as a training data set.
3. Perform your regression on the training set and evaluate it on the test set.

Fold #2 – Test set: rows 4–6; Training set: rows 1–3 and 7–15.
Cross Validation Methods – Fold #3
5-Fold Cross Validation

1. Split the data into 5 groups.
2. For each unique group:
1. Take the group as a hold-out or test data set.
2. Take the remaining groups as a training data set.
3. Perform your regression on the training set and evaluate it on the test set.

Fold #3 – Test set: rows 7–9; Training set: rows 1–6 and 10–15.
Fold #3
Cross Validation Methods
5-Fold Cross Validation

A = 0.264577
B = 0.064639

y = A + Bx
y = 0.264577 + 0.064639x

# | X (Age) | Y (Cats) | y (Predicted) | (Y − y)²
7 | 65 | 4 | 4.466095 | 0.217244
8 | 28 | 2 | 2.074462 | 0.005545
9 | 42 | 3 | 2.979404 | 0.000424

Mean Squared Error (MSE) = 0.074404
Cross Validation Methods – Fold #4
5-Fold Cross Validation

1. Split the data into 5 groups.
2. For each unique group:
1. Take the group as a hold-out or test data set.
2. Take the remaining groups as a training data set.
3. Perform your regression on the training set and evaluate it on the test set.

Fold #4 – Test set: rows 10–12; Training set: rows 1–9 and 13–15.
Fold #4
Cross Validation Methods
5-Fold Cross Validation

A = 0.060635
B = 0.065929

y = A + Bx
y = 0.060635 + 0.065929x

# | X (Age) | Y (Cats) | y (Predicted) | (Y − y)²
10 | 39 | 3 | 2.631858 | 0.135529
11 | 12 | 2 | 0.851781 | 1.318408
12 | 55 | 4 | 3.686718 | 0.098146

Mean Squared Error (MSE) = 0.517361
Cross Validation Methods – Fold #5
5-Fold Cross Validation

1. Split the data into 5 groups.
2. For each unique group:
1. Take the group as a hold-out or test data set.
2. Take the remaining groups as a training data set.
3. Perform your regression on the training set and evaluate it on the test set.

Fold #5 – Test set: rows 13–15; Training set: rows 1–12.
Fold #5
Cross Validation Methods
5-Fold Cross Validation

A = 0.50274
B = 0.061632

y = A + Bx
y = 0.50274 + 0.061632x

# | X (Age) | Y (Cats) | y (Predicted) | (Y − y)²
13 | 13 | 1 | 1.303958 | 0.092391
14 | 45 | 2 | 3.276188 | 1.628655
15 | 22 | 1 | 1.858648 | 0.737276

Mean Squared Error (MSE) = 0.81944
Cross Validation Methods
Overall Test MSE

The overall test MSE is the average of the five fold MSEs:
(0.126809 + 0.985367 + 0.074404 + 0.517361 + 0.81944) / 5 ≈ 0.5047
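The five folds above can be reproduced in a single loop. A Python sketch assuming the folds are consecutive groups of three rows, as in the slides; `fit_line` is a helper name of our choosing:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = A + B*x via the summation formulas."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - b * sx) / n, b

ages = [25, 30, 19, 5, 80, 70, 65, 28, 42, 39, 12, 55, 13, 45, 22]
cats = [2, 2, 1, 1, 5, 6, 4, 2, 3, 3, 2, 4, 1, 2, 1]

fold_mses = []
for k in range(5):                        # fold k holds out rows 3k..3k+2
    test = set(range(3 * k, 3 * k + 3))
    train_x = [x for i, x in enumerate(ages) if i not in test]
    train_y = [y for i, y in enumerate(cats) if i not in test]
    a, b = fit_line(train_x, train_y)
    errs = [(cats[i] - (a + b * ages[i])) ** 2 for i in test]
    fold_mses.append(sum(errs) / len(errs))

overall = sum(fold_mses) / len(fold_mses)
print([round(m, 4) for m in fold_mses])  # ~[0.1268, 0.9854, 0.0744, 0.5174, 0.8194]
print(round(overall, 4))                 # ~0.5047
```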
LOOCV
Activity

Leave out the record (5, 6) and fit on the remaining four:

A = -1.5
B = 1.6

y = A + Bx
y = -1.5 + 1.6x

Students | Actual Books (Y) | Predicted Books (y) | Error
5 | 6 | 6.5 | 0.5
3 | 4 | |
7 | 10 | |
6 | 8 | |
4 | 4 | |
LOOCV
Activity

Leave out the record (3, 4) and fit on the remaining four:

A = -4
B = 2

y = A + Bx
y = -4 + 2x

Students | Actual Books (Y) | Predicted Books (y) | Error
5 | 6 | |
3 | 4 | 2 | 2
7 | 10 | |
6 | 8 | |
4 | 4 | |
LOOCV
Activity

Leave out the record (6, 8) and fit on the remaining four:

A = -1.6
B = 1.6

y = A + Bx
y = -1.6 + 1.6x

Students | Actual Books (Y) | Predicted Books (y) | Error
5 | 6 | |
3 | 4 | |
7 | 10 | |
6 | 8 | 8 | 0
4 | 4 | |
LOOCV
Activity

Leave out the record (7, 10) and fit on the remaining four:

A = -0.8
B = 1.4

y = A + Bx
y = -0.8 + 1.4x

Students | Actual Books (Y) | Predicted Books (y) | Error
5 | 6 | |
3 | 4 | |
7 | 10 | 9 | 1
6 | 8 | |
4 | 4 | |
LOOCV
Activity

Students | Actual Books (Y) | Predicted Books (y) | Error
5 | 6 | 6.5 | 0.5
3 | 4 | 2 | 2
7 | 10 | 9 | 1
6 | 8 | 8 | 0
4 | 4 | 5.14 | 1.14

RMSE Mean = 0.928
RMSE Std = 0.749
LOOCV
Activity

Leave out the record (4, 4) and fit on the remaining four:

A = -0.8
B = 1.48

y = A + Bx
y = -0.8 + 1.48x

Students | Actual Books (Y) | Predicted Books (y) | Error
5 | 6 | |
3 | 4 | |
7 | 10 | |
6 | 8 | |
4 | 4 | 5.14 | 1.14
Cross Validation Methods
LOOCV

For k = 1 to N:
1. Let (x_k, y_k) be the kth record.
2. Temporarily remove (x_k, y_k) from the dataset.
3. Train on the remaining N−1 data points.
4. Note your error on (x_k, y_k).

Method | Cost | Data usage | Best when
Leave One Out | Expensive | Doesn't waste data | Very limited data available
K-Fold | Less expensive than LOOCV | Doesn't waste data | Somewhere in between
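The LOOCV loop above, applied to the students-and-books activity data, can be sketched in Python. Averaging the absolute errors reproduces the activity's mean error of about 0.93 (the slides show 0.928 because they round the last prediction to 5.14 first); `fit_line` is a helper name of our choosing:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = A + B*x via the summation formulas."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - b * sx) / n, b

students = [5, 3, 7, 6, 4]
books = [6, 4, 10, 8, 4]

abs_errors = []
for k in range(len(students)):
    # Temporarily remove record k, train on the remaining N-1 points.
    xs = students[:k] + students[k + 1:]
    ys = books[:k] + books[k + 1:]
    a, b = fit_line(xs, ys)
    abs_errors.append(abs(books[k] - (a + b * students[k])))

mean_abs_error = sum(abs_errors) / len(abs_errors)
print(round(mean_abs_error, 2))  # 0.93
```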
Nonlinear
Regression
Nonlinear Regression
Popular nonlinear regression models

Exponential Model: y = a·e^(bx)
Power Model: y = a·x^b
Saturation Growth Model: y = ax / (b + x)
Polynomial Model: y = a₀ + a₁x + … + aₘx^m
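These model forms translate directly into code. A Python sketch of the four families; the parameter values in the calls are illustrative only, and fitting a and b to data is a separate step not shown here:

```python
import math

def exponential(a, b, x):        # y = a * e^(b*x)
    return a * math.exp(b * x)

def power(a, b, x):              # y = a * x^b
    return a * x ** b

def saturation_growth(a, b, x):  # y = a*x / (b + x)
    return a * x / (b + x)

def polynomial(coeffs, x):       # y = a0 + a1*x + ... + am*x^m
    return sum(c * x ** i for i, c in enumerate(coeffs))

# Illustrative values only.
print(exponential(2, 0, 7))         # 2.0  (b = 0 gives a flat line at a)
print(power(2, 3, 2))               # 2 * 2^3 = 16
print(saturation_growth(10, 5, 5))  # 10*5 / (5+5) = 5.0
print(polynomial([1, 0, 2], 3))     # 1 + 0*3 + 2*3^2 = 19
```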
Regression Models
Advantages & Disadvantages
Regression Model | Advantages | Disadvantages
Linear Regression | • Works well irrespective of the dataset size • Gives information about the relevance of features • Simple to implement, and the output coefficients are easy to interpret | • Relies on the assumptions of linear regression • Susceptible to overfitting
Polynomial Regression | • Works on any size of dataset • Works very well on nonlinear problems | • We need to choose the right polynomial degree for a good model prediction error / variance tradeoff
Breakout Session
Linear Regression
Class Activity
Suppose that an extensive study is carried out, and it is found that in a particular country, the life
expectancy (the average number of years that people live) among non-smoking women who don't eat any
vegetables is 80 years. Suppose further that on the average, men live 5 years less. Also take the numbers
mentioned above: every cigarette per day reduces the life expectancy by half a year, and a handful of
veggies per day increases it by one year.
Calculate the life expectancies for the following example cases:
For example, the first case is a male (subtract 5 years), smokes 8 cigarettes per day (subtract 8 × 0.5 = 4
years), and eats two handfuls of veggies per day (add 2 × 1 = 2 years), so the predicted life expectancy is
80 - 5 - 4 + 2 = 73 years.
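The worked case above can be checked with a small Python function encoding the four effects from the study description; the function name is ours, not from the activity:

```python
def life_expectancy(is_male, cigarettes_per_day, veggie_handfuls_per_day):
    years = 80.0                            # baseline: non-smoking woman, no veggies
    years -= 5.0 if is_male else 0.0        # men live 5 years less on average
    years -= 0.5 * cigarettes_per_day       # each daily cigarette costs half a year
    years += 1.0 * veggie_handfuls_per_day  # each daily handful of veggies adds a year
    return years

# The worked case: male, 8 cigarettes/day, 2 handfuls of veggies/day.
print(life_expectancy(True, 8, 2))  # 80 - 5 - 4 + 2 = 73.0
```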
Linear Regression
Activity

Students (x) | Books (y) | x² | xy
5 | 6 | 25 | 30
3 | 4 | 9 | 12
7 | 10 | 49 | 70
6 | 8 | 36 | 48
4 | 4 | 16 | 16
5 | 7 | 25 | 35
Sum: x = 30, y = 39, x² = 160, xy = 211

n = 6, mean x = 5, mean y = 6.5
A = -1.5
Linear Regression
Activity

The least-squares formulas use Sum X, Sum X², Sum Y, and Sum XY.

Students (x) | Books (y) | x² | xy
5 | 6 | 25 | 30
3 | 4 | 9 | 12
7 | 10 | 49 | 70
6 | 8 | 36 | 48
4 | 4 | 16 | 16
5 | 7 | 25 | 35
Sum: x = 30, y = 39, x² = 160, xy = 211

n = 6, mean x = 5, mean y = 6.5
B = 1.6
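Plugging the activity's sums into the least-squares formulas reproduces A = -1.5 and B = 1.6. A Python sketch; `fit_line` is a helper name of our choosing:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = A + B*x via the summation formulas."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - b * sx) / n, b

students = [5, 3, 7, 6, 4, 5]
books = [6, 4, 10, 8, 4, 7]

a, b = fit_line(students, books)
print(a, b)  # -1.5 1.6 (to within floating-point rounding)
```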