Machine Learning and Supervised Learning
Kritanta Saha
Assistant Professor
Dept. of Computer Science & Engineering
Sister Nivedita University
Jan 2025
What is Machine Learning?
It is a field of study that gives computers the ability to learn without being explicitly programmed. [Arthur Samuel, 1959]
What is Learning?
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. [Tom Mitchell, 1998]
ML Applications:
Prediction: e.g. Used Car Price Prediction
Classification: e.g. Is a tumor malignant or benign?
Clustering: e.g. Grouping social media users by engagement
Association Rule Mining: e.g. Market Basket Analysis
etc.
Different Types of Learning
[Figure: Used Car Price Prediction (Price in Rs. Lakhs vs. Miles Driven) and Tumor Malignant or Benign? (Tumor Size vs. Age)]
Supervised Learning: Given a labeled training set, learn a hypothesis that maps input features X to outputs Y.
Overview of Supervised Learning
[Diagram: Training Set → Learning Algorithm → Hypothesis (h); Features X → h → Predicted Y]
Solving a Regression Problem using Supervised Learning
Consider the regression problem: Used Car Price Prediction
Training Dataset:
Miles Driven (x1)   Engine Capacity (x2) in hp   Price (y)
1230                1000                         220000
3230                 890                         140500
...                  ...                            ...
4230                 980                          80000
Each training example is a pair (x^{(i)}, y^{(i)}).
Mean Squared Error (cost function):
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
Goal: \min_{\theta_0, \theta_1} J(\theta_0, \theta_1)
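A minimal sketch of this cost function in Python/NumPy (the names mse_cost, X, y, and theta are illustrative assumptions, not from the slides; X is assumed to carry a leading column of ones so that θ_0 acts as the intercept):

```python
import numpy as np

def mse_cost(theta, X, y):
    """Mean squared error J(theta) = (1 / 2m) * sum((h_theta(x^(i)) - y^(i))^2).

    X: (m, n+1) matrix whose first column is all ones (x0 = 1)
    y: (m,) vector of target prices
    theta: (n+1,) parameter vector
    """
    m = len(y)
    predictions = X @ theta          # h_theta(x^(i)) for every training example
    errors = predictions - y
    return np.sum(errors ** 2) / (2 * m)
```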
Understanding The Loss/Cost Function
Consider the simplified hypothesis h_\theta(x) = \theta_1 x (i.e., \theta_0 = 0).
J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
Goal: \min_{\theta_1} J(\theta_1)
[Figure: left, the training points (1, 1), (2, 2), (3, 3) with the hypotheses h_\theta(x) = x (\theta_1 = 1) and h_\theta(x) = 0.5x (\theta_1 = 0.5); right, J(\theta_1) plotted against \theta_1.]
For \theta_1 = 1: h_\theta(x) = x fits every point exactly, so J(1) = 0.
For \theta_1 = 0.5: J(0.5) = \frac{1}{2 \cdot 3}\left[ (0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2 \right] = 0.58
Varying \theta_1 shows that \theta_1 = 1 minimizes the cost function.
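The same calculation can be checked numerically on the three points used above, (1, 1), (2, 2), (3, 3); this small sketch reproduces J(0.5) ≈ 0.58 and J(1) = 0:

```python
import numpy as np

# Toy data from the plot: y = x at x = 1, 2, 3; hypothesis h_theta(x) = theta1 * x
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def J(theta1):
    """Cost for the simplified hypothesis: (1 / 2m) * sum((theta1 * x - y)^2)."""
    m = len(y)
    return np.sum((theta1 * x - y) ** 2) / (2 * m)

print(J(0.5))   # ~0.583, the slide's J(0.5) = 0.58
print(J(1.0))   # 0.0, the minimum
```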
Understanding The Loss/Cost Function (cont..)
Consider the hypothesis h_\theta(x) = \theta_0 + \theta_1 x.
Higher-order hypotheses are also possible, e.g.:
h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2
h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3
How do we find the parameter vector \Theta that minimizes the loss/cost function J(\Theta)?
To find the parameters that minimize the cost function J(\Theta), use Gradient Descent:
repeat {
    \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)    (for j = 0 and j = 1)
}
where \alpha is the Learning Rate.
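A minimal batch gradient descent loop for the two-parameter case, as a sketch (the function name, the fixed iteration count, and the default learning rate are assumptions for illustration):

```python
import numpy as np

def gradient_descent_univariate(x, y, alpha=0.01, num_iters=1000):
    """Fit h_theta(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        h = theta0 + theta1 * x                    # predictions on all examples
        grad0 = np.sum(h - y) / m                  # dJ/dtheta0
        grad1 = np.sum((h - y) * x) / m            # dJ/dtheta1
        theta0 -= alpha * grad0                    # simultaneous update
        theta1 -= alpha * grad1
    return theta0, theta1
```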
Gradient Descent For Multivariate Linear Regression
repeat {
    \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\Theta)    (simultaneously update for every j = 0, 1, ..., n)
}
where \alpha is the Learning Rate.
J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
\frac{\partial}{\partial \theta_j} J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
How to choose \alpha?
- If the cost J(\Theta) increases as gradient descent runs, \alpha is too large: use a smaller \alpha.
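A vectorized sketch of the multivariate update that also records J(Θ) after each iteration, which is one practical way to catch a too-large α (if the recorded costs grow, reduce α); the function name and defaults are assumptions:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=500):
    """Batch gradient descent for multivariate linear regression.

    X: (m, n+1) design matrix with a leading column of ones; y: (m,) targets.
    Returns the fitted theta and the history of J(theta) per iteration.
    """
    m = len(y)
    theta = np.zeros(X.shape[1])
    cost_history = []
    for _ in range(num_iters):
        errors = X @ theta - y                      # h_theta(x^(i)) - y^(i)
        theta = theta - alpha * (X.T @ errors) / m  # simultaneous update of all theta_j
        cost_history.append(np.sum(errors ** 2) / (2 * m))
    return theta, cost_history
```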
Feature Scaling
Idea: Make sure features are on a similar scale.
Feature scaling reduces the number of gradient descent steps needed to reach the optimum.
For example:
Let x_1 = Size of House (0 - 2000 feet^2)
and x_2 = Number of Bedrooms (1 - 5)
Scaled: x_1 = \frac{\text{Size of house}}{2000} and x_2 = \frac{\text{No. of Bedrooms}}{5}, so both features lie roughly in the range [0, 1].
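A small mean-normalization sketch, one common way to put features on a similar scale (dividing by the range alone, as in the size/2000 and bedrooms/5 example above, also works; the function name is an assumption):

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature to roughly [-1, 1]: (x - mean) / (max - min)."""
    mu = X.mean(axis=0)
    feature_range = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / feature_range, mu, feature_range

# Example: house size in square feet and number of bedrooms
X = np.array([[2000.0, 5.0],
              [1200.0, 3.0],
              [ 800.0, 2.0]])
X_scaled, mu, feature_range = mean_normalize(X)
```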
[Figure: Regression Problem vs. Classification Problem for the tumor example: Malignant (y = 1) / Benign (y = 0) plotted against Tumor Size, with the fitted hypothesis thresholded at 0.5 to separate the two classes.]
g(\Theta^T x) \geq 0.5 when \Theta^T x \geq 0
\Rightarrow predict y = 1 if \Theta^T x \geq 0, and y = 0 if \Theta^T x < 0.
[Figure: Decision Boundary in the (Tumor Size, Age) plane separating the y = 1 region from the y = 0 region.]
Logistic Regression: Cost Function
m training examples: (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)})
n features: x = [x_0, x_1, ..., x_n]^T \in \mathbb{R}^{n+1}, where x_0 = 1 and y \in \{0, 1\}
Hypothesis:
h_\theta(x) = \frac{1}{1 + e^{-\Theta^T x}}
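A sketch of this hypothesis together with the 0.5-threshold prediction rule from the decision-boundary discussion above (the names sigmoid and predict are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X):
    """h_theta(x) = g(Theta^T x); predict y = 1 when h_theta(x) >= 0.5,
    i.e. exactly when Theta^T x >= 0."""
    return (sigmoid(X @ theta) >= 0.5).astype(int)
```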
Cost Function for Linear Regression:
J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
i.e., Cost\left( h_\theta(x^{(i)}), y^{(i)} \right) = \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
With the sigmoid hypothesis, this squared-error cost is non-convex in \Theta, so logistic regression uses a different cost function.
Logistic Regression: Cost Function
Binary Classification Problem
Cost Function for Logistic Regression:
Cost(h_\theta(x), y) = -\log(h_\theta(x)) if y = 1;  -\log(1 - h_\theta(x)) if y = 0
J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} Cost\left( h_\theta(x^{(i)}), y^{(i)} \right)
J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left( 1 - h_\theta(x^{(i)}) \right) \right]
To fit the parameters: \min_{\Theta} J(\Theta)
\frac{\partial}{\partial \theta_j} J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
repeat {
    \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\Theta)    (simultaneously update for every j = 0, 1, ..., n)
}
where \alpha is the Learning Rate.
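The cross-entropy cost and a single gradient step as a NumPy sketch (the small eps added inside the logarithms is a numerical-stability assumption, not part of the slide's formula):

```python
import numpy as np

def logistic_cost(theta, X, y, eps=1e-12):
    """J(Theta) = -(1/m) * sum[y*log(h) + (1 - y)*log(1 - h)]."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps)) / m

def gradient_step(theta, X, y, alpha):
    """One simultaneous update theta_j := theta_j - alpha * dJ/dtheta_j."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return theta - alpha * (X.T @ (h - y)) / m
```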
Multiclass Classification: One-vs-All
[Figure: three one-vs-all classifiers h_\theta^{(1)}(x), h_\theta^{(2)}(x), h_\theta^{(3)}(x) in the (x_1, x_2) plane, each separating one class from the rest.]
h_\theta^{(i)}(x) = P(y = i \mid x; \theta) for i = 1, 2, 3: train a classifier h_\theta^{(i)}(x) for each class i to predict the probability that y = i.
On a new input x, to make a prediction, pick the class i that maximizes h_\theta^{(i)}(x).
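A sketch of that prediction step: score a new input with every per-class classifier and pick the class whose probability is largest (thetas holding one fitted parameter vector per class is an assumption of this example):

```python
import numpy as np

def predict_one_vs_all(thetas, x):
    """thetas: (K, n+1) array, one row of fitted parameters per class.
    x: (n+1,) feature vector with x0 = 1.
    Returns the index of the class i that maximizes h_theta^(i)(x)."""
    scores = 1.0 / (1.0 + np.exp(-(thetas @ x)))   # h_theta^(i)(x) for every class
    return int(np.argmax(scores))
```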
Regularization for Linear Regression
For example,
\min_{\theta} \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2
The large penalty coefficients force \theta_3 and \theta_4 towards zero, giving a simpler hypothesis that is less prone to overfitting.
With a regularization term, gradient descent becomes:
repeat {
    \theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]    (for j = 1, ..., n)
}
where \alpha is the Learning Rate and \lambda is the regularization parameter.
Regularization for Logistic Regression
J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left( 1 - h_\theta(x^{(i)}) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2
repeat {
    \theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]    (for j = 1, ..., n)
}
where \alpha is the Learning Rate and \lambda is the regularization parameter.
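A sketch of this regularized cost in NumPy; note that θ_0 is left out of the penalty, matching the sum starting at j = 1 (the function name and eps are assumptions):

```python
import numpy as np

def regularized_logistic_cost(theta, X, y, lam, eps=1e-12):
    """Cross-entropy cost plus (lambda / 2m) * sum_{j>=1} theta_j^2."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    unregularized = -np.sum(y * np.log(h + eps)
                            + (1 - y) * np.log(1 - h + eps)) / m
    penalty = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    return unregularized + penalty
```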