
Introduction of Machine / Deep Learning

Hung-yi Lee 李宏毅
Machine Learning
≈ Looking for a Function
• Speech Recognition:  f(an audio clip) = "How are you"
• Image Recognition:   f(an image) = "Cat"
• Playing Go:          f(a board position) = "5-5" (next move)

Different types of Functions
Regression: The function outputs a scalar.
    e.g. predict tomorrow's PM2.5: f(PM2.5 today, temperature, concentration of O3) = PM2.5 of tomorrow

Classification: Given options (classes), the function outputs the correct one.
    e.g. spam filtering: f(an email) = Yes/No
Different types of Functions
Classification: Given options (classes), the function outputs the correct one.
    e.g. playing Go: the function takes a position on the board and outputs the next move;
    each position on the board is a class (19 x 19 classes).

Structured Learning: create something with structure (an image, a document)
    — going beyond regression and classification.
How to find a function?
A Case Study
YouTube Channel

https://www.youtube.com/c/HungyiLeeNTU
The function we want to find …

    y = f(...)    where y is the no. of views on 2/26

1. Function with Unknown Parameters

    Model:  y = b + w·x_1   (based on domain knowledge)

    y:   no. of views on 2/26
    x_1: no. of views on 2/25 (the feature)
    w and b are unknown parameters (learned from data): w is the weight, b is the bias.
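To make the later steps concrete, here is a minimal Python sketch of this one-feature model (my own illustration; the names `model`, `x1`, `w`, `b` are not from the slides):

```python
def model(x1, w, b):
    """Predicted no. of views for tomorrow, given today's views x1 (in thousands)."""
    return b + w * x1
```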
2. Define Loss from Training Data
• Loss is a function of the parameters: L(b, w)
• Loss: how good a set of values is.

    e.g. L(0.5k, 1): the model y = b + w·x_1 with b = 0.5k and w = 1, i.e. y = 0.5k + 1·x_1. How good is it?

    Data from 2017/01/01 – 2020/12/31 (daily views): 4.8k, 4.9k, 7.5k, …, 3.4k, 9.8k

    Prediction for 01/02 from 01/01:  y = 0.5k + 1 × 4.8k = 5.3k;   label ŷ = 4.9k (the true views on 01/02)
    e_1 = |y − ŷ| = 0.4k
    Likewise, predicting 01/03 from 01/02:  y = 0.5k + 1 × 4.9k = 5.4k;   label ŷ = 7.5k
    e_2 = |y − ŷ| = 2.1k
    … and similarly e_3, e_4, … for every day in the training period.
    Collect the error e for every day and average:

        Loss:  L = (1/N) Σ_n e_n

    e = |y − ŷ|      →  L is the mean absolute error (MAE)
    e = (y − ŷ)^2    →  L is the mean square error (MSE)
    If y and ŷ are both probability distributions → cross-entropy
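A small numpy sketch (an illustration, not code from the slides) of how L(b, w) could be computed over a series of daily view counts; the toy numbers below are made up:

```python
import numpy as np

views = np.array([4.8, 4.9, 7.5, 3.4, 9.8])   # toy daily views in thousands (real data: 2017-2020)

def loss(b, w, views, kind="mae"):
    """Average error of y = b + w * x1, where x1 = views on day n and the label is views on day n+1."""
    x1 = views[:-1]                       # feature: today's views
    y_hat = views[1:]                     # label: tomorrow's true views
    y = b + w * x1                        # model prediction
    e = np.abs(y - y_hat) if kind == "mae" else (y - y_hat) ** 2
    return e.mean()                       # L = (1/N) * sum_n e_n

print(loss(0.5, 1.0, views))              # L(0.5k, 1) on the toy data
```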
2. Define Loss from Training Data
    Model: y = b + w·x_1
• Loss is a function of the parameters: L(b, w)
    Error Surface: evaluate L for many different (b, w) pairs and plot the result as a contour map;
    some regions have small L, others large L.

3. Optimization:   w*, b* = arg min_{w,b} L

Gradient Descent
• (Randomly) pick an initial value w^0
• Compute ∂L/∂w |_{w = w^0}
    negative slope → increase w;   positive slope → decrease w
(Source of image: http://chico386.pixnet.net/album/photo/171572850)
• Update:   w^1 ← w^0 − η · ∂L/∂w |_{w = w^0}
    η is the learning rate, a hyperparameter (set by us rather than learned).
• Update w iteratively:   w^0 → w^1 → w^2 → … → w^T
    Gradient descent can stop at a local minimum of L rather than at the global minimum —
    but does a local minimum truly cause the problem?
3. Optimization:   w*, b* = arg min_{w,b} L

• (Randomly) pick initial values w^0, b^0
• Compute the gradients and update:

    w^1 ← w^0 − η · ∂L/∂w |_{w = w^0, b = b^0}
    b^1 ← b^0 − η · ∂L/∂b |_{w = w^0, b = b^0}

    (Computing the gradients can be done in one line in most deep learning frameworks.)

• Update w and b iteratively
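A runnable sketch of these updates for the one-feature model (my own illustration; MSE is used here because its gradient is smooth, while the slides report MAE numbers):

```python
import numpy as np

views = np.array([4.8, 4.9, 7.5, 3.4, 9.8])      # toy daily views in thousands
x1, y_hat = views[:-1], views[1:]                # feature: today's views; label: tomorrow's views

w, b = 0.0, 0.0                                  # initial values w^0, b^0 (could also be random)
eta = 0.01                                       # learning rate eta (a hyperparameter)

for step in range(1000):
    y = b + w * x1                               # prediction
    dL_dw = 2 * np.mean((y - y_hat) * x1)        # dL/dw for the MSE loss
    dL_db = 2 * np.mean(y - y_hat)               # dL/db for the MSE loss
    w, b = w - eta * dL_dw, b - eta * dL_db      # w <- w - eta*dL/dw,  b <- b - eta*dL/db

print(w, b)                                      # parameters after 1000 updates
```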
Model: y = b + w·x_1

3. Optimization:   w*, b* = arg min_{w,b} L

    Starting from some (w^0, b^0), repeatedly compute ∂L/∂w and ∂L/∂b and move by
    (−η ∂L/∂w, −η ∂L/∂b) across the error surface.

    Result:  w* = 0.97,  b* = 0.1k,  L(w*, b*) = 0.48k
Machine Learning is so simple ……

    y = b + w·x_1    →    w* = 0.97,  b* = 0.1k,  L(w*, b*) = 0.48k

    Step 1: function with unknown parameters
    Step 2: define loss from training data
    Step 3: optimization

    These three steps together are the training procedure.

Training:  y = 0.1k + 0.97·x_1 achieves the smallest loss L = 0.48k
on the data of 2017 – 2020 (the training data).
How about the data of 2021 (unseen during training)?
    On the 2021 data the loss is L′ = 0.58k.
    [Plot of daily views (k), 2021/01/01 – 2021/02/14 — red: real no. of views; blue: views estimated by y = 0.1k + 0.97·x_1.]
Using the views of more past days as features:

                                       2017 – 2020      2021
    y = b + w·x_1                      L  = 0.48k       L′ = 0.58k
    y = b + Σ_{j=1}^{7}  w_j x_j       L  = 0.38k       L′ = 0.49k
    y = b + Σ_{j=1}^{28} w_j x_j       L  = 0.33k       L′ = 0.46k
    y = b + Σ_{j=1}^{56} w_j x_j       L  = 0.32k       L′ = 0.46k

    Learned parameters for the 7-day model:
    b = 0.05k,  w_1 = 0.79,  w_2 = −0.31,  w_3 = 0.12,  w_4 = −0.01,  w_5 = −0.10,  w_6 = 0.30,  w_7 = 0.18
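A sketch (my own; the helper name `make_dataset` is hypothetical) of how such multi-day feature vectors can be built from the series of daily views:

```python
import numpy as np

def make_dataset(views, k):
    """Features x = views of the past k days; label y_hat = views of the following day."""
    X = np.array([views[i:i + k] for i in range(len(views) - k)])
    y_hat = np.array(views[k:])
    return X, y_hat

views = [4.8, 4.9, 7.5, 3.4, 9.8, 5.3, 6.1, 7.0, 6.4, 5.9]   # toy data in thousands
X, y_hat = make_dataset(views, k=7)        # k = 7, 28 or 56 as in the table above
y = X @ np.full(7, 0.1) + 0.05             # y = b + sum_j w_j x_j with made-up w_j and b
```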
Linear models

Linear models are too simple … we need more sophisticated models.
    For y = b + w·x_1, different w only change the slope and different b only shift the line:
    the relation between x_1 and y is always a straight line.

    This severe limitation of linear models is called Model Bias.
    We need a more flexible model!
red curve = constant + sum of a set of blue "hard sigmoid" curves

All Piecewise Linear Curves
    = constant + sum of a set of such hard sigmoid curves
    More pieces require more hard sigmoid curves.

Beyond Piecewise Linear?
    Approximate a continuous curve y(x_1) by a piecewise linear curve.
    To have a good approximation, we need sufficient pieces.
red curve = constant + sum of a set of hard sigmoids
How to represent a hard sigmoid? Approximate it with a Sigmoid Function:

    y = c · 1 / (1 + e^{−(b + w·x_1)}) = c · sigmoid(b + w·x_1)

    Different w  →  change the slope
    Different b  →  shift the curve left/right
    Different c  →  change the height
red curve = sum of a set of sigmoids + constant

    y = b + Σ_i c_i · sigmoid(b_i + w_i x_1)

    e.g. red curve 0 = constant b + curve 1 + curve 2 + curve 3, where
    curve 1 is c_1 sigmoid(b_1 + w_1 x_1), curve 2 is c_2 sigmoid(b_2 + w_2 x_1), curve 3 is c_3 sigmoid(b_3 + w_3 x_1).
New Model: More Features

    one feature:        y = b + w·x_1          →    y = b + Σ_i c_i sigmoid(b_i + w_i x_1)
    several features:   y = b + Σ_j w_j x_j    →    y = b + Σ_i c_i sigmoid(b_i + Σ_j w_ij x_j)

    j: 1, 2, 3  indexes the features;   i: 1, 2, 3  indexes the sigmoids
Writing out the arguments of the sigmoids (w_ij: weight of feature x_j for the i-th sigmoid):

    r_1 = b_1 + w_11 x_1 + w_12 x_2 + w_13 x_3
    r_2 = b_2 + w_21 x_1 + w_22 x_2 + w_23 x_3
    r_3 = b_3 + w_31 x_1 + w_32 x_2 + w_33 x_3
    y = b + Σ_i c_i sigmoid(b_i + Σ_j w_ij x_j),    i: 1,2,3;   j: 1,2,3

    In matrix form:

    [r_1]   [b_1]   [w_11 w_12 w_13] [x_1]
    [r_2] = [b_2] + [w_21 w_22 w_23] [x_2]
    [r_3]   [b_3]   [w_31 w_32 w_33] [x_3]

    r = b + W x
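In code this is a single matrix-vector product; a short sketch with arbitrary numbers:

```python
import numpy as np

W = np.array([[0.2, -0.1, 0.4],
              [0.5,  0.3, -0.2],
              [-0.3, 0.1, 0.6]])     # weights w_ij (made-up values)
b_vec = np.array([0.1, -0.2, 0.3])   # biases b_i
x = np.array([4.8, 4.9, 7.5])        # features x_1, x_2, x_3

r = b_vec + W @ x                    # r = b + W x
```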

    Pass each r_i through a sigmoid to get a_i:

        a_i = sigmoid(r_i) = 1 / (1 + e^{−r_i})

    or, for the whole vector at once:   a = σ(r)
    Finally, weight each a_i by c_i, sum them up, and add the constant b:

        y = b + c^T a

    where   a = σ(r)   and   r = b + W x
    Putting it all together:

        y = b + c^T σ(b + W x)

Function with unknown parameters

        y = b + c^T σ(b + W x),    x: the feature vector

    The unknown parameters are W, the vector b, c^T and the scalar b. Collect them all
    (e.g. the rows of W, the entries of b, c and b) into one long vector:

        θ = [θ_1, θ_2, θ_3, …]^T
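A compact numpy sketch (my own illustration, with random made-up parameters) of the whole function y = b + c^T σ(b + W x):

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def forward(x, W, b_vec, c, b):
    """y = b + c^T sigmoid(b_vec + W x)"""
    r = b_vec + W @ x          # r = b + W x
    a = sigmoid(r)             # a = sigma(r)
    return b + c @ a           # y = b + c^T a

rng = np.random.default_rng(0)                       # 3 sigmoids, 3 features, random parameters
W, b_vec, c = rng.normal(size=(3, 3)), rng.normal(size=3), rng.normal(size=3)
y = forward(np.array([4.8, 4.9, 7.5]), W, b_vec, c, b=0.1)
```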
Back to ML Framework

    Step 1: function with unknown parameters → Step 2: define loss from training data → Step 3: optimization

        y = b + c^T σ(b + W x)
Loss
• Loss is a function of the parameters: L(θ)
• Loss means how good a set of values is.

    Given a set of values θ: feed a feature x into y = b + c^T σ(b + W x), compare the
    prediction y with the label ŷ to get the error e, and average over the training data:

        Loss:  L = (1/N) Σ_n e_n
Back to ML Framework

    Step 1: function with unknown parameters → Step 2: define loss from training data → Step 3: optimization

        y = b + c^T σ(b + W x)
Optimization of New Model

    θ* = arg min_θ L,    where θ = [θ_1, θ_2, θ_3, …]^T

• (Randomly) pick initial values θ^0
• Compute the gradient:

        g = ∇L(θ^0) = [∂L/∂θ_1 |_{θ=θ^0},  ∂L/∂θ_2 |_{θ=θ^0},  …]^T

• Update:   θ^1 ← θ^0 − η g,   i.e.   θ_i^1 ← θ_i^0 − η ∂L/∂θ_i |_{θ=θ^0} for every i
Optimization of New Model
    θ* = arg min_θ L

• (Randomly) pick initial values θ^0
• Compute gradient g = ∇L(θ^0);   θ^1 ← θ^0 − η g
• Compute gradient g = ∇L(θ^1);   θ^2 ← θ^1 − η g
• Compute gradient g = ∇L(θ^2);   θ^3 ← θ^2 − η g
  …
Optimization of New Model
    θ* = arg min_θ L

    In practice, the N training examples are divided into batches of size B.
    Each update uses the loss L^k computed on one batch only:

• (Randomly) pick initial values θ^0
• Compute gradient g = ∇L^1(θ^0) on batch 1;   update θ^1 ← θ^0 − η g
• Compute gradient g = ∇L^2(θ^1) on batch 2;   update θ^2 ← θ^1 − η g
• Compute gradient g = ∇L^3(θ^2) on batch 3;   update θ^3 ← θ^2 − η g
  …

    1 epoch = see all the batches once
Optimization of New Model
    Example 1
    • 10,000 examples (N = 10,000)
    • Batch size is 10 (B = 10)
    How many updates in 1 epoch?   10,000 / 10 = 1,000 updates

    Example 2
    • 1,000 examples (N = 1,000)
    • Batch size is 100 (B = 100)
    How many updates in 1 epoch?   1,000 / 100 = 10 updates
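A runnable sketch (my own, reusing the one-feature linear model and made-up data) that shows the batch/epoch bookkeeping described above:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(3.0, 10.0, size=1000)                      # N = 1,000 toy examples
y_hat = 0.97 * x1 + 0.1 + rng.normal(0.0, 0.3, size=1000)   # toy labels

w, b, eta, B = 0.0, 0.0, 0.01, 100                          # batch size B = 100

for epoch in range(5):
    order = rng.permutation(len(x1))                        # shuffle, then walk through the batches
    for start in range(0, len(x1), B):                      # 1,000 / 100 = 10 updates per epoch
        idx = order[start:start + B]
        xb, yb = x1[idx], y_hat[idx]
        y = b + w * xb                                      # prediction on this batch only
        g_w = 2 * np.mean((y - yb) * xb)                    # gradient of the batch loss L^k
        g_b = 2 * np.mean(y - yb)
        w, b = w - eta * g_w, b - eta * g_b                 # one update per batch
# 1 epoch = all batches seen once
```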
Back to ML Framework

    Step 1: function with unknown parameters → Step 2: define loss from training data → Step 3: optimization

        y = b + c^T σ(b + W x)

More variety of models …

Sigmoid → ReLU
    How do we represent a hard sigmoid exactly? As the sum of two Rectified Linear Units (ReLU):

        c  · max(0, b  + w  x_1)
      + c′ · max(0, b′ + w′ x_1)

    Each ReLU is flat at 0 and then rises linearly; adding two of them gives a hard sigmoid.
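A tiny numpy check (my own, with one particular choice of c, b, w, c′, b′, w′) that two ReLUs add up to a hard sigmoid:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def hard_sigmoid(x1):
    """Rises linearly from 0 to 1 between x1 = 0 and x1 = 1."""
    # c=1, b=0, w=1  and  c'=-1, b'=-1, w'=1
    return 1.0 * relu(0.0 + 1.0 * x1) + (-1.0) * relu(-1.0 + 1.0 * x1)

x1 = np.linspace(-2.0, 3.0, 11)
print(hard_sigmoid(x1))   # 0 below 0, the identity in between, 1 above 1
```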
Sigmoid → ReLU

    y = b + Σ_i c_i sigmoid(b_i + Σ_j w_ij x_j)
            ↓  (replace the activation function)
    y = b + Σ_{2i} c_i max(0, b_i + Σ_j w_ij x_j)     (2i terms: each hard sigmoid needs two ReLUs)

    sigmoid and ReLU are called activation functions. Which one is better?


Experimental Results

    y = b + Σ_{2i} c_i max(0, b_i + Σ_j w_ij x_j)

                   linear    10 ReLU    100 ReLU    1000 ReLU
    2017 – 2020    0.32k     0.32k      0.28k       0.27k
    2021           0.46k     0.45k      0.43k       0.43k
Back to ML Framework

    Step 1: function with unknown parameters → Step 2: define loss from training data → Step 3: optimization

        y = b + c^T σ(b + W x)

Even more variety of models …

    Apply the same kind of transformation again: feed the outputs a of the first layer into
    another set of weights and biases, with either sigmoid or ReLU in between:

        a  = σ(b  + W  x)
        a′ = σ(b′ + W′ a)
        …and so on, layer after layer.
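A sketch (my own, with arbitrary sizes and random parameters) of stacking these transformations:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def deep_forward(x, layers, c, b_out):
    """a = act(b + W x), then a' = act(b' + W' a), ..., finally y = b_out + c^T a."""
    a = x
    for W, b_vec in layers:
        a = relu(b_vec + W @ a)          # sigmoid could be used here instead of ReLU
    return b_out + c @ a

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(100, 56)), rng.normal(size=100)),    # layer 1: 56 features -> 100 units
          (rng.normal(size=(100, 100)), rng.normal(size=100))]   # layer 2: 100 -> 100 units
y = deep_forward(rng.normal(size=56), layers, rng.normal(size=100), b_out=0.1)
```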
Experimental Results
• Loss for multiple hidden layers
• 100 ReLU for each layer
• Input features are the no. of views in the past 56 days

                   1 layer   2 layers   3 layers   4 layers
    2017 – 2020    0.28k     0.18k      0.14k      0.10k
    2021           0.43k     0.39k      0.38k      0.44k

    [Plot for the 3-layer model: views (k), 2021/01/01 – 2021/02/14 — red: real no. of views; blue: estimated no. of views.]
Back to ML Framework

    Step 1: function with unknown parameters → Step 2: define loss from training data → Step 3: optimization

        y = b + c^T σ(b + W x)

It is not fancy enough. Let's give it a fancy name!

    Each sigmoid or ReLU in the network is called a Neuron; many connected neurons form a
    Neural Network. (This mimics human brains … (???))
    The layers of neurons between the input and the output are called hidden layers.
    Many hidden layers means the network is Deep → Deep Learning


Deep = Many hidden layers
    AlexNet (2012):    8 layers,  16.4% error
    VGG (2014):       19 layers,   7.3% error
    GoogleNet (2014): 22 layers,   6.7% error
    (Source: http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf)

Deep = Many hidden layers

    Residual Net (2015): 152 layers, with a special structure — taller than Taipei 101 (101 floors).

    Error rates: AlexNet (2012) 16.4% → VGG (2014) 7.3% → GoogleNet (2014) 6.7% → Residual Net (2015) 3.57%

    Why do we want a "Deep" network, not a "Fat" network?
Why don’t we go deeper?
• Loss for multiple hidden layers
• 100 ReLU for each layer
• input features are the no. of views in the past 56
days
1 layer 2 layer 3 layer 4 layer
2017 – 2020 0.28k 0.18k 0.14k 0.10k
2021 0.43k 0.39k 0.38k 0.44k
Why don’t we go deeper?
• Loss for multiple hidden layers
• 100 ReLU for each layer
• input features are the no. of views in the past 56
days
1 layer 2 layer 3 layer 4 layer
2017 – 2020 0.28k 0.18k 0.14k 0.10k
2021 0.43k 0.39k 0.38k 0.44k

Better on training data, worse on unseen data


Overfitting
Let’s predict no. of views today!
• If we want to select a model for predicting no. of
views today, which one will you use?
1 layer 2 layer 3 layer 4 layer
2017 – 2020 0.28k 0.18k 0.14k 0.10k
2021 0.43k 0.39k 0.38k 0.44k

We will talk about model selection next time. J


To learn more ……

    Basic Introduction: https://youtu.be/Dr-WRlEFefw
    Backpropagation (computing gradients in an efficient way): https://youtu.be/ibJpTrp5mcE