ECE586BH Lecture 1
Machine Learning
Bin Hu
ECE, University of Illinois Urbana-Champaign
Artificial Intelligence Revolution
Safety-critical applications!
Flight Control Certification
Ref: J. Renfrow, S. Liebler, and J. Denham. “F-14 Flight Control Law Design,
Verification, and Validation Using Computer Aided Engineering Tools,” 1996.
Feedback Control vs. Machine Learning
Example: Robustness is crucial!
• Deep learning: Small adversarial perturbations can fool the classifier!
Control for Learning
Control theory addresses unified analysis and design of dynamical systems.
Typical matrix inequality conditions:
$$A^T P A - P \prec 0, \qquad \sum_{i=1}^{n} p_{ij} A_i^T P_i A_i \prec P_j, \qquad \begin{bmatrix} A^T P A - P & A^T P B \\ B^T P A & B^T P B \end{bmatrix} \preceq M$$
Learning for control: Tailoring nonconvex learning theory to push robust control
theory beyond the convex regime
Outline
Robust Control Theory
[Block diagram: feedback interconnection of the nominal system P and the perturbation ∆, with interconnection signals v and w]
Theorem
If there exists a positive definite matrix $P$ and $0 < \rho < 1$ s.t.
$$\begin{bmatrix} A^T P A - \rho^2 P & A^T P B \\ B^T P A & B^T P B \end{bmatrix} \preceq \begin{bmatrix} C & 0 \\ 0 & I \end{bmatrix}^T M \begin{bmatrix} C & 0 \\ 0 & I \end{bmatrix},$$
then $\xi_{k+1}^T P \xi_{k+1} \le \rho^2 \xi_k^T P \xi_k$ and $\lim_{k\to\infty} \xi_k = 0$.

Proof sketch: multiplying the matrix inequality from both sides by $\begin{bmatrix} \xi_k \\ w_k \end{bmatrix}$ gives
$$\underbrace{\begin{bmatrix} \xi_k \\ w_k \end{bmatrix}^T \begin{bmatrix} A^T P A - \rho^2 P & A^T P B \\ B^T P A & B^T P B \end{bmatrix} \begin{bmatrix} \xi_k \\ w_k \end{bmatrix}}_{\xi_{k+1}^T P \xi_{k+1} \, - \, \rho^2 \xi_k^T P \xi_k} \le \begin{bmatrix} \xi_k \\ w_k \end{bmatrix}^T \begin{bmatrix} C & 0 \\ 0 & I \end{bmatrix}^T M \begin{bmatrix} C & 0 \\ 0 & I \end{bmatrix} \begin{bmatrix} \xi_k \\ w_k \end{bmatrix} = \begin{bmatrix} v_k \\ w_k \end{bmatrix}^T M \begin{bmatrix} v_k \\ w_k \end{bmatrix} \le 0$$
This condition is a semidefinite program (SDP) problem!
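As a concrete aid (this sketch is my addition, not part of the original slides), the condition can be checked numerically for a candidate pair (P, ρ) once A, B, C, and M are specified; the helper below only verifies a given candidate rather than searching for one.

```python
# Minimal sketch (added for illustration): verify the matrix inequality
#   [[A'PA - rho^2 P, A'PB], [B'PA, B'PB]]  <=  [C 0; 0 I]' M [C 0; 0 I]
# for a given candidate (P, rho).
import numpy as np

def lmi_residual(A, B, C, M, P, rho):
    """Largest eigenvalue of LHS - RHS; a nonpositive value certifies the condition."""
    m = B.shape[1]
    lhs = np.block([[A.T @ P @ A - rho**2 * P, A.T @ P @ B],
                    [B.T @ P @ A,              B.T @ P @ B]])
    E = np.block([[C, np.zeros((C.shape[0], m))],
                  [np.zeros((m, C.shape[1])), np.eye(m)]])
    diff = lhs - E.T @ M @ E
    return np.max(np.linalg.eigvalsh((diff + diff.T) / 2))
```

In practice one would let an SDP solver search over P (and bisect over ρ); the helper above is just the feasibility check behind that search.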
Illustrative Example: Gradient Descent Method
• Rewrite the gradient method $x_{k+1} = x_k - \alpha \nabla f(x_k)$ as
$$\underbrace{(x_{k+1} - x^\star)}_{\xi_{k+1}} = \underbrace{(x_k - x^\star)}_{\xi_k} - \alpha \underbrace{\nabla f(x_k)}_{w_k}$$
• This leads to
$$\left( \begin{bmatrix} (1-\rho^2)p & -\alpha p \\ -\alpha p & \alpha^2 p \end{bmatrix} + \begin{bmatrix} -2mL & m+L \\ m+L & -2 \end{bmatrix} \right) \otimes I \preceq 0$$
• Choose $(\alpha, \rho, p)$ to be $(\tfrac{1}{L},\, 1-\tfrac{m}{L},\, L^2)$ or $(\tfrac{2}{L+m},\, \tfrac{L-m}{L+m},\, \tfrac{1}{2}(L+m)^2)$ to recover standard rates, i.e. $\|x_{k+1} - x^\star\| \le (1 - m/L)\|x_k - x^\star\|$ (a numerical check of both choices is sketched below)
• For this proof, is strong convexity really needed? No! A regularity condition suffices!
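A quick numerical sanity check (my addition; m and L below are arbitrary test values, not from the slides) evaluates the 2×2 condition above for both parameter choices:

```python
# Check the 2x2 condition for the two (alpha, rho, p) choices; both maximal
# eigenvalues should be <= 0 (up to round-off).
import numpy as np

m, L = 1.0, 10.0
for alpha, rho, p in [(1/L, 1 - m/L, L**2),
                      (2/(L + m), (L - m)/(L + m), 0.5*(L + m)**2)]:
    lmi = (np.array([[(1 - rho**2)*p, -alpha*p],
                     [-alpha*p,       alpha**2*p]])
           + np.array([[-2*m*L,  m + L],
                       [ m + L, -2.0  ]]))
    print(f"alpha={alpha:.3f}, rho={rho:.3f}, max eig = {np.linalg.eigvalsh(lmi).max():.2e}")
```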
Illustrative Example: Gradient Descent Method
• We have shown $\|x_{k+1} - x^\star\| \le (1 - m/L)\|x_k - x^\star\|$
• Is it a contraction, i.e. $\|x_{k+1} - x'_{k+1}\| \le (1 - m/L)\|x_k - x'_k\|$?
• $\underbrace{(x_{k+1} - x'_{k+1})}_{\xi_{k+1}} = \underbrace{(x_k - x'_k)}_{\xi_k} - \alpha \underbrace{(\nabla f(x_k) - \nabla f(x'_k))}_{w_k}$
• Choose $(\alpha, \rho, p)$ to be $(\tfrac{1}{L},\, 1-\tfrac{m}{L},\, L^2)$ or $(\tfrac{2}{L+m},\, \tfrac{L-m}{L+m},\, \tfrac{1}{2}(L+m)^2)$ to give the contraction result (a simulation check is sketched below)
• For this proof, is strong convexity really needed? Yes!
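A small simulation sketch (my addition) illustrates the contraction claim: run gradient descent from two starting points on a strongly convex quadratic and confirm the per-step factor 1 − m/L. The Hessian and step size below are arbitrary choices.

```python
# Gradient descent from two starting points on f(x) = 0.5 x'Hx; check
# ||x_{k+1} - x'_{k+1}|| <= (1 - m/L) ||x_k - x'_k|| at every step.
import numpy as np

rng = np.random.default_rng(0)
H = np.diag([1.0, 4.0, 10.0])             # Hessian: m = 1, L = 10
m, L = 1.0, 10.0
alpha = 1.0 / L
grad = lambda z: H @ z                     # minimizer is x* = 0
x, xp = rng.standard_normal(3), rng.standard_normal(3)
for k in range(20):
    xn, xpn = x - alpha * grad(x), xp - alpha * grad(xp)
    assert np.linalg.norm(xn - xpn) <= (1 - m/L) * np.linalg.norm(x - xp) + 1e-12
    x, xp = xn, xpn
print("contraction factor 1 - m/L =", 1 - m/L, "verified for 20 steps")
```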
Outline
Deep Learning for Classification
Deep learning has revolutionized the fields of AI and computer vision!
• Input space $X \subset \mathbb{R}^d$, label space $Y := \{1, \ldots, H\}$
• Predict labels from image pixels
• Neural network classifier function $f := (f_1, \ldots, f_H) : X \to \mathbb{R}^H$ such that the predicted label for an input $x$ is $\arg\max_j f_j(x)$
• An input-label pair $(x, y)$ is correctly classified if $\arg\max_j f_j(x) = y$
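As a tiny illustration (my addition; the logits and label below are made-up numbers), the prediction rule and correctness check look like this in code:

```python
# Predicted label = arg max_j f_j(x); the pair (x, y) is correct if the prediction equals y.
import numpy as np

logits = np.array([0.2, 1.7, -0.3])    # placeholder values of f(x) for H = 3 classes
y = 1                                   # placeholder label (0-indexed here)
pred = int(np.argmax(logits))           # prediction rule: arg max over logits
print(pred, pred == y)                  # correctly classified if pred == y
```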
Deep Learning Models
• For a correctly classified input (i.e. $\arg\max_j f_j(x) = y$), one may find $\|\tau\| \le \varepsilon$ s.t. $\arg\max_j f_j(x + \tau) \ne y$ (a small perturbation leads to a wrong prediction)
• Small perturbation can fool modern deep learning models!
• How to deploy deep learning models into safety-critical applications?
• Certified robustness: A classifier $f$ is certifiably robust at radius $\varepsilon \ge 0$ at point $x$ with label $y$ if for all $\tau$ such that $\|\tau\| \le \varepsilon$: $\arg\max_j f_j(x + \tau) = y$
1-Lipschitz Networks for Certified Robustness
• Tsuzuku, Sato, Sugiyama (NeurIPS 2018): Let $f$ be $L$-Lipschitz. If the margin satisfies
$$M_f(x) := \max\Big(0,\; f_y(x) - \max_{y' \ne y} f_{y'}(x)\Big) > \sqrt{2}\,L\varepsilon,$$
then $f$ is certifiably robust at radius $\varepsilon$ at the point $x$ (a certified-radius sketch is given below).
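A minimal sketch (my addition) of the resulting certificate: given the logits of an L-Lipschitz classifier, any radius below $M_f(x)/(\sqrt{2}L)$ is certified; the logits and Lipschitz constant below are placeholders.

```python
# Margin-based certified radius from the bound above.
import numpy as np

def certified_radius(logits, y, lip):
    """Any radius below M_f(x) / (sqrt(2) * lip) is certified for a lip-Lipschitz f."""
    logits = np.asarray(logits, dtype=float)
    margin = max(0.0, logits[y] - np.delete(logits, y).max())   # M_f(x)
    return margin / (np.sqrt(2.0) * lip)

print(certified_radius([2.3, -0.4, 0.9], y=0, lip=1.0))   # placeholder logits, 1-Lipschitz f
```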
The second statement can be proved using the quadratic constraint argument.
• Then we can ensure $\|\xi_{k+1}\| \le \|\xi_k\|$ by enforcing an SDP condition of the form:
$$\begin{bmatrix} A_k^T P A_k - P & A_k^T P B_k \\ B_k^T P A_k & B_k^T P B_k \end{bmatrix} \preceq M \;\;\overset{P = I}{\Longrightarrow}\;\; \underbrace{\begin{bmatrix} \xi_k \\ w_k \end{bmatrix}^T \begin{bmatrix} A_k^T A_k - I & A_k^T B_k \\ B_k^T A_k & B_k^T B_k \end{bmatrix} \begin{bmatrix} \xi_k \\ w_k \end{bmatrix}}_{\|\xi_{k+1}\|^2 - \|\xi_k\|^2 \, = \, \|x_{k+1} - x'_{k+1}\|^2 - \|x_k - x'_k\|^2} \le 0$$
Quadratic Constraints for Lipschitz Networks
• Since $\sigma$ is slope-restricted on $[0, 1]$, the following scalar-version incremental quadratic constraint holds with $m = 0$ and $L = 1$:
$$\begin{bmatrix} a - a' \\ \sigma(a) - \sigma(a') \end{bmatrix}^T \underbrace{\begin{bmatrix} 2mL & -(m+L) \\ -(m+L) & 2 \end{bmatrix}}_{=\begin{bmatrix} 0 & -1 \\ -1 & 2 \end{bmatrix}} \begin{bmatrix} a - a' \\ \sigma(a) - \sigma(a') \end{bmatrix} \le 0$$
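A numerical sanity check (my addition) of this constraint for ReLU, which is slope-restricted on [0, 1]; the samples are random test points.

```python
# Evaluate the incremental quadratic constraint for ReLU at random input pairs;
# every value should be <= 0.
import numpy as np

rng = np.random.default_rng(1)
M = np.array([[0.0, -1.0], [-1.0, 2.0]])             # matrix under the brace (m = 0, L = 1)
relu = lambda t: np.maximum(t, 0.0)
a, ap = rng.standard_normal(1000), rng.standard_normal(1000)
z = np.stack([a - ap, relu(a) - relu(ap)])           # stacked pairs, shape (2, 1000)
vals = np.einsum('in,ij,jn->n', z, M, z)             # quadratic form for each pair
print(vals.max())                                    # <= 0 since ReLU is slope-restricted
```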
Theorem
If there exists a diagonal $\Gamma_k \succeq 0$ such that
$$\begin{bmatrix} 0 & -G_k \\ -G_k^T & G_k^T G_k \end{bmatrix} \preceq \begin{bmatrix} 0 & -W_k \Gamma_k \\ -\Gamma_k W_k^T & 2\Gamma_k \end{bmatrix},$$
then $\|\xi_{k+1}\| \le \|\xi_k\|$, i.e. the layer is 1-Lipschitz.
History: Computer-Assisted Proofs in Optimization
In the past ten years, much progress has been made in leveraging SDPs to assist
the convergence rate analysis of optimization methods.
• Drori and Teboulle (MP2014): numerical worst-case bounds via the
performance estimation problem (PEP) formulation
• Lessard, Recht, Packard (SIOPT2016): numerical linear rate bounds using
integral quadratic constraints (IQCs) from robust control theory
• Taylor, Hendrickx, Glineur (MP2017): interpolation conditions for PEPs
• H., Lessard (ICML2017): first SDP-based analytical proof for Nesterov’s
accelerated rate
• H., Seiler, Rantzer (COLT2017): first paper on SDP-based convergence
proofs for stochastic optimization using jump system theory and IQCs
• Van Scoy, Freeman, and Lynch (LCSS2017): first paper on control-oriented
design of accelerated methods: triple momentum
Taken further by different groups
• inexact gradient methods, proximal gradient methods, conditional gradient
methods, operator splitting methods, mirror descent methods, distributed
gradient methods, monotone inclusion problems
Stochastic Methods for Machine Learning
• Many learning tasks (regression/classification) lead to finite-sum ERM
$$\min_{x \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^{n} f_i(x)$$
where $f_i(x) = l_i(x) + \lambda R(x)$ ($l_i$ is the loss, and $R$ avoids over-fitting).
• Stochastic gradient descent (SGD): $x_{k+1} = x_k - \alpha \nabla f_{i_k}(x_k)$
• Inexact oracle: $x_{k+1} = x_k - \alpha(\nabla f_{i_k}(x_k) + e_k)$ where $\|e_k\| \le \delta \|\nabla f_{i_k}(x_k)\|$ (the angle $\theta$ between $e_k + \nabla f_{i_k}(x_k)$ and $\nabla f_{i_k}(x_k)$ satisfies $|\sin\theta| \le \delta$)
• Algorithm change: SAG (SRF2017) vs. SAGA (DBL2014)
$$\text{SAG:}\quad x^{k+1} = x^k - \alpha\left(\frac{\nabla f_{i_k}(x^k) - y_{i_k}^k}{n} + \frac{1}{n}\sum_{i=1}^{n} y_i^k\right)$$
$$\text{SAGA:}\quad x^{k+1} = x^k - \alpha\left(\nabla f_{i_k}(x^k) - y_{i_k}^k + \frac{1}{n}\sum_{i=1}^{n} y_i^k\right)$$
where $y_i^{k+1} := \begin{cases} \nabla f_i(x^k) & \text{if } i = i_k \\ y_i^k & \text{otherwise} \end{cases}$ (a SAGA code sketch follows this list)
• Markov assumption: In reinforcement learning, $\{i_k\}$ can be Markovian
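Here is a plain-Python sketch (my addition) of the SAGA update above on a toy least-squares problem $f_i(x) = \tfrac{1}{2}(a_i^T x - b_i)^2$; the data, step size rule, and iteration count are arbitrary illustrative choices.

```python
# SAGA on a synthetic least-squares finite sum.
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 5
A_data, b = rng.standard_normal((n, p)), rng.standard_normal(n)
grad_i = lambda i, x: A_data[i] * (A_data[i] @ x - b[i])   # gradient of f_i

L_max = np.max(np.sum(A_data**2, axis=1))                  # each f_i is ||a_i||^2-smooth
alpha = 1.0 / (3.0 * L_max)                                # a common SAGA step size choice
x = np.zeros(p)
y = np.array([grad_i(i, x) for i in range(n)])             # stored gradients y_i^0
for k in range(3000):
    ik = int(rng.integers(n))
    g = grad_i(ik, x)
    x = x - alpha * (g - y[ik] + y.mean(axis=0))           # SAGA step
    y[ik] = g                                              # refresh only the ik-th slot
print(np.linalg.norm(A_data.T @ (A_data @ x - b)) / n)     # full-gradient norm at the iterate
```

Note how only the $i_k$-th stored gradient is refreshed each step, which is exactly the $y$-update in the definition above.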
My Focus: Unified Analysis of Stochastic Methods
Assumption
• $f_i$ smooth, $f$ satisfies a restricted secant inequality (RSI)
• $i_k$ is IID or Markovian
• Oracle is exact or inexact
• many other possibilities

Method
• SGD
• SAGA-like methods
• Temporal difference learning

Bound
• $\mathbb{E}\|x_k - x^\star\|^2 \le c^2 \rho^k + O(\alpha)$
• $\mathbb{E}\|x_k - x^\star\|^2 \le c^2 \rho^k$
• Other forms

How to automate the rate analysis of stochastic learning algorithms? Use numerical semidefinite programs to support the search for analytical proofs?
Theorem
If there exist a positive definite matrix $P$, non-negative $\lambda_j$, and $0 < \rho < 1$ s.t.
$$\begin{bmatrix} A^T P A - \rho^2 P & A^T P B \\ B^T P A & B^T P B \end{bmatrix} \preceq \sum_{j \in \Pi} \lambda_j \begin{bmatrix} C & 0 \\ 0 & I \end{bmatrix}^T M_j \begin{bmatrix} C & 0 \\ 0 & I \end{bmatrix},$$
then $\mathbb{E}\,\xi_{k+1}^T P \xi_{k+1} \le \rho^2\, \mathbb{E}\,\xi_k^T P \xi_k + \sum_{j \in \Pi} \lambda_j \Lambda_j$.

Proof sketch:
$$\underbrace{\begin{bmatrix} \xi_k \\ w_k \end{bmatrix}^T \begin{bmatrix} A^T P A - \rho^2 P & A^T P B \\ B^T P A & B^T P B \end{bmatrix} \begin{bmatrix} \xi_k \\ w_k \end{bmatrix}}_{\xi_{k+1}^T P \xi_{k+1} \, - \, \rho^2 \xi_k^T P \xi_k} \le \sum_{j \in \Pi} \lambda_j \begin{bmatrix} v_k \\ w_k \end{bmatrix}^T M_j \begin{bmatrix} v_k \\ w_k \end{bmatrix}$$
• 2nd QC:
$$\mathbb{E} \begin{bmatrix} x_k - x^\star \\ \nabla f_{i_k}(x_k) \\ e_k \end{bmatrix}^T \underbrace{\begin{bmatrix} -2L^2 I & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & 0 \end{bmatrix}}_{M_2} \begin{bmatrix} x_k - x^\star \\ \nabla f_{i_k}(x_k) \\ e_k \end{bmatrix} \le \underbrace{\frac{2}{n} \sum_{i=1}^{n} \|\nabla f_i(x^\star)\|^2}_{\Lambda_2}$$
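A Monte Carlo illustration (my addition) of this constraint on a toy finite sum, with L taken as the largest per-component smoothness constant; the data are synthetic placeholders.

```python
# Check E||grad f_ik(x)||^2 <= 2 L^2 ||x - x*||^2 + (2/n) sum_i ||grad f_i(x*)||^2
# on a synthetic least-squares finite sum.
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 4
A_data, b = rng.standard_normal((n, p)), rng.standard_normal(n)
grads = lambda x: A_data * (A_data @ x - b)[:, None]       # row i is grad f_i(x)
L = np.max(np.sum(A_data**2, axis=1))                      # a valid smoothness constant
x_star = np.linalg.lstsq(A_data, b, rcond=None)[0]         # minimizer of the average loss

for _ in range(5):
    x = x_star + rng.standard_normal(p)
    lhs = np.mean(np.sum(grads(x)**2, axis=1))             # E over uniformly sampled i_k
    rhs = 2*L**2*np.sum((x - x_star)**2) + 2*np.mean(np.sum(grads(x_star)**2, axis=1))
    assert lhs <= rhs + 1e-9
print("2nd quadratic constraint holds on the sampled points")
```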
Main Result: Analysis of Biased SGD
• We can rewrite $\|e_k\|^2 \le \delta^2 \|\nabla f_{i_k}(x_k)\|^2 + c^2$ as
$$\mathbb{E} \begin{bmatrix} x_k - x^\star \\ \nabla f_{i_k}(x_k) \\ e_k \end{bmatrix}^T \underbrace{\begin{bmatrix} 0 & 0 & 0 \\ 0 & -\delta^2 I & 0 \\ 0 & 0 & I \end{bmatrix}}_{M_3} \begin{bmatrix} x_k - x^\star \\ \nabla f_{i_k}(x_k) \\ e_k \end{bmatrix} \le \underbrace{c^2}_{\Lambda_3}$$
• We have $A = I$, $B = \begin{bmatrix} -\alpha I & -\alpha I \end{bmatrix}$, $C = I$, and the following SDP:
$$\begin{bmatrix} A^T P A - \rho^2 P & A^T P B \\ B^T P A & B^T P B \end{bmatrix} \preceq \sum_{j=1}^{3} \lambda_j \begin{bmatrix} C & 0 \\ 0 & I \end{bmatrix}^T M_j \begin{bmatrix} C & 0 \\ 0 & I \end{bmatrix}$$
Pros:
• General enough to handle many algorithms: H., Seiler, Rantzer (COLT2017)
Method: SAGA
$$\tilde{A}_{i_k} = \begin{bmatrix} I_n - e_{i_k} e_{i_k}^T & \tilde{0} \\ -\frac{\alpha}{n}(e - n e_{i_k})^T & 1 \end{bmatrix}, \qquad \tilde{B}_{i_k} = \begin{bmatrix} e_{i_k} e_{i_k}^T \\ -\alpha e_{i_k}^T \end{bmatrix}, \qquad \tilde{C} = \begin{bmatrix} \tilde{0}^T & 1 \end{bmatrix}$$
Method: SAG
$$\tilde{A}_{i_k} = \begin{bmatrix} I_n - e_{i_k} e_{i_k}^T & \tilde{0} \\ -\frac{\alpha}{n}(e - e_{i_k})^T & 1 \end{bmatrix}, \qquad \tilde{B}_{i_k} = \begin{bmatrix} e_{i_k} e_{i_k}^T \\ -\frac{\alpha}{n} e_{i_k}^T \end{bmatrix}, \qquad \tilde{C} = \begin{bmatrix} \tilde{0}^T & 1 \end{bmatrix}$$
Outline