ECE586BH Lecture1

ECE586BH: Interplay between Control and
Machine Learning
Bin Hu
ECE , University of Illinois Urbana-Champaign
Lecture 1, Fall 2023

Feedback Control Machine learning
• dynamical systems • statistics/optimization

• robustness • large-scale (big data)
• safety-critical • performance-driven
• model-based design • train using data
• CDC/ACC/ECC • NeurIPS/ICML/ICLR
2
Artificial Intelligence Revolution
Safety-critical applications!
3
Flight Control Certification
Ref: J. Renfrow, S. Liebler, and J. Denham. “F-14 Flight Control Law Design,
Verification, and Validation Using Computer Aided Engineering Tools,” 1996.
4
Feedback Control Machine learning
Unified and automated tools for a repeatable and trustable

design process of next generation intelligent systems
5
Example: Robustness is crucial!
• Deep learning: Small adversarial perturbations can fool the classifier!
• Optimization: The oracle can be inexact! xk+1 = xk − α(∇f (xk )+ek )

• Decision and control: Model uncertainty and sim-to-real gap matter!
6
Control for Learning
Control theory addresses unified analysis and design of dynamical systems.
LTI systems Markov jump systems Lur’e systems
ξk+1 = Aξk + Buk ξk+1 = Aik ξk + Bik uk ξk+1 = Aξk + Bφ(Cξk )
T
AT P B

Pn A PA − P
AT P A − P ≺ 0 T
i=1 pij Ai Pi Ai ≺ Pj ≺M
BTP A BTP B
Pros: Unified testing conditions when problem parameters are changed.

Cons: For control, we only need to solve the conditions numerically.
Control for learning: Algorithms and networks treated as control systems
• Neural networks as generalized Lur’e systems
• Stochastic learning algorithms as generalized Lur’e systems
Key message: Robustness can be addressed in a unified manner!

7
Learning for Control
Control theory addresses unified analysis and design of dynamical systems.
LTI systems MJLS Lur’e systems
ξk+1 = Aξk + Buk ξk+1 = Aik ξk + Bik uk ξk+1 = Aξk + Bφ(Cξk )
T
AT P B

T
Pn T A PA − P
A PA − P ≺ 0 i=1 pij Ai Pi Ai ≺ Pj ≺M
BTP A BTP B
Many control design methods rely on convex conditions (BGFB1994).

What about problems that cannot be formulated as convex optimization?
• Direct policy search (e.g. min J(K)) is nonconvex!
Learning for control: Tailoring nonconvex learning theory to push robust control
theory beyond the convex regime
8
Outline
• Control for Learning

• Control methods for certifiably robust neural networks
• A control perspective on stochastic learning algorithms
• Learning for Control

• Global convergence of direct policy search on robust control
9
Robust Control Theory
- ∆
v w
P
1. Approximate the true system as “a linear system + a perturbation”

2. ∆ can be a troublesome element: nonlinearity, uncertainty, or delays
3. Rich control literature including standard textbooks
• Zhou, Doyle, Glover, “ Robust and optimal control,” 1996
4. Many tools: small gain, passivity, dissipativity, Zames-Falb multipliers, etc
5. The integral quadratic constraint (IQC) framework [Megretski, Rantzer
(TAC1997)] provides a unified analysis for “ LTI P + troublesome ∆”
6. Recently, IQC analysis has been extended for more general P
7. Typically, the stability is tested by a SDP condition
10
Quadratic Constraints from Robust Control
• Lur’e system: ξk+1 = Aξk + B∆(Cξk ).
• EX: Gradient method (A = I, B = −αI, C = I, and ∆ = ∇f )
• Question: How to prove that the above Lur’e system converges? We are
looking at the following set of coupled sequences {ξk , wk , vk }
{(ξ, w, v) : ξk+1 = Aξk + Bwk , vk = Cξk } ∩ {(ξ, w, v) : wk = ∆(vk )}

• Key idea: Quadratic constraints! Replace the troublesome nonlinear
element ∆ with the following quadratic constraint:
( T )
vk vk
{(v, w) : wk = ∆(vk )} ⊂ (v, w) : M ≤0 ,
wk wk
where M is constructed from the property of ∆.
• If we can show that any sequence from the set below converges,
( T )
vk vk
(ξ, w, v) : ξk+1 = Aξk + Bwk , vk = Cξk , M ≤0 ,
wk wk
then we are done.
11
Quadratic Constraints from Robust Control
Now we are analyzing the sequence from the following set:
( T )
vk vk
(ξ, w, v) : ξk+1 = Aξk + Bwk , vk = Cξk , M ≤0
wk wk
Theorem
If there exists a positive definite matrix P and 0 < ρ < 1 s.t.
T T
A P A − ρ2 P AT P B

C 0 C 0
M
BTP A BTP B 0 I 0 I
T
then ξk+1 P ξk+1 ≤ ρ2 ξkT P ξk and limk→∞ ξk = 0.
T T T T

ξk ξk ξ C 0 C 0 ξk
≤ k M
wk BTP A BTP B wk wk 0 I 0 I wk
| {z } | {z }
T TP ξ T 
P ξk+1 −ρ2 ξk
 
ξk+1 k
v v
 k  M  k ≤0
wk wk
This condition is a semidefinite program (SDP) problem!
12
Illustrative Example: Gradient Descent Method
• Rewrite the gradient method xk+1 = xk − α∇f (xk ) as:
(xk+1 − x? ) = (xk − x? ) − α ∇f (xk )
| {z } | {z } | {z }
ξk+1 ξk wk
• If f is L-smooth and m-strongly convex, then by co-coercivity:

T
x − x? −(m + L)I x − x?

2mLI
≤0
∇f (x) −(m + L)I 2I ∇f (x)
| {z }
M
• We have A = I, B = −αI, C = I, and the following SDP

T T
A P A − ρ2 P A T P B

C 0 C 0
M
BTP A BTP B 0 I 0 I
(1 − ρ2 )p −αp

• This leads to −2mL m + L
+ ⊗I 0
−αp α2 p m+L −2
• Choose (α, ρ, p) to be ( L1 , 1 − m 2 2 L−m 1 2
L , L ) or ( L+m , L+m , 2 (L + m) ) to
? ?
recover standard rates, i.e. kxk+1 − x k ≤ (1 − m/L)kxk − x k
• For this proof, is strong convexity really needed? No! Regularity condition!
13
Illustrative Example: Gradient Descent Method
• We have shown kxk+1 − x? k ≤ (1 − m/L)kxk − x? k
• Is it a contraction, i.e. kxk+1 − x0k+1 k ≤ (1 − m/L)kxk − x0k k?
• (xk+1 − x0k+1 ) = (xk − x0k ) − α (∇f (xk ) − ∇f (x0k ))
| {z } | {z } | {z }
ξk+1 ξk wk
• If f is L-smooth and m-strongly convex, then by co-coercivity:

T
x − x0 x − x0

2mLI −(m + L)I
≤0
∇f (x) − ∇f (x0 ) −(m + L)I 2I ∇f (x) − ∇f (x0 )
| {z }
M
• We have A = I, B = −αI, C = I, and the same SDP

(1 − ρ2 )p −αp

−2mL m + L
+ ⊗I 0
−αp α2 p m+L −2
• Choose (α, ρ, p) to be ( L1 , 1 − m 2
L,L )
2
or ( L+m , L−m 1 2
L+m , 2 (L + m) ) to give
the contraction result!
• For this proof, is strong convexity really needed? Yes!
14
Outline

• Control methods for certifiably robust neural networks

15
Deep Learning for Classification
Deep learning has revolutionized the fields of AI and computer vision!
• Input space X ⊂ Rd to a label space Y := {1, . . . , H}.
• Predict labels from image pixels
• Neural network classifier function f := (f1 , . . . , fH ) : X → RH such that
the predicted label for an input x is arg maxj fj (x).
• Input-label (x, y) is correctly classified if arg maxj fj (x) = y.
16
Deep Learning Models
Deep learning models: f (x) = xD+1 and x0 = x

• Feedforward: xk+1 = σ(Wk xk + bk ) for k = 0, 1, · · · , D
• Residual network: xk+1 = xk − σ(Wk xk + bk ) for k = 0, 1, · · · , D
• Many other structures: transformers, etc
Deep learning models are expressive and generalize well, achieving state-of-the-art
results in computer vision and natural language processing. However, ...
17
Adversarial Attacks and Robustness
• For correct labels (i.e. arg maxj fj (x) = y), one may find kτ k ≤ ε s.t.
arg maxj fj (x + τ ) 6= y (small perturbation lead to wrong prediction)
• Small perturbation can fool modern deep learning models!
• How to deploy deep learning models into safety-critical applications?
• Certified robustness: A classifier f is certifiably robust at radius ε ≥ 0 at
point x with label y if for all τ such that kτ k ≤ ε : arg maxj fj (x + τ ) = y
18
1-Lipschitz Networks for Certified Robustness
• Tsuzuku, Sato, Sugiyama (NeurIPS2018): Let f be L-Lipschitz. If we have
√
Mf (x) := max(0, fy (x) − max
0
f y 0 (x)) > 2Lε
y 6=y
then we have for every τ such that kτ k2 ≤ ε: arg maxj fj (x + τ ) = y

√
• Perturbation smaller than Mf (x)/ 2L cannot deceive f for datapoint x!
• If each layer of a network is 1-Lipchitz, the entire network is 1-Lipschitz.
√
• For each data point, we test whether Mf (x) > 2ε, and then count the
percentage of data points that is guaranteed to be guarded for perturbation
smaller than ε (which is the certified accuracy for that ε).
• We need to train a Lipschitz neural network with good prediction margins!
Previous approaches:

• Spectral normalization (MKKY2018): xk+1 = σ 1
kWk k2 Wk xk + bk
• Orthogonality (TK2021, SF2021): xk+1 = σ(Wk xk + bk ) with WkT Wk = I
• Convex potential layer (MDAA2022): xk+1 = xk − kW2 k2 Wk σ(WkT x + bk )
k 2
1
−
• AOL (PL2022): xk+1 = σ(Wk diag( j |Wk Wk |ij ) 2 xk + bk )
T
P
19
My Focus: Principles for 1-Lipschitz Networks
Theorem (AHDAH2023)
If there exists nonsingular diagonal Tk s.t. WkT Wk Tk , then we have
−1
1. The layer xk+1 = σ(Wk Tk 2 xk + bk ) is 1-Lipschitz for any 1-Lipschitz σ.
2. The layer xk+1 = xk − 2Wk Tk−1 σ(WkT x + bk ) is 1-Lipschitz if σ is ReLU.
−1 −1 −1
kxk+1 − x0k+1 k2 ≤ kWk Tk 2 (xk − x0k )k2 = (xk − x0k )T Tk 2 WkT Wk Tk 2 (xk − x0k )
| {z }
≤kxk −x0k k2
The second statement can be proved using the quadratic constraint argument.
A Unification of Existing 1-Lipschitz Neural Networks

• Spectral normalization: Statement 1 with Tk = kWk k22 I
• Orthogonal weights: Statement 1 with Tk = I and WkT Wk = I
• CPL: Statement 2 with Tk = kWk k22 I
• AOL: Statement 1 with Tk = diag( nj=1 |WkT Wk |ij )
P
• Control Theory (SLL): Tk = diag( nj=1 |WkT Wk |ij qj /qi ).

P
20
Experimental Results
4 versions of SDP-based Lipchitz Network (SLL) (S, M, L, XL)
Natural Provable Accuracy (ε)

Datasets Models
Accuracy 36 72 108
1
255 255 255
Cayley Large 43.3 29.2 18.8 11.0 -

SOC 20 48.3 34.4 22.7 14.2 -
SOC+ 20 47.8 34.8 23.7 15.8 -
CPL XL 47.8 33.4 20.9 12.6 -
AOL Large 43.7 33.7 26.3 20.7 7.8
CIFAR100
SLL Small 45.8 34.7 26.5 20.4 7.2
SLL Medium 46.5 35.6 27.3 21.1 7.7
SLL Large 46.9 36.2 27.9 21.6 7.9
SLL X-Large 47.6 36.5 28.2 21.8 8.2
• Competitive results over CIFAR100 and TinyImageNet

• Many extensions: Lipschitz deep equilibrium models, neural ODEs, etc
21
Quadratic Constraints for Lipschitz Networks
• Residual network: xk+1 = xk − Gk σ(WkT xk + bk ) for k = 0, 1, · · · , D.
• 1-Lipschitz layer: How to enforce kxk+1 − x0k+1 k ≤ kxk − x0k k?
• (xk+1 − x0k+1 ) = (xk − x0k ) −Gk σ(WkT xk + bk ) − σ(WkT x0k + bk )

| {z } | {z } | {z }
ξk+1 ξk wk
• We will use the property of σ to construct Mk such that we only need to

look at the following set with Ak = I and Bk = −Gk :
( T )
ξk ξk
(ξ, w) : ξk+1 = Ak ξk + Bk wk , Mk ≤0 ,
wk wk
• Then we can ensure kξk+1 k ≤ kξk k via enforcing a SDP for the set:
T T T
AT AT

Ak P Ak − P k P Bk M =⇒ ξk Ak Ak − I k Bk ξk
k |{z} ≤0
BkT P Ak BkT P Bk wk BkT Ak BkT Bk wk
P =I | {z }
kξk+1 k2 −kξk k2 =kxk+1 −x0k+1 k2 −kxk −x0k k2
22
• Since σ is slope-restricted on [0, 1], the following scalar-version incremental
quadratic constraint holds with m = 0 and L = 1:
T
a − a0 a − a0

2mL −(m + L)
≤0
σ(a) − σ(a0 ) −(m + L) 2 σ(a) − σ(a0 )
| {z }
 

0 −1 
−1 2
• The vector-version quadratic constraint: For diagonal Γk 0, we have

T
vk − vk0 vk − vk0

0 −Γk
≤0
σ(vk ) − σ(vk0 ) −Γk 2Γk σ(vk ) − σ(vk0 )
| {z }
Xk
• Choosing vk = WkT xk + bk and vk0 = WkT x0k + bk , we have

T
WkT (xk − x0k ) WkT (xk − x0k )

Xk ≤0
σ(WkT xk + bk ) − σ(WkT x0k + bk ) σ(WkT xk + bk ) − σ(WkT x0k + bk )
23
T
−Γk WkT

ξ ξk Wk 0 0 0
• We get k Mk ≤ 0 with Mk =
wk wk 0 I −Γk 2Γk 0 I
| {z }
 
0 −Wk Γk 
−Γk WkT

2Γk
Theorem
If there exists diagonal Γk 0 such that

0 −Gk 0 −Wk Γk

−GT k GTk Gk −Γk WkT 2Γk
then the residual network xk+1 = xk − Gk σ(WkT xk + bk ) is 1-Lipschitz.
• Analytical solution: Gk = Wk Γk and Γk WkT Wk Γk 2Γk .

• Suppose Γk is nonsingular, and Tk = 2Γ−1k . Then the residual network
xk+1 = xk − 2Wk Tk−1 σ(WkT xk + bk ) is 1-Lipschitz as long as Tk WkT Wk
• Ref: Araujo, Havens, Delattre, Allauzen, H.. A unifying algebraic
perspective on Lipschitz neural networks, ICLR, 2023. (Spotlight) 24
Outline

• Control Methods on Certifiably Robust Neural Networks
• A Control Perspective on Stochastic Learning Algorithms

25
History: Computer-Assisted Proofs in Optimization
In the past ten years, much progress has been made in leveraging SDPs to assist
the convergence rate analysis of optimization methods.
• Drori and Teboulle (MP2014): numerical worst-case bounds via the
performance estimation problem (PEP) formulation
• Lessard, Recht, Packard (SIOPT2016): numerical linear rate bounds using
integral quadratic constraints (IQCs) from robust control theory
• Taylor, Hendrickx, Glineur (MP2017): interpolation conditions for PEPs
• H., Lessard (ICML2017): first SDP-based analytical proof for Nesterov’s
accelerated rate
• H., Seiler, Ranzter (COLT2017): first paper on SDP-based convergence
proofs for stochastic optimization using jump system theory and IQCs
• Van Scoy, Freeman, and Lynch (LCSS2017): first paper on control-oriented
design of accelerated methods: triple momentum
Taken further by different groups
• inexact gradient methods, proximal gradient methods, conditional gradient
methods, operator splitting methods, mirror descent methods, distributed
gradient methods, monotone inclusion problems 26
Stochastic Methods for Machine Learning
• Many learning tasks (regression/classification) lead to finite-sum ERM
n
1X
minp fi (x)
x∈R n i=1
where fi (x) = li (x) + λR(x) (li is the loss, and R avoids over-fitting).
• Stochastic gradient descent (SGD): xk+1 = xk − α∇fik (xk )
• Inexact oracle: xk+1 = xk − α(∇fik (xk ) + ek ) where kek k ≤ δk∇fik (xk )k
(the angle θ between (ek + ∇fik (xk )) and ∇fik (xk ) satisfies | sin(θ)| ≤ δ)
• Algorithm change: SAG (SRF2017) vs. SAGA (DBL2014)
n
!
k+1 k ∇fik (xk ) − yikk 1X k
SAG: x =x −α + y
n n i=1 i
n
!
k+1 k k k 1X k
SAGA: x = x − α ∇fik (x ) − yik + y
n i=1 i
∇fi (xk ) if i = ik

k+1
where yi :=
yik otherwise
• Markov assumption: In reinforcement learning, {ik } can be Markovian 27
My Focus: Unified Analysis of Stochastic Methods
Assumption




• fi smooth, f RSI 





• ik is IID or Markovian 




 Bound

• Oracle is exact or inexact 



• many other possibilities  • Ekxk − x? k2 ≤ c2 ρk + O(α)



 • Ekxk − x? k2 ≤ c2 ρk

Method

• Other forms





• SGD 





• SAGA-like methods 





• Temporal difference learning
How to automate rate analysis of stochastic learning algorithms? Use
numerical semidefinite programs to support search for analytical proofs?
assumption + method =⇒ bound

28
My Focus: Stochastic Methods for Learning
In the deterministic setting, we just need to show that the trajectories generated
by optimization methods belong to the following set:
( T )
vk vk
(ξ, w, v) : ξk+1 = Aξk + Bwk , vk = Cξk , Mj ≤ Λj , j ∈ Π
wk wk
What to do for stochastic optimization (e.g. xk+1 = xk − α∇fik (xk ) where

ik ∈ {1, · · · , n} is sampled)?
• Stochastic quadratic constraints: Show that the trajectories generated by
stochastic optimization methods belong to the following set:
( T )
vk vk
(ξ, w, v) : ξk+1 = Aξk + Bwk , vk = Cξk , E Mj ≤ Λj , j ∈ Π
wk wk
• Jump system approach: Show that the trajectories generated by stochastic

optimization methods belong to the following set:
( T )
vk vk
(ξ, w, v) : ξk+1 = Aik ξk + Bik wk , vk = Cik ξk , Mj ≤ Λj , j ∈ Π
wk wk
where Aik ∈ {A1 , · · · , An }, Bik ∈ {B1 , · · · , Bn }, and Cik ∈ {C1 , · · · , Cn }

29
Stochastic Quadratic Constraints
Suppose we can show that the trajectories generated by stochastic optimization
methods belong to the following set:
( T )
vk vk
(ξ, w, v) : ξk+1 = Aξk + Bwk , vk = Cξk , E Mj ≤ Λj , j ∈ Π
wk wk
Theorem
If there exists a positive definite matrix P , non-negative λj and 0 < ρ < 1 s.t.
T T
X
C 0 C 0
λj Mj
BTP A BTP B 0 I 0 I
j∈Π
T
P ξk+1 ≤ ρ2 EξkT P ξk +
P
then Eξk+1 j∈Π λj Λj .
T X T
AT P A − ρ 2 P AT P B

ξk ξk vk v
≤ λj Mj k
wk BTP A BTP B wk wk wk
| {z } j∈Π
T
ξk+1 TP ξ
P ξk+1 −ρ2 ξk k
Then take expectation and apply the expected quadratic constraints!

30
Main Result: Analysis of Biased SGD
• Consider xk+1 = xk − α(∇fik (xk ) + ek ) with kek k2 ≤ δ 2 k∇fik (xk )k2 + c2
• If c = 0, the bound means the angle θ between (ek + ∇fik (xk )) and
∇fik (xk ) satisfies | sin(θ)| ≤ δ

• Rewritten as (xk+1 − x? ) = (xk − x? ) + −αI −αI ∇fik (xk )

| {z } | {z } ek
ξk+1 ξk | {z }
wk
• Assume the restricted secant inequality ∇f (x) (x − x ) ≥ mkx − x? k2

T ?
• Assume fi is L-smooth, i.e. k∇fi (x) − ∇fi (x? )k ≤ Lkx − x? k

T 
xk − x? xk − x?
  
2mI −I 0
• 1st QC: E ∇fik (xk )  −I 0 0 ∇fik (xk ) ≤ |{z}0
ek 0 0 0 ek Λ1
| {z }
M1
?
T 
−2L2 I 0 xk − x?
  
xk − x 0 n
2X
• 2nd QC: E ∇fik (xk )  0 I 0 ∇fik (xk ) ≤ k∇fi (x? )k2
n i=1
ek 0 0 0 ek
| {z } | {z }
M2 Λ2
31
• We can rewrite kek k2 ≤ δ 2 k∇fik (xk )k2 + c2 as
T 
xk − x? xk − x?
  
0 0 0
E ∇fik (xk ) 0 −δ 2 I c2
0 ∇fik (xk ) ≤ |{z}
ek 0 0 I ek Λ3
| {z }
M3
• We have A = I, B = −αI

−αI , C = I, and the following SDP
T 3 T
X
C 0 C 0
λ j Mj
BTP A BTP B 0 I 0 I
j=1
• Biased SGD satisfies Ekxk+1 − x? k2 ≤ ρ2 Ekxk − x? k2 + λ2 Λ2 + λ3 c2 if

1 − ρ2
     2 
−α −α −2m 1 0 2L 0 0
 −α α 2 − δ 2 λ3 α 2  + λ1  1 0 0 + λ2  0 −1 0 0
2 2
−α α α + λ3 0 0 0 0 0 0
32
• Given Ekx0 − x? k2 ≤ U0 , set Uk+1 = min(ρ2 Uk + λ2 Λ2 + λ3 c2 ) with
1 − ρ2
     2 
−α −α −2m 1 0 2L 0 0
 −α α 2 − δ 2 λ3 α 2  + λ1  1 0 0 + λ2  0 −1 0 0
2 2
−α α α + λ3 0 0 0 0 0 0
then we have Ekxk − x? k2 ≤ Uk . This leads to a sequential SDP problem.
• This problem has an exact solution
p p 2
Uk+1 = α c2 + δ 2 Λ2 + 2L2 δ 2 Uk + (1 − 2mα + 2L2 α2 )Uk + Λ2 α2
c2 +δ 2 Λ2 m(c2 (2L2 −m2 )+(1−δ 2 )Λ2 m2 )

• limk→∞ Uk = m2 −2δ 2 L2 + (m2 −2δ 2 L2 )2 α + O(α2 )
2 2 2
• Rate = 1 − m −2δ m
L
α + O(α2 )
• For different assumptions, modify (Mj , Λj )!
• H., Seiler, and Lessard. Analysis of biased stochastic gradient descent using
sequential semidefinite programs. Mathematical Programming, 2021
• Syed, Dall’Anese, H.. Bounds for the tracking error and dynamic regret of
inexact online optimization methods: A unified analysis via sequential SDPs.
33
Jump System Approach
n T
1 X AT 2
AT
X
i P Ai − ρ P i P Bi λj
C 0
Mj
C 0
T
n i=1 B i P Ai BiT P Bi 0 I 0 I
j∈Π
Pros:
• General enough to handle many algorithms: H., Seiler, Rantzer (COLT2017)
Method Ãik B̃ik C̃
" #
eik eTik

In − eik eTik 0̃ T
SAGA 0̃ 1
−α
n (e − neik )
T
1 −αeTik
" #
eik eTik

In − eik eTik 0̃ T
SAG 0̃ 1
−αn (e − eik )
T
1 −αn eik
T
• General enough to handle Markov {ik }: Syed and H. (NeurIPS2019), Guo

and H. (ACC2022a,2022b)
Cons:
• SDPs are much bigger than the ones obtained from stochastic quadratic
constraints, and we have to exploit SDP structures for simplifications
34
Control for Learning: Summary
• Iterative learning algorithms and neural network layers can be thought as

feedback control systems.
• The quadratic constraint approach from control theory can be leveraged to

formulate SDP conditions for machine learning research.
• Different from the study in control, now we want to obtain analytical

solutions of the SDPs!
35
Outline

• Control methods on certifiably robust neural networks

36

ECE586BH Lecture1

Uploaded by

Copyright:

Available Formats

ECE586BH Lecture1

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

ECE586BH Lecture1

Uploaded by

Copyright:

Available Formats

ECE586BH: Interplay between Control and

Lecture 1, Fall 2023

• dynamical systems • statistics/optimization

Unified and automated tools for a repeatable and trustable

• Optimization: The oracle can be inexact! xk+1 = xk − α(∇f (xk )+ek )

LTI systems Markov jump systems Lur’e systems

ξk+1 = Aξk + Buk ξk+1 = Aik ξk + Bik uk ξk+1 = Aξk + Bφ(Cξk )

Pros: Unified testing conditions when problem parameters are changed.

Key message: Robustness can be addressed in a unified manner!

LTI systems MJLS Lur’e systems

ξk+1 = Aξk + Buk ξk+1 = Aik ξk + Bik uk ξk+1 = Aξk + Bφ(Cξk )

Many control design methods rely on convex conditions (BGFB1994).

• Control for Learning

• Learning for Control

1. Approximate the true system as “a linear system + a perturbation”

{(ξ, w, v) : ξk+1 = Aξk + Bwk , vk = Cξk } ∩ {(ξ, w, v) : wk = ∆(vk )}

• If f is L-smooth and m-strongly convex, then by co-coercivity:

• We have A = I, B = −αI, C = I, and the following SDP

• If f is L-smooth and m-strongly convex, then by co-coercivity:

• We have A = I, B = −αI, C = I, and the same SDP

• Control for Learning

• Learning for Control

Deep learning models: f (x) = xD+1 and x0 = x

then we have for every τ such that kτ k2 ≤ ε: arg maxj fj (x + τ ) = y

A Unification of Existing 1-Lipschitz Neural Networks

• Control Theory (SLL): Tk = diag( nj=1 |WkT Wk |ij qj /qi ).

Natural Provable Accuracy (ε)

Cayley Large 43.3 29.2 18.8 11.0 -

• Competitive results over CIFAR100 and TinyImageNet

• We will use the property of σ to construct Mk such that we only need to

• The vector-version quadratic constraint: For diagonal Γk  0, we have

• Choosing vk = WkT xk + bk and vk0 = WkT x0k + bk , we have

then the residual network xk+1 = xk − Gk σ(WkT xk + bk ) is 1-Lipschitz.

• Analytical solution: Gk = Wk Γk and Γk WkT Wk Γk  2Γk .

• Control for Learning

• Learning for Control

assumption + method =⇒ bound

What to do for stochastic optimization (e.g. xk+1 = xk − α∇fik (xk ) where

• Jump system approach: Show that the trajectories generated by stochastic

where Aik ∈ {A1 , · · · , An }, Bik ∈ {B1 , · · · , Bn }, and Cik ∈ {C1 , · · · , Cn }

Then take expectation and apply the expected quadratic constraints!

• Assume the restricted secant inequality ∇f (x) (x − x ) ≥ mkx − x? k2

• Assume fi is L-smooth, i.e. k∇fi (x) − ∇fi (x? )k ≤ Lkx − x? k

• Biased SGD satisfies Ekxk+1 − x? k2 ≤ ρ2 Ekxk − x? k2 + λ2 Λ2 + λ3 c2 if

c2 +δ 2 Λ2 m(c2 (2L2 −m2 )+(1−δ 2 )Λ2 m2 )

• General enough to handle Markov {ik }: Syed and H. (NeurIPS2019), Guo

• Iterative learning algorithms and neural network layers can be thought as

• The quadratic constraint approach from control theory can be leveraged to

• Different from the study in control, now we want to obtain analytical

• Control for Learning

• Learning for Control

You might also like

• The vector-version quadratic constraint: For diagonal Γk 0, we have

• Analytical solution: Gk = Wk Γk and Γk WkT Wk Γk 2Γk .