OptNet: Differentiable Optimization as a Layer in Neural Networks
Brandon Amos and J. Zico Kolter
CMU
ICML 2017
Abstract
• Present OptNet, a network architecture integrating optimization
problems as individual layers.
• Develop a highly efficient solver for these layers that exploits fast
GPU-based batch solving within a primal-dual interior point method.
• Experiments show that OptNet can learn constraints better than other
neural architectures.
Introduction
• Consider how to treat exact, constrained optimization as an individual
layer within a deep learning architecture.
             | Traditional feedforward network | OptNet
Target       | simple function                 | complex operations
Gradient     | w.r.t. the loss function        | w.r.t. the KKT conditions
Optimization | gradient descent, etc.          | primal-dual interior-point method
• Consider quadratic programs (QP)
minimize_z (1/2) 𝑧ᵀ𝑄𝑧 + 𝑞ᵀ𝑧
subject to 𝐴𝑧 = 𝑏, 𝐺𝑧 ≤ ℎ
• Develop a custom solver which can simultaneously solve multiple
small QPs in batch form.
• In total, the solver solves batches of QPs over 100 times faster than an
existing QP solver (Gurobi). (A minimal single-QP example is sketched below.)
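To make the QP form above concrete, here is a minimal sketch (my own illustration, not the authors' code) that builds and solves a single small instance with cvxpy; all data are random except that b and h are constructed around a known feasible point so the problem is guaranteed to be solvable.

import cvxpy as cp
import numpy as np

n, m, p = 3, 2, 1                       # variables, inequalities, equalities
rng = np.random.default_rng(0)
L = rng.standard_normal((n, n))
Q = L @ L.T + 1e-3 * np.eye(n)          # positive definite => convex QP
q = rng.standard_normal(n)
G = rng.standard_normal((m, n))
A = rng.standard_normal((p, n))
z0 = rng.standard_normal(n)             # known feasible point
b = A @ z0
h = G @ z0 + rng.uniform(0.1, 1.0, m)   # strict slack at z0

z = cp.Variable(n)
prob = cp.Problem(cp.Minimize(0.5 * cp.quad_form(z, Q) + q @ z),
                  [A @ z == b, G @ z <= h])
prob.solve()
print("optimal z:", z.value)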
Background
Matrix Calculus
• scalar by vector • vector by scalar • vector by vector
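As a quick reminder of what these three cases produce, the sketch below (my own illustration using PyTorch autograd, not part of the talk) prints the derivative shape for each: a gradient for scalar-by-vector, a vector for vector-by-scalar, and a Jacobian matrix for vector-by-vector.

import torch
from torch.autograd.functional import jacobian

x = torch.randn(3)
t = torch.randn(())

# scalar by vector: f: R^3 -> R, derivative is a gradient of shape (3,)
f = lambda v: (v ** 2).sum()
print(jacobian(f, x).shape)     # torch.Size([3])

# vector by scalar: g: R -> R^4, derivative is a vector of shape (4,)
g = lambda s: torch.stack([s, s ** 2, s ** 3, torch.sin(s)])
print(jacobian(g, t).shape)     # torch.Size([4])

# vector by vector: h: R^3 -> R^2, derivative is a 2x3 Jacobian matrix
W = torch.randn(2, 3)
h = lambda v: W @ v
print(jacobian(h, x).shape)     # torch.Size([2, 3])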
Implicit differentiation
• In implicit differentiation, we differentiate each side of an equation with
two variables by treating one of the variables as a function of the other.
• Using implicit differentiation, we treat 𝑦 as an implicit function of 𝑥:
• 𝑥² + 𝑦² = 25
• ⇒ 2𝑥 + 2𝑦 (𝑑𝑦/𝑑𝑥) = 0
• ⇒ 2𝑥 𝑑𝑥 + 2𝑦 𝑑𝑦 = 0
• ⇒ 𝑦′ = 𝑑𝑦/𝑑𝑥 = −2𝑥/2𝑦 = −𝑥/𝑦
• At the point (3, −4) on the circle: 𝑚 = 𝑦′ = −𝑥/𝑦 = −3/(−4) = 3/4
Source: https://www.khanacademy.org/math/ap-calculus-ab/ab-differentiation-2-new/ab-3-2/a/implicit-differentiation-review
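The worked example can also be checked mechanically. A minimal sympy sketch (my own, using the implicit-function rule dy/dx = −F_x / F_y rather than anything from the slides):

import sympy as sp

x, y = sp.symbols('x y')
F = x**2 + y**2 - 25                    # the circle written as F(x, y) = 0

dy_dx = -sp.diff(F, x) / sp.diff(F, y)  # implicit function rule
print(sp.simplify(dy_dx))               # -x/y

# slope at the point (3, -4), which lies on the circle
print(dy_dx.subs({x: 3, y: -4}))        # 3/4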
KKT condition
• Consider the optimization problem
min 𝑓(𝒙)
subject to
𝑔𝑖(𝒙) ≤ 0 for i = 1, …, 𝑚,
ℎ𝑗(𝒙) = 0 for j = 1, …, 𝑙.
• If 𝒙∗ is a local optimum, then there exist 𝜇𝑖 (𝑖 = 1, …, 𝑚) and 𝜆𝑗 (𝑗 = 1, …, 𝑙) such that
• Stationarity
−𝛻𝑓(𝒙∗) = ∑_{i=1}^{m} 𝜇𝑖 𝛻𝑔𝑖(𝒙∗) + ∑_{j=1}^{l} 𝜆𝑗 𝛻ℎ𝑗(𝒙∗)
• Primal feasibility
𝑔𝑖(𝒙∗) ≤ 0, for i = 1, …, 𝑚
ℎ𝑗(𝒙∗) = 0, for j = 1, …, 𝑙
• Dual feasibility
𝜇𝑖 ≥ 0, for i = 1, …, 𝑚
• Complementary slackness
𝜇𝑖 𝑔𝑖(𝒙∗) = 0, for i = 1, …, 𝑚
Source: https://en.m.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions
Source: https://www.cs.cmu.edu/~ggordon/10725-F12/slides/16-kkt.pdf
Primal-dual interior-point method
• An algorithm for solving convex optimization problems.
• More efficient than the barrier method.
Source: http://www.cs.cmu.edu/~aarti/Class/10725_Fall17/Lecture_Slides/primal-dual.pdf
OptNet
minimize_z (1/2) 𝑧ᵀ𝑄𝑧 + 𝑞ᵀ𝑧
subject to 𝐴𝑧 = 𝑏, 𝐺𝑧 ≤ ℎ
• To obtain the derivatives of 𝑧∗ with respect to the parameters, we
differentiate the following KKT conditions (where 𝐷(·) builds a diagonal matrix from a vector).
1. Stationarity: 𝑄𝑧∗ + 𝑞 + 𝐴ᵀ𝑣∗ + 𝐺ᵀ𝜆∗ = 0
2. Primal feasibility: 𝐴𝑧∗ − 𝑏 = 0
3. Complementary slackness: 𝐷(𝜆∗)(𝐺𝑧∗ − ℎ) = 0
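As a numerical sanity check (my own sketch, not from the paper), one can solve a small random QP with cvxpy and verify that these three conditions hold at the returned primal/dual solution (𝑧∗, 𝜆∗, 𝑣∗):

import cvxpy as cp
import numpy as np

n, m, p = 4, 3, 2
rng = np.random.default_rng(1)
L = rng.standard_normal((n, n)); Q = L @ L.T + 1e-3 * np.eye(n)
q = rng.standard_normal(n)
G = rng.standard_normal((m, n)); A = rng.standard_normal((p, n))
z0 = rng.standard_normal(n)                   # known feasible point
b, h = A @ z0, G @ z0 + rng.uniform(0.1, 1.0, m)

z = cp.Variable(n)
ineq, eq = G @ z <= h, A @ z == b
cp.Problem(cp.Minimize(0.5 * cp.quad_form(z, Q) + q @ z), [ineq, eq]).solve()
z_s, lam, v = z.value, ineq.dual_value, eq.dual_value

print(np.linalg.norm(Q @ z_s + q + A.T @ v + G.T @ lam))  # stationarity ~ 0
print(np.linalg.norm(A @ z_s - b))                         # primal feasibility ~ 0
print(np.linalg.norm(lam * (G @ z_s - h)))                 # complementary slackness ~ 0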
Stationarity
• Recall stationarity
−𝛻𝑓(𝒙∗) = ∑_{i=1}^{m} 𝜇𝑖 𝛻𝑔𝑖(𝒙∗) + ∑_{j=1}^{l} 𝜆𝑗 𝛻ℎ𝑗(𝒙∗)
• obtain
𝑄𝑧∗ + 𝑞 + 𝐴ᵀ𝑣∗ + 𝐺ᵀ𝜆∗ = 𝟎
• By implicit differentiation
• ⇒ 𝑑𝑄 𝑧∗ + 𝑄 𝑑𝑧 + 𝑑𝑞 + 𝑑𝐴ᵀ 𝑣∗ + 𝐴ᵀ 𝑑𝑣 + 𝑑𝐺ᵀ 𝜆∗ + 𝐺ᵀ 𝑑𝜆 = 𝟎
• ⇒ 𝑄 𝑑𝑧 + 𝐺ᵀ 𝑑𝜆 + 𝐴ᵀ 𝑑𝑣 = −𝑑𝑄 𝑧∗ − 𝑑𝑞 − 𝑑𝐴ᵀ 𝑣∗ − 𝑑𝐺ᵀ 𝜆∗
Primal feasibility
• Recall primal feasibility
• 𝑔𝑖(𝒙∗) ≤ 0, for i = 1, …, 𝑚
• ℎ𝑗(𝒙∗) = 0, for j = 1, …, 𝑙
• obtain
𝐴𝑧∗ − 𝑏 = 𝟎 (taking only the equality constraints)
• By implicit differentiation
• ⇒ 𝑑𝐴𝑧∗ + 𝐴𝑑𝑧 − 𝑑𝑏 = 𝟎
• ⇒ 𝐴𝑑𝑧 + 𝟎𝑑𝜆 + 𝟎𝑑𝑣 = −𝑑𝐴𝑧∗ + 𝑑𝑏
Complementary slackness
• Recall complementary slackness
𝜇𝑖 𝑔𝑖(𝒙∗) = 0, for i = 1, …, 𝑚
• obtain
𝐷(𝜆∗)(𝐺𝑧∗ − ℎ) = 𝟎
• By implicit differentiation
⇒ 𝐷(𝜆∗)𝐺 𝑑𝑧 + 𝐷(𝐺𝑧∗ − ℎ) 𝑑𝜆 + 𝟎 𝑑𝑣 = −𝐷(𝜆∗) 𝑑𝐺 𝑧∗ + 𝐷(𝜆∗) 𝑑ℎ
(obtained by applying the product rule to 𝐷(𝜆∗)(𝐺𝑧∗ − ℎ) = 0, using 𝐷(𝑑𝜆)(𝐺𝑧∗ − ℎ) = 𝐷(𝐺𝑧∗ − ℎ) 𝑑𝜆)
Matrix form
𝑄 𝑑𝑧 + 𝐺ᵀ 𝑑𝜆 + 𝐴ᵀ 𝑑𝑣 = −𝑑𝑄 𝑧∗ − 𝑑𝑞 − 𝑑𝐴ᵀ 𝑣∗ − 𝑑𝐺ᵀ 𝜆∗
𝐷(𝜆∗)𝐺 𝑑𝑧 + 𝐷(𝐺𝑧∗ − ℎ) 𝑑𝜆 + 𝟎 𝑑𝑣 = −𝐷(𝜆∗) 𝑑𝐺 𝑧∗ + 𝐷(𝜆∗) 𝑑ℎ
𝐴 𝑑𝑧 + 𝟎 𝑑𝜆 + 𝟎 𝑑𝑣 = −𝑑𝐴 𝑧∗ + 𝑑𝑏
• Stacking these three equations gives one linear system in (𝑑𝑧, 𝑑𝜆, 𝑑𝑣), which can be solved to obtain the Jacobians of 𝑧∗ with respect to any problem parameter.
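Below is a numpy sketch (my own reconstruction, not the paper's batched solver) that assembles this linear system at the solution of a small QP, solves it for 𝑑𝑧 when only 𝑞 is perturbed (so the right-hand side reduces to (−𝑑𝑞, 0, 0)), and compares the result with a finite-difference re-solve.

import cvxpy as cp
import numpy as np

def solve_qp(Q, q, G, h, A, b):
    z = cp.Variable(len(q))
    ineq, eq = G @ z <= h, A @ z == b
    cp.Problem(cp.Minimize(0.5 * cp.quad_form(z, Q) + q @ z), [ineq, eq]).solve()
    return z.value, ineq.dual_value, eq.dual_value

n, m, p = 4, 3, 2
rng = np.random.default_rng(2)
L = rng.standard_normal((n, n)); Q = L @ L.T + 1e-3 * np.eye(n)
q = rng.standard_normal(n)
G = rng.standard_normal((m, n)); A = rng.standard_normal((p, n))
z0 = rng.standard_normal(n)
b, h = A @ z0, G @ z0 + rng.uniform(0.1, 1.0, m)

z_s, lam, v = solve_qp(Q, q, G, h, A, b)

# Left-hand side of the matrix form above; D(x) is np.diag(x)
K = np.block([
    [Q,                G.T,                  A.T              ],
    [np.diag(lam) @ G, np.diag(G @ z_s - h), np.zeros((m, p)) ],
    [A,                np.zeros((p, m)),     np.zeros((p, p)) ],
])

dq = rng.standard_normal(n)
rhs = np.concatenate([-dq, np.zeros(m), np.zeros(p)])   # only q is perturbed
dz = np.linalg.solve(K, rhs)[:n]

# Finite-difference check: re-solve the QP with q + eps * dq
eps = 1e-5
z_eps, _, _ = solve_qp(Q, q + eps * dq, G, h, A, b)
print(dz)
print((z_eps - z_s) / eps)   # should roughly agree with dz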
An efficient batched QP solver
• Deep networks are trained in mini-batches to take advantage of
efficient data-parallel GPU operation.
• Modern state-of-the-art QP solvers do not have the capability of
solving problems on the GPU in parallel across the entire mini-batch.
• Implement a GPU-based primal-dual interior-point method based on
Mattingley and Boyd (2012).
• Solves a batch of QPs and provides the necessary gradients.
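The authors released this batched solver as the qpth package. A minimal usage sketch follows, assuming the QPFunction interface of that library (batch dimension first on every argument); treat the exact call signature as an assumption and adapt it to the installed version.

import torch
from qpth.qp import QPFunction   # assumed interface of the released qpth package

nbatch, n, m, p = 16, 10, 4, 2
L = torch.randn(nbatch, n, n)
Q = L.bmm(L.transpose(1, 2)) + 1e-3 * torch.eye(n)   # batch of PD matrices
q = torch.randn(nbatch, n, requires_grad=True)
G = torch.randn(nbatch, m, n)
A = torch.randn(nbatch, p, n)
z0 = torch.randn(nbatch, n)
b = A.bmm(z0.unsqueeze(2)).squeeze(2)                 # feasible by construction
h = G.bmm(z0.unsqueeze(2)).squeeze(2) + 0.5

# One call solves the whole batch of QPs and is differentiable w.r.t. all inputs
z_star = QPFunction(verbose=False)(Q, q, G, h, A, b)
z_star.sum().backward()
print(q.grad.shape)              # gradients flow back through the QP layer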
Example: Sudoku(4 × 4)
• Equality constraints:
• Let 𝑧 ∈ 𝑅^{4×4×4}, where 𝑧𝑖𝑗𝑘 = 1 if the cell in row 𝑖, column 𝑗 contains the number 𝑘.
• Every cell contains exactly one number.
• 𝑧𝑖𝑗1 + 𝑧𝑖𝑗2 + 𝑧𝑖𝑗3 + 𝑧𝑖𝑗4 = 1 for every row 𝑖, column 𝑗.
• Every row contains each number exactly once.
𝑧11𝑘 + 𝑧12𝑘 + 𝑧13𝑘 + 𝑧14𝑘 = 1
⋮
𝑧41𝑘 + 𝑧42𝑘 + 𝑧43𝑘 + 𝑧44𝑘 = 1
for 𝑘 ∈ {1, 2, 3, 4}
• Every column contains each number exactly once.
𝑧11𝑘 + 𝑧21𝑘 + 𝑧31𝑘 + 𝑧41𝑘 = 1
⋮
𝑧14𝑘 + 𝑧24𝑘 + 𝑧34𝑘 + 𝑧44𝑘 = 1
for 𝑘 ∈ {1, 2, 3, 4}
• Every 2 × 2 subgrid contains each number exactly once.
𝑧11𝑘 + 𝑧12𝑘 + 𝑧21𝑘 + 𝑧22𝑘 = 1
⋮
𝑧33𝑘 + 𝑧34𝑘 + 𝑧43𝑘 + 𝑧44𝑘 = 1
for 𝑘 ∈ {1, 2, 3, 4}
• Flatten the 3-D tensor 𝑧 ∈ 𝑅^{4×4×4} into a 1-D vector in 𝑅^{64}.
• The total number of equality constraints is 40 (after removing linearly dependent rows); see the numpy sketch at the end of this example.
• 𝐴 ∈ 𝑅^{40×64}, 𝑏 = 𝟏 ∈ 𝑅^{40}
• Inequality constraints:
• Every entry is nonnegative: 𝑧𝑖𝑗𝑘 ≥ 0.
• 𝐺 = −𝐼 ∈ 𝑅^{64×64}, ℎ = 𝟎 ∈ 𝑅^{64}
• Objective:
• The problem itself has none, but the authors set
minimize 𝑧ᵀ𝑄𝑧 + 𝑝ᵀ𝑧
where 𝑄 = 0.1 𝐼 ∈ 𝑅^{64×64} and 𝑝 ∈ 𝑅^{64} encodes the initial board.
• Before training, 𝐴 is initialized randomly.
• During training, only the initial board and its solution are given, so the constraints must be learned from data.
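The constraint count quoted above can be reproduced directly. The short numpy sketch below (my reconstruction of the encoding, not the paper's code) builds all 64 candidate equality rows and confirms that only 40 of them are linearly independent.

import numpy as np
from itertools import product

N = 4

def idx(i, j, k):                       # flatten z[i, j, k] into a position in R^64
    return (i * N + j) * N + k

rows = []
for i, j in product(range(N), repeat=2):            # one number per cell
    r = np.zeros(N ** 3); r[[idx(i, j, k) for k in range(N)]] = 1; rows.append(r)
for i, k in product(range(N), repeat=2):            # each number once per row
    r = np.zeros(N ** 3); r[[idx(i, j, k) for j in range(N)]] = 1; rows.append(r)
for j, k in product(range(N), repeat=2):            # each number once per column
    r = np.zeros(N ** 3); r[[idx(i, j, k) for i in range(N)]] = 1; rows.append(r)
for bi, bj, k in product(range(2), range(2), range(N)):   # each number once per 2x2 box
    cells = [(2 * bi + di, 2 * bj + dj) for di in range(2) for dj in range(2)]
    r = np.zeros(N ** 3); r[[idx(i, j, k) for i, j in cells]] = 1; rows.append(r)

A_full = np.array(rows)                 # 64 candidate constraints before pruning
print(A_full.shape)                     # (64, 64)
print(np.linalg.matrix_rank(A_full))    # 40 independent equality constraints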
Experimental results
Sudoku result
Performance of Gurobi and proposed method
Conclusion
• Present OptNet, a neural network architecture that uses optimization problems as individual layers in the network.
• Experiments show that the proposed method can learn to solve such problems purely from data.