2. Abstract
• Present OptNet, a network architecture integrating optimization
problems as individual layers.
• Develop a highly efficient solver for these layers that exploits fast
GPU-based batch solving within a primal-dual interior point method.
• Experiments show that OptNet can learn constraints better than other
neural architectures.
3. Introduction
• Consider how to treat exact, constrained optimization as an individual
layer within a deep learning architecture.
• Traditional feedforward network: target = simple functions; gradient taken with respect to the loss function; optimized by gradient descent, etc.
• OptNet: target = complex operations; gradient taken with respect to the KKT conditions; optimized by a primal-dual interior-point method.
4. • Consider quadratic programs (QP)
minimize𝑧 ½ 𝑧ᵀ𝑄𝑧 + 𝑞ᵀ𝑧
subject to 𝐴𝑧 = 𝑏, 𝐺𝑧 ≤ ℎ
• Develop a custom solver which can simultaneously solve multiple
small QPs in batch form.
• In total, the solver solves batches of QPs over 100 times faster than an
existing QP solver (Gurobi).
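As a small sanity check of the QP form above (not the paper's batched GPU solver), a single toy QP can be solved with SciPy's general-purpose SLSQP method; all matrix values below are made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Toy QP in the slide's form: minimize (1/2) z^T Q z + q^T z
#                             subject to A z = b, G z <= h
Q = 2.0 * np.eye(2)                                # positive definite
q = np.array([-2.0, -5.0])
A = np.array([[1.0, 1.0]]); b = np.array([1.0])    # z1 + z2 = 1
G = np.array([[-1.0, 0.0]]); h = np.array([0.0])   # -z1 <= 0, i.e. z1 >= 0

obj = lambda z: 0.5 * z @ Q @ z + q @ z
cons = [
    {"type": "eq",   "fun": lambda z: A @ z - b},  # A z = b
    {"type": "ineq", "fun": lambda z: h - G @ z},  # G z <= h  (SLSQP wants >= 0)
]
res = minimize(obj, x0=np.zeros(2), method="SLSQP", constraints=cons)
print(res.x)  # the inequality z1 >= 0 is active at the optimum (0, 1)
```

For this instance the unconstrained equality-only optimum would have z1 < 0, so the inequality becomes active and the solution is z = (0, 1).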
7. Implicit differentiation
• In implicit differentiation, we differentiate each side of an equation with
two variables by treating one of the variables as a function of the other.
• Using implicit differentiation, we treat 𝑦 as an implicit function of 𝑥:
• 𝑥² + 𝑦² = 25
• ⇒ 2𝑥 + 2𝑦 · 𝑑𝑦/𝑑𝑥 = 0 (equivalently, 2𝑥 𝑑𝑥 + 2𝑦 𝑑𝑦 = 0)
• ⇒ 𝑦′ = 𝑑𝑦/𝑑𝑥 = −2𝑥 / 2𝑦 = −𝑥/𝑦
• At the point (3, −4): 𝑚 = 𝑦′ = −𝑥/𝑦 = −3/(−4) = 3/4
Source: https://www.khanacademy.org/math/ap-calculus-ab/ab-differentiation-2-new/ab-3-2/a/implicit-differentiation-review
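As a quick numeric check (not from the slides), the implicit slope −𝑥/𝑦 can be compared against a finite difference of the explicit lower branch 𝑦(𝑥) = −√(25 − 𝑥²), which passes through (3, −4):

```python
import math

def y_lower(x):
    """Lower branch of the circle x^2 + y^2 = 25."""
    return -math.sqrt(25.0 - x * x)

x0 = 3.0
y0 = y_lower(x0)                 # y0 = -4
slope_implicit = -x0 / y0        # dy/dx = -x/y = 3/4

h = 1e-6                         # central finite difference
slope_fd = (y_lower(x0 + h) - y_lower(x0 - h)) / (2 * h)
print(slope_implicit, slope_fd)  # both close to 0.75
```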
8. KKT condition
• Consider the optimization problem
maximize 𝑓(𝐱)
subject to
𝑔𝑖(𝐱) ≤ 0 for 𝑖 = 1, …, 𝑚,
ℎ𝑗(𝒙) = 0 for 𝑗 = 1, …, 𝑙.
• If 𝒙∗ is a local optimum, then there exist multipliers 𝜇𝑖 (𝑖 = 1, …, 𝑚) and 𝜆𝑗 (𝑗 = 1, …, 𝑙) such that
• Stationarity
−𝛻𝑓(𝒙∗) = ∑𝑖=1…𝑚 𝜇𝑖 𝛻𝑔𝑖(𝒙∗) + ∑𝑗=1…𝑙 𝜆𝑗 𝛻ℎ𝑗(𝒙∗)
• Primal feasibility
𝑔𝑖(𝒙∗) ≤ 0, for 𝑖 = 1, …, 𝑚
ℎ𝑗(𝒙∗) = 0, for 𝑗 = 1, …, 𝑙
• Dual feasibility
𝜇𝑖 ≥ 0, for 𝑖 = 1, …, 𝑚
• Complementary slackness
𝜇𝑖 𝑔𝑖(𝒙∗) = 0, for 𝑖 = 1, …, 𝑚
Source: https://en.m.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions
Source: https://www.cs.cmu.edu/~ggordon/10725-F12/slides/16-kkt.pdf
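A tiny numeric illustration (not from the slides), written in the equivalent minimization convention 𝑓′(𝑧∗) + 𝜇 𝑔′(𝑧∗) = 0: minimize 𝑓(𝑧) = (𝑧 − 3)² subject to 𝑔(𝑧) = 𝑧 − 1 ≤ 0. The optimum is 𝑧∗ = 1 with multiplier 𝜇∗ = 4, and all four KKT conditions can be checked directly:

```python
# Toy problem: minimize f(z) = (z - 3)^2  subject to  g(z) = z - 1 <= 0.
# The constraint is active at the optimum z* = 1, with multiplier mu* = 4.
z_star, mu_star = 1.0, 4.0

grad_f = 2.0 * (z_star - 3.0)   # f'(z*) = -4
grad_g = 1.0                    # g'(z*) = 1
g_val = z_star - 1.0            # g(z*) = 0

assert grad_f + mu_star * grad_g == 0.0   # stationarity
assert g_val <= 0.0                       # primal feasibility
assert mu_star >= 0.0                     # dual feasibility
assert mu_star * g_val == 0.0             # complementary slackness
print("all KKT conditions hold")
```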
9. Primal-dual interior-point method
• An algorithm for solving convex optimization problems.
• More efficient than the barrier method.
Source: http://www.cs.cmu.edu/~aarti/Class/10725_Fall17/Lecture_Slides/primal-dual.pdf
18. An efficient batched QP solver
• Deep networks are trained in mini-batches to take advantage of
efficient data-parallel GPU operations.
• Modern state-of-the-art QP solvers cannot solve problems on the GPU
in parallel across an entire mini-batch.
• Implement a GPU-based primal-dual interior-point method based on
Mattingley and Boyd (2012).
• Solve a batch of QPs and provide the necessary gradients.
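The batched linear algebra at the heart of such a solver can be sketched in NumPy. This is a simplified CPU-only illustration restricted to equality-constrained QPs (one batched KKT solve); the actual solver is a full primal-dual interior-point method that also handles inequalities and runs on the GPU:

```python
import numpy as np

rng = np.random.default_rng(0)
B, n, m = 128, 10, 3        # batch size, variables, equality constraints

# Random batch of equality-constrained QPs:
#   minimize (1/2) z^T Q z + q^T z  subject to  A z = b
M = rng.standard_normal((B, n, n))
Q = M @ M.transpose(0, 2, 1) + 0.1 * np.eye(n)   # positive definite per instance
q = rng.standard_normal((B, n))
A = rng.standard_normal((B, m, n))
b = rng.standard_normal((B, m))

# Stack the KKT systems [[Q, A^T], [A, 0]] [z; nu] = [-q; b]
# and solve all B of them in one batched call.
K = np.zeros((B, n + m, n + m))
K[:, :n, :n] = Q
K[:, :n, n:] = A.transpose(0, 2, 1)
K[:, n:, :n] = A
rhs = np.concatenate([-q, b], axis=1)            # (B, n + m)
sol = np.linalg.solve(K, rhs[:, :, None])[:, :, 0]
z, nu = sol[:, :n], sol[:, n:]                   # primal and dual solutions
```

All B factorizations happen inside one `np.linalg.solve` call, which is the same batching pattern the GPU solver exploits.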
19. Example: Sudoku(4 × 4)
• Equality constraints:
• Let 𝑧 ∈ 𝑅4×4×4, with 𝑧𝑖𝑗𝑘 = 1 if the cell in row 𝑖, column 𝑗 contains the number 𝑘.
• Each cell contains exactly one number:
𝑧𝑖𝑗1 + 𝑧𝑖𝑗2 + 𝑧𝑖𝑗3 + 𝑧𝑖𝑗4 = 1 for every row 𝑖 and column 𝑗.
• Each row contains each number exactly once:
𝑧11𝑘 + 𝑧12𝑘 + 𝑧13𝑘 + 𝑧14𝑘 = 1, …, 𝑧41𝑘 + 𝑧42𝑘 + 𝑧43𝑘 + 𝑧44𝑘 = 1, for 𝑘 ∈ {1, 2, 3, 4}
(Figure: the 4 × 4 board with cells indexed (row, column).)
20. • Each column contains each number exactly once:
𝑧11𝑘 + 𝑧21𝑘 + 𝑧31𝑘 + 𝑧41𝑘 = 1, …, 𝑧14𝑘 + 𝑧24𝑘 + 𝑧34𝑘 + 𝑧44𝑘 = 1, for 𝑘 ∈ {1, 2, 3, 4}
• Each 2 × 2 subgrid contains each number exactly once:
𝑧11𝑘 + 𝑧12𝑘 + 𝑧21𝑘 + 𝑧22𝑘 = 1, …, 𝑧33𝑘 + 𝑧34𝑘 + 𝑧43𝑘 + 𝑧44𝑘 = 1, for 𝑘 ∈ {1, 2, 3, 4}
21. • Flatten the 3-D tensor 𝑧 ∈ 𝑅4×4×4 into a vector in 𝑅64.
• The total number of equality constraints is 40 (after removing linearly
dependent rows).
• 𝐴 ∈ 𝑅40×64, 𝑏 = 𝟏 ∈ 𝑅40
• Inequality constraints:
• Every entry satisfies 𝑧𝑖𝑗𝑘 ≥ 0.
• 𝐺 = −𝐼 ∈ 𝑅64×64, ℎ = 𝟎 ∈ 𝑅64
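The constraint matrix above can be built and checked directly. This is a sketch, not the authors' code; indices here are 0-based:

```python
import numpy as np

n = 4
rows = []

def constraint(cells):
    """One equality row: the one-hot entries at `cells` (list of (i, j, k)) sum to 1."""
    row = np.zeros((n, n, n))
    for i, j, k in cells:
        row[i, j, k] = 1.0
    return row.reshape(-1)

for i in range(n):           # each cell holds exactly one number
    for j in range(n):
        rows.append(constraint([(i, j, k) for k in range(n)]))
for k in range(n):           # each number appears once per row / column / 2x2 box
    for i in range(n):
        rows.append(constraint([(i, j, k) for j in range(n)]))
    for j in range(n):
        rows.append(constraint([(i, j, k) for i in range(n)]))
    for bi in range(0, n, 2):
        for bj in range(0, n, 2):
            rows.append(constraint([(bi + di, bj + dj, k)
                                    for di in range(2) for dj in range(2)]))

A = np.stack(rows)                         # 64 raw constraint rows over 64 variables
print(A.shape, np.linalg.matrix_rank(A))   # rank 40, matching the slide

# A valid 4x4 Sudoku satisfies A z = 1 exactly.
board = [[1, 2, 3, 4], [3, 4, 1, 2], [2, 1, 4, 3], [4, 3, 2, 1]]
z = np.zeros((n, n, n))
for i in range(n):
    for j in range(n):
        z[i, j, board[i][j] - 1] = 1.0
print(np.allclose(A @ z.reshape(-1), 1.0))
```

The 64 raw rows are linearly dependent; their rank is the 40 independent equality constraints quoted above.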
22. • Objective:
• Sudoku itself has no objective function, but the authors set one:
minimize𝑧 𝑧ᵀ𝑄𝑧 + 𝑝ᵀ𝑧
where 𝑄 = 0.1 𝐼 ∈ 𝑅64×64 and 𝑝 ∈ 𝑅64 is the (flattened) initial board.
• Before training, 𝐴 is initialized randomly.
• During training, only the initial board and its solution are given.
26. Conclusion
• Present OptNet, a neural network architecture that uses optimization
problems as individual layers in the network.
• Experiments show that the proposed method can learn to solve problems
purely from data.