Multidimensional Newton-Raphson Consensus For Distributed Convex Optimization
FeedNetBack, by Progetto di Ateneo CPDA090135/09 funded by the University of Padova, and by the Italian PRIN Project New Methods and Algorithms for Identification and Adaptive Control of Technological Systems.
can be based on incremental gradient methods [5] with deterministic [6] or randomized [7] approaches, and they may use appropriate projection steps to account for possible constraints [8].
Decomposition methods instead operate by manipulating the dual problem, usually splitting it into simpler sub-tasks that require the agents to own local copies of the to-be-updated variables. Convergence to the global optimum is ensured by constraining the local variables to converge to a common value [9]. In the class of dual decomposition methods, a particularly popular strategy is the Alternating Direction Method of Multipliers (ADMM) developed in [1, pp. 253-261] and recently proposed in various distributed contexts [10], [11].
Another interesting approach, suitable only for particular optimization problems, is to use the so-called Fast-Lipschitz methods [12], [13]. These exploit particular structures of the objective functions and constraints to increase the convergence speed. Alternative distributed optimization approaches are based on heuristics like swarm optimization [14] or genetic algorithms [15]; however, their convergence and performance properties are difficult to study analytically.
Statement of contribution: here we focus on the unconstrained minimization of a sum of multidimensional convex functions, where each component of the global function is a private local cost available only to a specific agent. We thus offer a distributed algorithm that approximately operates as a Newton-Raphson minimization procedure, and then derive two approximated versions that trade off the required communication bandwidth against the convergence speed and accuracy of the results. For these strategies we provide convergence proofs and an analysis of the robustness to the initial conditions of the algorithms, under the assumptions that the local cost functions are convex and smooth and that the communication schemes are synchronous. The main algorithm is an extension of what has been proposed in [16], while the approximated versions are completely novel. We notice that communications between agents are based on classical average-consensus algorithms [17]. The offered algorithms thus inherit the good properties of consensus algorithms, namely their simplicity, their potential implementation with asynchronous communication schemes, and their ability to adapt to time-varying network topologies.
Structure of the paper: in Sec. II we formulate the problem from a mathematical point of view. In Sec. III we derive the main generic distributed algorithm, from which we derive three different and specific instances in Sections IV, V and VI. In Sec. VII we briefly discuss the properties of these algorithms, and then in Sec. VIII we show their effectiveness by means of numerical examples. Finally, in Sec. IX we draw some concluding remarks¹.

2012 American Control Conference, Fairmont Queen Elizabeth, Montréal, Canada, June 27-29, 2012. 978-1-4577-1094-0/12/$26.00 © 2012 AACC
II. PROBLEM FORMULATION
We assume that $S$ agents, each endowed with the local $N$-dimensional and strictly convex cost function $f_i : \mathbb{R}^N \to \mathbb{R}$, cooperate in the minimization of the sum

$$f(x) := \sum_{i=1}^{S} f_i(x) \qquad (1)$$

where $x := [x_1 \ \cdots \ x_N]^T$ is the generic element of $\mathbb{R}^N$.
Agents thus want to distributedly compute

$$x^* := \arg\min_x f(x) \qquad (2)$$

exploiting low-complexity distributed optimization algorithms. As in [16], we model the communication network as a graph $G = (V, E)$ whose vertexes $V = \{1, 2, \ldots, S\}$ represent the agents and whose edges $(i, j) \in E$ represent the available communication links. We assume that the graph is undirected and connected. We say that a stochastic matrix $P \in \mathbb{R}^{S \times S}$, i.e. a matrix whose elements are non-negative and s.t. $P \mathbf{1}_S = \mathbf{1}_S$, where $\mathbf{1}_S := [1 \ 1 \ \cdots \ 1]^T \in \mathbb{R}^S$, is consistent with a graph $G$ if $P_{ij} > 0$ only if $(i, j) \in E$. If $P$ is also symmetric and includes all the edges, i.e. $P_{ij} > 0$ if $(i, j) \in E$, then $\lim_{k \to \infty} P^k = \frac{1}{S} \mathbf{1}_S \mathbf{1}_S^T$. Such a matrix $P$ is often referred to as a consensus matrix.
In the following we use $x_i(k) := [x_{i,1}(k) \ \cdots \ x_{i,N}(k)]^T$ to indicate the input location of agent $i$ at time $k$, and the operator $\nabla$ to indicate differentiation w.r.t. $x$, i.e.

$$\nabla f_i(x_i(k)) := \left[ \left.\frac{\partial f_i}{\partial x_1}\right|_{x_i(k)} \cdots \ \left.\frac{\partial f_i}{\partial x_N}\right|_{x_i(k)} \right]^T \qquad (3)$$

$$\nabla^2 f_i(x_i(k)) := \left[ \left.\frac{\partial^2 f_i}{\partial x_m \partial x_n}\right|_{x_i(k)} \right]_{m,n=1}^{N}. \qquad (4)$$
In general we use the fraction bar to indicate the Hadamard division, i.e. the component-wise division of vectors $a, b \in \mathbb{R}^N$:

$$\frac{a}{b} := \left[ \frac{a_1}{b_1}, \ldots, \frac{a_N}{b_N} \right]^T. \qquad (5)$$
We use bold fonts to indicate vectorial quantities or functions whose range is vectorial, plain italic fonts to indicate scalar quantities or functions whose range is scalar, capital italic fonts to indicate matrix quantities, and capital bold fonts to indicate matrix quantities derived by stacking other matrix quantities. As in [16], to simplify the proofs we exploit the following assumption, implying that $x^*$ is unique:
Assumption 1. The local functions $f_i$ belong to $C^2$, $\forall i$, i.e. they are continuous up to their second partial derivatives; moreover, their second partial derivatives are strictly positive, bounded, and defined for all $x \in \mathbb{R}^N$. Finally, each scalar component of the global minimizer $x^*$ is finite.

¹The proofs of the proposed propositions can be found in the homonymous technical report available on the authors' webpages.

Consider

$$A_i := \begin{bmatrix} a^{(i)}_{11} & \cdots & a^{(i)}_{1M} \\ \vdots & & \vdots \\ a^{(i)}_{N1} & \cdots & a^{(i)}_{NM} \end{bmatrix}, \quad i = 1, \ldots, S,$$

to be $S$ generic $N \times M$ matrices associated to agents $1, \ldots, S$, and assume that these agents want to distributedly compute $\frac{1}{S} \sum_{i=1}^{S} A_i$ by means of the doubly stochastic communication matrix $P$. In the following sections, to indicate the whole set of single component-wise steps

$$\begin{bmatrix} a^{(1)}_{pq}(k+1) \\ \vdots \\ a^{(S)}_{pq}(k+1) \end{bmatrix} = P \begin{bmatrix} a^{(1)}_{pq}(k) \\ \vdots \\ a^{(S)}_{pq}(k) \end{bmatrix}, \quad p = 1, \ldots, N, \; q = 1, \ldots, M, \qquad (6)$$
we use the equivalent matricial notation

$$\begin{bmatrix} A_1(k+1) \\ \vdots \\ A_S(k+1) \end{bmatrix} = (P \otimes I_N) \begin{bmatrix} A_1(k) \\ \vdots \\ A_S(k) \end{bmatrix} \qquad (7)$$

where $I_N$ is the identity in $\mathbb{R}^{N \times N}$ and $\otimes$ is the Kronecker product. Notice that the notation is suited also for vectorial quantities, e.g. $A_i \in \mathbb{R}^N$.
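To make the notation concrete, the consensus update (7) can be simulated directly. The following sketch (an illustration, not code from the paper; the sizes S, N, M and the ring consensus matrix are arbitrary choices) iterates (P ⊗ I_N) on vertically stacked matrices and checks that every block converges to the average (1/S) Σ A_i.

```python
import numpy as np

S, N, M = 5, 2, 3  # number of agents and matrix dimensions (illustrative)

# Symmetric circulant consensus matrix of a ring graph.
P = np.zeros((S, S))
for i in range(S):
    P[i, i] = 0.5
    P[i, (i - 1) % S] = 0.25
    P[i, (i + 1) % S] = 0.25

rng = np.random.default_rng(0)
A = [rng.standard_normal((N, M)) for _ in range(S)]

# Stack the A_i vertically and iterate update (7): A(k+1) = (P kron I_N) A(k).
stacked = np.vstack(A)            # shape (N*S, M)
PkronI = np.kron(P, np.eye(N))    # shape (N*S, N*S)
for _ in range(200):
    stacked = PkronI @ stacked

average = sum(A) / S
# Every block A_i(k) converges to the average of the initial matrices.
print(np.allclose(stacked[:N, :], average, atol=1e-6))  # prints: True
```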
III. DISTRIBUTED MULTIDIMENSIONAL
CONSENSUS-BASED OPTIMIZATION
Assume the local cost functions to be quadratic, i.e.

$$f_i(x) = \frac{1}{2} (x - b_i)^T A_i (x - b_i)$$

where $A_i > 0$. Straightforward computations show that the unique minimizer of $f$ is given by

$$x^* = \left( \frac{1}{S} \sum_{i=1}^{S} A_i \right)^{-1} \left( \frac{1}{S} \sum_{i=1}^{S} A_i b_i \right)$$

and can thus be computed using the outputs of two average consensus algorithms. Defining the local variables

$$y_i(0) := A_i b_i \in \mathbb{R}^N, \qquad Z_i(0) := A_i \in \mathbb{R}^{N \times N}$$
and the corresponding compact forms

$$Y(k) := \begin{bmatrix} y_1(k) \\ \vdots \\ y_S(k) \end{bmatrix} \in \mathbb{R}^{NS}, \qquad Z(k) := \begin{bmatrix} Z_1(k) \\ \vdots \\ Z_S(k) \end{bmatrix} \in \mathbb{R}^{NS \times N},$$

then the algorithm

$$Y(k+1) = (P \otimes I_N)\, Y(k) \qquad (8)$$
$$Z(k+1) = (P \otimes I_N)\, Z(k) \qquad (9)$$
$$x_i(k) = (Z_i(k))^{-1} y_i(k), \quad i = 1, \ldots, S \qquad (10)$$

alternates average-consensus steps (i.e. (8) and (9), given the considerations in Sec. II-A) with local updates (i.e. (10)), and is s.t. $\lim_{k \to \infty} x_i(k) = x^*$. The element $x_i(k)$ can thus be considered the local estimate of the global minimizer $x^*$.
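As a quick numerical check of (8)-(10), the sketch below (illustrative only; the quadratic costs are randomly generated and the ring consensus matrix is an arbitrary choice) shows the local estimate of a single agent approaching the centralized minimizer x*.

```python
import numpy as np

rng = np.random.default_rng(1)
S, N = 5, 2

# Ring consensus matrix.
P = np.zeros((S, S))
for i in range(S):
    P[i, i], P[i, (i - 1) % S], P[i, (i + 1) % S] = 0.5, 0.25, 0.25

# Random quadratic costs f_i(x) = 0.5 (x - b_i)^T A_i (x - b_i), with A_i > 0.
D = [rng.standard_normal((N, N)) for _ in range(S)]
A = [Di @ Di.T + np.eye(N) for Di in D]
b = [rng.standard_normal(N) for _ in range(S)]

# Local variables y_i(0) = A_i b_i and Z_i(0) = A_i, stacked as in Sec. II.
Y = np.vstack([(Ai @ bi).reshape(N, 1) for Ai, bi in zip(A, b)])  # (NS, 1)
Z = np.vstack(A)                                                  # (NS, N)

PkronI = np.kron(P, np.eye(N))
for _ in range(300):
    Y = PkronI @ Y        # consensus step (8)
    Z = PkronI @ Z        # consensus step (9)

# Local update (10) for agent 0: x_0(k) = (Z_0(k))^{-1} y_0(k).
x0 = np.linalg.solve(Z[:N, :], Y[:N, 0])

# Centralized minimizer x* = (sum_i A_i)^{-1} (sum_i A_i b_i).
x_star = np.linalg.solve(sum(A), sum(Ai @ bi for Ai, bi in zip(A, b)))
print(np.allclose(x0, x_star, atol=1e-6))  # prints: True
```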
The average of the local estimates $\bar{x}(t) := \frac{1}{S} \sum_{i=1}^{S} x_i(t)$ approximately follows the update rule

$$\dot{\bar{x}}(t) = \varepsilon \left[ -\bar{x}(t) + \left( \frac{1}{S} \sum_{i=1}^{S} H_i(\bar{x}(t)) \right)^{-1} \left( \frac{1}{S} \sum_{i=1}^{S} g_i(\bar{x}(t)) \right) \right]$$
(see proof of Prop. 2). In the following we show that this property is appealing since, by properly choosing $g_i(k)$ and $H_i(k)$, we can obtain distributed optimization algorithms with desirable properties such as convergence to the global optimum and small communication bandwidth requirements.
IV. DISTRIBUTED MULTIDIMENSIONAL NEWTON-RAPHSON

Consider the following Alg. 2, based on the general layout given by Alg. 1. We show now how it corresponds to the multidimensional extension of the distributed scalar optimizer described in [16], and that it distributedly computes the global optimum $x^*$. Define

$$G(k) := \begin{bmatrix} g_1(k) \\ \vdots \\ g_S(k) \end{bmatrix} \in \mathbb{R}^{NS}, \qquad H(k) := \begin{bmatrix} H_1(k) \\ \vdots \\ H_S(k) \end{bmatrix} \in \mathbb{R}^{NS \times N}.$$
Algorithm 2 Distributed Newton-Raphson
Execute Alg. 1 with definitions

$$g_i(k) := \nabla^2 f_i(x_i(k)) \, x_i(k) - \nabla f_i(x_i(k)) \in \mathbb{R}^N$$
$$H_i(k) := \nabla^2 f_i(x_i(k)) \in \mathbb{R}^{N \times N}.$$
The first step is then to introduce the additional variables $V(k) = G(k-1)$ and $W(k) = H(k-1)$ and rewrite Alg. 2 as

$$\begin{aligned} V(k) &= G(k-1) \\ W(k) &= H(k-1) \\ Y(k) &= (P \otimes I_N) \left[ Y(k-1) + G(k-1) - V(k-1) \right] \\ Z(k) &= (P \otimes I_N) \left[ Z(k-1) + H(k-1) - W(k-1) \right] \\ x_i(k) &= (1 - \varepsilon)\, x_i(k-1) + \varepsilon\, (Z_i(k))^{-1} y_i(k) \end{aligned} \qquad (11)$$
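A minimal simulation of the recursion (11) can be sketched as follows. The local costs f_i, the initialization Y(0) = V(0) = G(0) and Z(0) = W(0) = H(0) with x_i(0) = 0, and the value of ε are all assumptions made for illustration, since Alg. 1 is not reproduced in full in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(2)
S, N, eps = 5, 2, 0.05

# Ring consensus matrix.
P = np.zeros((S, S))
for i in range(S):
    P[i, i], P[i, (i - 1) % S], P[i, (i + 1) % S] = 0.5, 0.25, 0.25
PkronI = np.kron(P, np.eye(N))

# Hypothetical smooth, strictly convex local costs:
# f_i(x) = sum_n a_in (x_n - b_in)^2 + log cosh(x_n).
a = rng.uniform(0.5, 1.5, size=(S, N))
b = rng.uniform(-2.0, 2.0, size=(S, N))

def grad(i, x):   # gradient of f_i at x
    return 2 * a[i] * (x - b[i]) + np.tanh(x)

def hess(i, x):   # Hessian of f_i at x (diagonal for this choice of f_i)
    return np.diag(2 * a[i] + 1.0 / np.cosh(x) ** 2)

def G_of(X):      # stacked g_i(k) = hess_i x_i - grad_i
    return np.vstack([(hess(i, X[i]) @ X[i] - grad(i, X[i])).reshape(N, 1)
                      for i in range(S)])

def H_of(X):      # stacked H_i(k) = hess_i
    return np.vstack([hess(i, X[i]) for i in range(S)])

X = np.zeros((S, N))               # local estimates x_i(0) = 0
G, H = G_of(X), H_of(X)
Y, Z, V, W = G.copy(), H.copy(), G.copy(), H.copy()

for _ in range(4000):
    Gm, Hm, Vm, Wm = G, H, V, W    # values at time k-1
    V, W = Gm, Hm
    Y = PkronI @ (Y + Gm - Vm)     # consensus tracking of the g_i's
    Z = PkronI @ (Z + Hm - Wm)     # consensus tracking of the H_i's
    for i in range(S):             # local update with step size eps
        X[i] = (1 - eps) * X[i] + eps * np.linalg.solve(
            Z[i * N:(i + 1) * N, :], Y[i * N:(i + 1) * N, 0])
    G, H = G_of(X), H_of(X)

# Compare against a centralized Newton iteration on f = sum_i f_i.
x = np.zeros(N)
for _ in range(100):
    g_tot = sum(grad(i, x) for i in range(S))
    H_tot = sum(hess(i, x) for i in range(S))
    x -= np.linalg.solve(H_tot, g_tot)
print(np.allclose(X[0], x, atol=1e-3))
```

Note how the sums of the blocks of Y and Z are preserved by the doubly stochastic mixing, so at quasi-steady state each agent sees the averages of the g_i's and H_i's.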
from which it is possible to recognize the tracking of the quantities $x_i(k)$ plus the consensus step (1st to 4th rows) and the local smooth updates (5th row). (11) can be considered the Euler discretization, with time interval $T = 1$, of the continuous-time system

$$\begin{aligned} \dot{V}(t) &= -V(t) + G(t) \\ \dot{W}(t) &= -W(t) + H(t) \\ \dot{Y}(t) &= -K Y(t) + (I_{NS} - K) \left[ G(t) - V(t) \right] \\ \dot{Z}(t) &= -K Z(t) + (I_{NS} - K) \left[ H(t) - W(t) \right] \\ \dot{x}_i(t) &= \varepsilon \left[ -x_i(t) + (Z_i(t))^{-1} y_i(t) \right] \end{aligned} \qquad (12)$$
with $K := I_{NS} - (P \otimes I_N)$. It is immediate to show that $K$ is positive semidefinite, that its kernel is generated by $\mathbf{1}_{NS}$, and that its eigenvalues satisfy $0 = \lambda_1 < \mathrm{Re}[\lambda_2] \le \cdots \le \mathrm{Re}[\lambda_{NS}] < 2$, where $\mathrm{Re}[\lambda]$ indicates the real part of $\lambda$. (12) is constituted by two dynamical subsystems with different time scales, one of which is regulated by the parameter $\varepsilon$. Exploiting classical time-separation techniques [19, Chap. 11], splitting the dynamics in the two time scales and studying them separately for sufficiently small $\varepsilon$, it follows that the fast dynamics, i.e. the first four equations of (12), are s.t. $x_i(t) \approx \bar{x}(t)$, where $\bar{x}(t) := \frac{1}{S} \sum_{i=1}^{S} x_i(t)$; moreover, $\bar{x}(t)$ evolves with good approximation following the ordinary differential equation

$$\dot{\bar{x}}(t) = -\varepsilon \left( \nabla^2 f(\bar{x}(t)) \right)^{-1} \nabla f(\bar{x}(t)) \qquad (13)$$

corresponding to a continuous Newton-Raphson algorithm² that we will prove to be always convergent to the global optimum $x^*$, i.e. $\lim_{k \to +\infty} x_i(k) = x^*$ for all $i$.
V. DISTRIBUTED MULTIDIMENSIONAL JACOBI

The implementation of Alg. 2 requires agents to exchange about $O(N^2)$ scalars. This could be prohibitive in multidimensional scenarios with serious communication bandwidth constraints and large $N$. In these cases, to minimize the amount of information to be exchanged, it is meaningful to let $H_i(k)$ be not the whole Hessian matrix $\nabla^2 f_i(x_i(k))$ but only its diagonal. The corresponding algorithm, which we call Jacobi due to the underlying diagonalization process, is offered in Alg. 3. We notice that this diagonalization process has already been used in the literature, e.g., see [24], [25], even if in conjunction with different communication structures.

²Asymptotic properties of the scalar and continuous-time Newton-Raphson method can be found, e.g., in [22], [23].
Algorithm 3 Distributed Jacobi
Execute Alg. 1 with definitions

$$g_i(k) := H_i(k) \, x_i(k) - \nabla f_i(x_i(k)) \in \mathbb{R}^N$$
$$H_i(k) := \begin{bmatrix} \left.\frac{\partial^2 f_i}{\partial x_1^2}\right|_{x_i(k)} & & 0 \\ & \ddots & \\ 0 & & \left.\frac{\partial^2 f_i}{\partial x_N^2}\right|_{x_i(k)} \end{bmatrix} \in \mathbb{R}^{N \times N}.$$
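The computational saving of the Jacobi choice can be seen in isolation: with a diagonal H_i(k), the matrix inversion in the local update degenerates into the Hadamard division (5). A toy sketch (with a hypothetical quadratic cost, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 4
Dm = rng.standard_normal((N, N))
Q = Dm @ Dm.T + np.eye(N)   # Hessian of the hypothetical cost 0.5 x'Qx - p'x
p = rng.standard_normal(N)

x = rng.standard_normal(N)  # current local estimate x_i(k)
grad = Q @ x - p            # gradient of f_i at x_i(k)

h = np.diag(Q).copy()       # Alg. 3 keeps only the N diagonal Hessian entries
g = h * x - grad            # g_i(k) := H_i(k) x_i(k) - grad f_i(x_i(k))

# Inverting the diagonal H_i(k) is just the Hadamard division (5):
# N scalar divisions instead of an O(N^3) matrix inversion.
update = g / h
print(update.shape)         # prints: (4,)
```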
Possible interpretations of the proposed approximation are the following:
- agents perform modified second-order Taylor approximations of the local functions;
- agents choose a steepest-descent direction in a simplified norm;
- the ellipsoids corresponding to the various Hessians $\nabla^2 f_i$ are approximated with ellipsoids having axes that are parallel to the current coordinate system.

It is easy to show that this approximated strategy is not invariant w.r.t. affine transformations of the coordinates, i.e. $f_{\mathrm{new}}(x) = f(Tx)$ with $T \in \mathbb{R}^{N \times N}$ invertible, as classical Newton-Raphson algorithms instead are [26, Sec. 9.5]. It is moreover possible to prove that also Alg. 3 ensures convergence to the global optimum, i.e. to prove the following (proof in the Appendix):
Proposition 3. If Assumption 1 holds true, then there exists an $\varepsilon^* \in \mathbb{R}_+$ s.t., if $\varepsilon < \varepsilon^*$, then

$$\lim_{k \to +\infty} x_i(k) = x^* \quad \text{for all } i.$$

The analytical characterization of the convergence speed of Alg. 2 and Alg. 3 is left as future work.
VI. DISTRIBUTED MULTIDIMENSIONAL GRADIENT DESCENT

We notice now that the distributed Jacobi relieves the computational requirements of the distributed Newton-Raphson, since the inversion of $H_i(x_i(k))$ corresponds to the inversion of $N$ scalars; nonetheless, agents still have to compute the local second derivatives $\left.\frac{\partial^2 f_i}{\partial x_n^2}\right|_{x_i(k)}$. If this task is still too demanding, e.g. in cases where nodes have severe computational constraints, it is possible to redefine $H_i(k)$ in Alg. 1 so that the scheme reduces to a gradient-descent procedure, as done in the following algorithm.
Algorithm 4 Distributed gradient-descent
Execute Alg. 1 with definitions

$$g_i(k) := x_i(k) - \nabla f_i(x_i(k)) \in \mathbb{R}^N$$
$$H_i(k) := I_N \in \mathbb{R}^{N \times N}.$$
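The three instances of the generic scheme differ only in the pair (g_i(k), H_i(k)); in all cases H_i^{-1} g_i = x_i - H_i^{-1} ∇f_i, i.e. a full Newton, diagonally scaled, or plain gradient step respectively. A compact sketch of the three definitions (with a hypothetical quadratic cost, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 3
Dm = rng.standard_normal((N, N))
Q = Dm @ Dm.T + np.eye(N)       # Hessian of a hypothetical quadratic cost
b = rng.standard_normal(N)

def grad(x):                    # gradient of f(x) = 0.5 (x-b)^T Q (x-b)
    return Q @ (x - b)

def hess(x):                    # Hessian (constant for a quadratic)
    return Q

x = rng.standard_normal(N)      # current local estimate

choices = {
    "newton":   hess(x),                    # Alg. 2: full Hessian
    "jacobi":   np.diag(np.diag(hess(x))),  # Alg. 3: its diagonal
    "gradient": np.eye(N),                  # Alg. 4: identity
}

for name, H in choices.items():
    g = H @ x - grad(x)                 # g_i(k) := H_i(k) x_i(k) - grad f_i
    direction = np.linalg.solve(H, g)   # H_i^{-1} g_i ...
    # ... equals x_i minus the (scaled) descent step H_i^{-1} grad f_i:
    assert np.allclose(direction, x - np.linalg.solve(H, grad(x)))
print("all three pairs verified")
```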
VII. DISCUSSION ON THE PREVIOUS ALGORITHMS
The costs associated with the previously proposed strategies are summarized in Tab. I.
Algorithm          | 2     | 3    | 4
Computational cost | O(N³) | O(N) | O(N)
Communication cost | O(N²) | O(N) | O(N)
Memory cost        | O(N²) | O(N) | O(N)

TABLE I
COMPUTATIONAL, COMMUNICATION AND MEMORY COSTS OF ALGORITHMS 2, 3, AND 4 PER SINGLE UNIT AND SINGLE STEP (LINES 6 TO 13 OF ALGORITHM 1).
We notice that approximating the Hessian by neglecting its off-diagonal terms has already been proposed in centralized approaches, e.g. [27]. Intuitively, the effect of this diagonal approximation is the following: the full Newton method performs both a scaling and a rotation of the steepest-descent step, while the diagonally modified Newton method only scales the descent step in each direction. Thus, the more the directions of maximal and minimal curvature are aligned with the axes, the better the approximated method captures the curvature information and the better it performs.
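A toy illustration of this alignment effect (not from the paper): for an axis-aligned quadratic the diagonally scaled direction coincides with the Newton direction, while for a rotated one it does not.

```python
import numpy as np

def directions(Q, x):
    g = Q @ x                          # gradient of 0.5 x^T Q x
    newton = np.linalg.solve(Q, g)     # full Newton direction
    jacobi = g / np.diag(Q)            # diagonally scaled direction
    return newton, jacobi

x = np.array([1.0, 0.0])

# Axis-aligned curvature: diagonal Hessian, the two directions coincide.
Q_aligned = np.diag([1.0, 10.0])
n1, j1 = directions(Q_aligned, x)
print(np.allclose(n1, j1))             # prints: True

# Rotated curvature: strong off-diagonal terms, the directions differ.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Q_rot = R @ Q_aligned @ R.T
n2, j2 = directions(Q_rot, x)
print(np.allclose(n2, j2))             # prints: False
```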
A final remark is that the analytic Hessian can be approximated in several ways, but in general it is necessary to consider only approximations that maintain symmetry and positive definiteness. In cases where this definiteness is lacking, or where matrices are badly conditioned, modifications are usually performed, e.g., through Cholesky factorizations [28].
VIII. NUMERICAL EXAMPLES

We consider a ring communication graph, where agents can communicate only with their left and right neighbors, and thus the symmetric circulant communication matrix

$$P = \begin{bmatrix} 0.5 & 0.25 & & & 0.25 \\ 0.25 & 0.5 & 0.25 & & \\ & \ddots & \ddots & \ddots & \\ & & 0.25 & 0.5 & 0.25 \\ 0.25 & & & 0.25 & 0.5 \end{bmatrix}. \qquad (14)$$
We consider $S = 15$, $N = 2$, and local objective functions randomly generated as

$$f_i(x) = \exp\left( (x - b_i)^T A_i (x - b_i) \right), \quad i = 1, \ldots, S,$$

where $b_i \sim [\,U[-5, 5], \; U[-5, 5]\,]^T$, $A_i = D_i D_i^T > 0$, and

$$D_i := \begin{bmatrix} d_{11} & d_{12} \\ d_{21} & d_{22} \end{bmatrix} \in \mathbb{R}^{2 \times 2}. \qquad (15)$$
We compare the performances of the previous algorithms in the following three different scenarios:

$$S_1 : \quad \begin{cases} d_{11} = d_{22} \sim U[-0.08, 0.08] + R[-1, 1] \\ d_{12} \sim U[-0.08, 0.08] + R[0.25, 0.5] \\ d_{21} \sim U[-0.08, 0.08] + R[-0.5, -0.25] \end{cases} \qquad (16)$$

where the R-distribution is defined as

$$R[c, d] := \begin{cases} c & \text{with probability } 1/2 \\ d & \text{with probability } 1/2, \end{cases}$$

i.e. the axes of each contour plot are randomly oriented in the 2-D plane.
$$S_2 : \quad \begin{cases} d_{11} \sim U[-0.08, 0.08] \\ d_{12} = d_{21} = 0 \\ d_{22} = 2\, d_{11} \end{cases} \qquad (17)$$

i.e. the axes of all the contour plots of the $f_i$ surfaces are aligned with the axes of the natural reference system.

$$S_3 : \quad \begin{cases} d_{11} \sim U[-0.08, 0.08] \\ d_{12} = d_{21} = 0.01 \\ d_{22} \sim R[0.9, 1.1] \, d_{11} \end{cases} \qquad (18)$$

i.e. the axes of each contour plot are randomly oriented along the bisector of the first and third quadrants.
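The construction $A_i = D_i D_i^T$ can be checked numerically. The sketch below samples a scenario of the $S_1$ type; since some operators in (16) did not survive extraction, the summation of the $U[-0.08, 0.08]$ perturbation with the $R$ terms is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

def R(c, d):
    # The R-distribution: c or d, each with probability 1/2.
    return c if rng.random() < 0.5 else d

def sample_D_scenario_1():
    # Hypothetical reading of scenario S1: small uniform perturbations
    # added to the randomly signed R terms.
    d11 = d22 = rng.uniform(-0.08, 0.08) + R(-1.0, 1.0)
    d12 = rng.uniform(-0.08, 0.08) + R(0.25, 0.5)
    d21 = rng.uniform(-0.08, 0.08) + R(-0.5, -0.25)
    return np.array([[d11, d12], [d21, d22]])

# A_i = D_i D_i^T is positive semidefinite by construction; whenever D_i is
# invertible (here always, since det D_i = d11*d22 - d12*d21 > 0), A_i > 0.
for _ in range(100):
    Di = sample_D_scenario_1()
    Ai = Di @ Di.T
    assert np.all(np.linalg.eigvalsh(Ai) > 0)
print("all sampled A_i are positive definite")
```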
The contour plots of the global cost functions $f$ generated using (16), (17) and (18), and the evolutions of the local states $x_i$ for the three algorithms, are shown in Fig. 1. We notice that Alg. 2 and Alg. 3 have qualitatively the same behavior in scenarios (16) and (17). This is because the approximation introduced in Alg. 3 is actually a good approximation of the analytical Hessians $\nabla^2 f_i(x_i(k))$. Conversely, Alg. 4 presents a remarkably slower convergence rate. Since the computational times of Alg. 3 and Alg. 4 are comparable, Alg. 3 seems to represent the best choice among all the presented solutions.
IX. CONCLUSIONS AND FUTURE WORKS

Starting from [16], we offered a multidimensional distributed convex optimization algorithm that behaves approximately as a Newton-Raphson procedure. We then proposed two approximated versions of the main algorithm to take into account the computational, communication and memory constraints that may arise in practical scenarios. We produced proofs of convergence under the assumption of dealing with smooth convex functions, and numerical simulations to compare the performances of the proposed algorithms.
Currently there are many open research directions. A first branch concerns the analytical characterization of the convergence speeds of the proposed strategies, while another one concerns the application of quasi-Newton methods, to avoid the computation of the Hessians, and the use of trust-region methods. Finally, an important future extension is to allow the strategy to be implemented in asynchronous communication frameworks.
Fig. 1. First column on the left: contour plots of the global function $f$ for scenarios $S_1$, $S_2$, $S_3$, respectively (from top to bottom). Black dots indicate the positions of the global minima $x^*$. Second, third and fourth columns: temporal evolutions of the first components of the local states $x_i$, for the case $\varepsilon = 0.25$ and $S = 15$. In particular: second column, distributed Newton-Raphson (Alg. 2); third column, distributed Jacobi (Alg. 3); fourth column, distributed gradient descent (Alg. 4). First row, scenario $S_1$; second row, scenario $S_2$; third row, scenario $S_3$. The black dashed lines indicate the first components of the global optima $x^*$. Notice that we show a larger number of time steps for the gradient descent algorithm (fourth column).
REFERENCES
[1] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1997.
[2] D. P. Bertsekas, Network Optimization: Continuous and Discrete Models. Belmont, Massachusetts: Athena Scientific, 1998.
[3] K. C. Kiwiel, "Convergence of approximate and incremental subgradient methods for convex optimization," SIAM J. on Optim., vol. 14, no. 3, pp. 807-840, 2004.
[4] B. Johansson, "On distributed optimization in networked systems," Ph.D. dissertation, KTH Electrical Engineering, 2008.
[5] A. Nedić and D. P. Bertsekas, "Incremental subgradient methods for nondifferentiable optimization," SIAM J. on Optim., vol. 12, no. 1, pp. 109-138, 2001.
[6] D. Blatt, A. Hero, and H. Gauchman, "A convergent incremental gradient method with a constant step size," SIAM J. on Optim., vol. 18, no. 1, pp. 29-51, 2007.
[7] S. S. Ram, A. Nedić, and V. Veeravalli, "Incremental stochastic subgradient algorithms for convex optimization," SIAM J. on Optim., vol. 20, no. 2, pp. 691-717, 2009.
[8] A. Nedić, A. Ozdaglar, and P. A. Parrilo, "Constrained consensus and optimization in multi-agent networks," IEEE TAC, vol. 55, no. 4, pp. 922-938, 2010.
[9] L. Xiao, M. Johansson, and S. Boyd, "Simultaneous routing and resource allocation via dual decomposition," IEEE Trans. on Comm., vol. 52, no. 7, pp. 1136-1144, 2004.
[10] I. D. Schizas, A. Ribeiro, and G. B. Giannakis, "Consensus in ad hoc WSNs with noisy links - part I: Distributed estimation of deterministic signals," IEEE Trans. on Sig. Proc., vol. 56, pp. 350-364, 2008.
[11] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Stanford Statistics Dept., Tech. Rep., 2010.
[12] C. Fischione, "F-Lipschitz optimization with Wireless Sensor Networks applications," IEEE TAC, to appear, 2011.
[13] C. Fischione and U. Jönsson, "Fast-Lipschitz optimization with Wireless Sensor Networks applications," in IPSN, 2011.
[14] J. Van Ast, R. Babuška, and B. De Schutter, "Particle swarms in optimization and control," in IFAC World Congress, 2008.
[15] E. Alba and J. M. Troya, "A survey of parallel distributed genetic algorithms," Complexity, vol. 4, no. 4, pp. 31-52, 1999.
[16] F. Zanella, D. Varagnolo, A. Cenedese, G. Pillonetto, and L. Schenato, "Newton-Raphson consensus for distributed convex optimization," in IEEE Conference on Decision and Control, 2011.
[17] F. Garin and L. Schenato, Networked Control Systems. Springer, 2011, ch. "A survey on distributed estimation and control applications using linear consensus algorithms," pp. 75-107.
[18] P. Kokotović, H. K. Khalil, and J. O'Reilly, Singular Perturbation Methods in Control: Analysis and Design, ser. Classics in Applied Mathematics. SIAM, 1999, no. 25.
[19] H. K. Khalil, Nonlinear Systems, 3rd ed. Prentice Hall, 2001.
[20] Y. C. Ho, L. Servi, and R. Suri, "A class of center-free resource allocation algorithms," Large Scale Systems, vol. 1, pp. 51-62, 1980.
[21] L. Xiao and S. Boyd, "Optimal scaling of a gradient method for distributed resource allocation," J. Opt. Theory and Applications, vol. 129, no. 3, pp. 469-488, 2006.
[22] K. Tanabe, "Global analysis of continuous analogues of the Levenberg-Marquardt and Newton-Raphson methods for solving nonlinear equations," Inst. of Stat. Math., vol. 37, pp. 189-203, 1985.
[23] R. Hauser and J. Nedić, "The continuous Newton-Raphson method can look ahead," SIAM J. on Opt., vol. 15, pp. 915-925, 2005.
[24] S. Athuraliya and S. H. Low, "Optimization flow control with Newton-like algorithm," Telecommunication Systems, vol. 15, no. 3-4, pp. 345-358, 2000.
[25] M. Zargham, A. Ribeiro, A. Ozdaglar, and A. Jadbabaie, "Accelerated dual descent for network optimization," in ACC, 2011.
[26] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[27] S. Becker and Y. LeCun, "Improving the convergence of back-propagation learning with second order models," University of Toronto, Tech. Rep. CRG-TR-88-5, September 1988.
[28] G. H. Golub and C. F. Van Loan, Matrix Computations. Johns Hopkins University Press, 1996, sec. 4.2.