Primal-Dual Strategy (PDS) for Composite Optimization Over Directed Graphs
Abstract
We investigate the distributed multi-agent sharing optimization problem over a directed graph, with a composite objective function consisting of a smooth function plus a convex (possibly non-smooth) function shared by all agents. While adhering to the network connectivity structure, the goal is to minimize the sum of smooth local functions plus a non-smooth function. The proposed Primal-Dual (PD) algorithm is similar to a previous algorithm [43], but it offers additional benefits. First, we study the problem over directed graphs, where agents can communicate in only one direction and the combination matrix is not symmetric. Furthermore, the combination matrix is time-varying, and its coefficient weights are produced by an adaptive procedure. Under the strong convexity assumption and with these adaptive, time-varying coefficient weights, we derive a new upper bound on the step-sizes and show that linear convergence is achieved in the presence of both smooth and non-smooth terms. Simulation results show the efficacy of the proposed algorithm compared with other algorithms.
Index Terms:
Primal-Dual (PD) algorithm, directed graph, adaptive coefficient weights.

I Introduction
In a decentralized approach, agents, which exchange information only with their immediate neighbors, are connected to form a multi-agent cooperative network that estimates an unknown parameter through in-network processing. The goal of this work is to modify and extend the multi-agent sharing optimization problem of [43] in a distributed setting. In recent years, there has been a surge of interest in distributed optimization, i.e., minimizing the average/sum of convex functions over a network of nodes, where each agent has access only to a private function . This problem frequently arises in formation control [13] and non-autonomous power control [14]. The coupled multi-agent sharing optimization problem involves two terms [43]: the sum of smooth local functions plus a convex (possibly non-smooth) function coupling all agents in the network. Agents aim to solve the following problem:
(1) |
where function is only known to agent , and non-smooth function is known by all agents. Note that is a convex (possibly non-smooth) function, where the matrix is known by agent only. The coupled multi-agent optimization problem of (1) represents the sharing formulation, where the agents (with their various variables) are coupled through . The applications of problem (1) include regression over distributed features [1], [2], dictionary learning over distributed models [3], clustering in graphs [4], smart grid control [5], and network utility maximization [6], to name a few.
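As a concrete illustration of the sharing formulation in (1), the sketch below evaluates a composite objective of this type. The quadratic local costs, the l1 choice for the coupling function, and all names are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

# Hedged sketch of the sharing objective in (1): each agent k holds a private
# smooth loss J_k (here quadratic, an assumption) and a local matrix B_k; the
# non-smooth g (here the l1-norm, an assumption) couples all agents through
# the sum of the local terms B_k @ w_k.

def sharing_objective(w_list, B_list, A_list, b_list, lam=0.1):
    """Evaluate sum_k 0.5||A_k w_k - b_k||^2 + lam * || sum_k B_k w_k ||_1."""
    smooth = sum(0.5 * np.linalg.norm(A @ w - b) ** 2
                 for A, w, b in zip(A_list, w_list, b_list))
    coupled = sum(B @ w for B, w in zip(B_list, w_list))
    return smooth + lam * np.linalg.norm(coupled, 1)
```

The coupling term is what forces cooperation: no single agent can evaluate the l1 term from its own variable alone.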
The authors in [43] proposed a decentralized algorithm for (1) and established its linear convergence to the global solution, deriving their algorithm by reformulating (1) as an equivalent saddle-point problem. In this work, we use a similar technique to develop a novel decentralized algorithm over a directed graph based on adaptive coefficient weights, together with a new convergence rate. Weight balance is usually not possible in real-world scenarios; we therefore aim to develop optimization and learning algorithms that are suitable for and applicable to directed graphs. The main challenge in dealing with directed graphs is that the doubly-stochastic weight matrix must be replaced by either a row-stochastic or a column-stochastic weighted adjacency matrix. Row-stochasticity of the weight matrix guarantees that all agents reach consensus, while column-stochasticity ensures optimality [7]. Compared with undirected graphs, directed graph topologies [50] cover a wider range of applications and may also result in lower communication costs and simpler topologies. Most existing methods for optimization over directed graphs combine average-consensus techniques created for directed graphs with optimization algorithms designed for undirected graphs. For example, the subgradient-push method, described in [8] and extensively investigated in [9], combines push-sum consensus [10] with distributed gradient descent (DGD). Directed-Distributed Gradient Descent (D-DGD) [7, 12] is a linear technique over directed graphs based on surplus consensus [11] and DGD.
However, due to the diminishing step-size, such DGD-based algorithms converge relatively slowly for both general convex and strongly convex functions.
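The row- and column-stochastic weight constructions discussed above can be sketched as follows. Uniform weights over in-/out-neighbors are an assumption here (the paper later derives adaptive weights); the point is only how each normalization direction follows from the directed adjacency structure.

```python
import numpy as np

# Hedged sketch: building row-stochastic and column-stochastic combination
# matrices from a directed graph's adjacency structure, used in place of a
# doubly-stochastic matrix when weight balance is impossible.

def stochastic_weights(adj):
    """adj[k, l] = 1 if agent l sends to agent k (self-loops included)."""
    adj = np.asarray(adj, dtype=float)
    row = adj / adj.sum(axis=1, keepdims=True)   # rows sum to 1: consensus
    col = adj / adj.sum(axis=0, keepdims=True)   # columns sum to 1: optimality
    return row, col
```

Row normalization needs only each agent's in-neighbors, while column normalization requires each agent to know its out-degree, which is exactly the requirement removed in [26].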
I-A Related Works
Information exchange within a multi-agent cooperative network (through coupled neighbors) aims to estimate an unknown parameter. This cooperative learning and estimation has applications in various fields such as signal processing, optimization, and control [30], [31]. Information exchange can be organized in a distributed or a centralized manner. In the centralized approach, data are collected from the whole network, processed in a fusion center, and the result is sent back to the agents. This approach is powerful, yet it suffers from limitations such as limited bandwidth and vulnerability of the fusion center to failure. Decentralized methods refer to fully distributed solutions: they enable parallel processing and spread the computational load across multiple agents. When the bandwidth around the central agent is low, the performance of centralized methods degrades, whereas decentralized methods can still converge quickly under limited bandwidth. Furthermore, in some sensitive applications, privacy and confidentiality prevent sending information to a fusion center; in a distributed model, the agents share information only with their immediate neighbors. Such models apply to social, transportation, and biological networks [30, 51].
Decentralized optimization algorithms for large-scale networks have attracted many studies. For instance, [15, 16] consider a centralized topology with one agent at the center, which may fail or violate privacy requirements. The incremental algorithm [17, 20] may be considered as a non-centralized alternative based on a directed ring topology. Furthermore, to manage a time-varying network, [21] proposes a distributed subgradient (DSG) algorithm, which is slow because of the decaying step-size required to obtain a consensual and optimal solution.
In this paper, we propose a new convergence rate for a primal-dual algorithm in a directed graph with adaptive coefficient weights.
Moreover, if in (1) is a non-smooth regularizer shared across all the agents, a proximal decentralized algorithm with a fixed point matches the desired global solution. On the other hand, if , then decentralized primal methods can only converge to a biased solution unless a decaying step-size is used, which slows down the convergence rate [43]. As a result, one of the difficulties in this field is the speed of convergence, since some methods are hampered by decreasing step-sizes. To accelerate convergence, we exploit strong convexity and Lipschitz-continuous gradients. Some methods employ a constant step-size and converge geometrically to an error ball centered on the optimal solution [34]. The authors in [22] achieve geometric convergence with the help of symmetric weights. In [23], the authors combine inexact gradient methods with a gradient estimation technique based on dynamic average consensus, where the underlying graphs must be undirected or weight-balanced. The convergence rate of such methods can be limited by the decreasing step-size, depending on whether the objective function is generally convex or strongly convex.
Recently, the authors in [24] proposed a fast distributed algorithm, called DEXTRA, in which an appropriate step-size yields linear convergence when the objective functions are strongly convex with Lipschitz-continuous gradients. In this method, a decreasing step-size leads to slow convergence; however, if the step-size is chosen from a suitable interval and the objective functions are strongly convex with Lipschitz-continuous gradients, the iterates converge geometrically to the global optimal point [48]. ADD-OPT/Push-DIGing [25], an improvement of [24], provides geometric convergence with a sufficiently small step-size. In the DEXTRA [24] and ADD-OPT/Push-DIGing [25] algorithms, each node needs to know its out-degree in order to build a column-stochastic weight matrix. In [26], the authors eliminate this requirement by using row-stochastic weights, as each agent locally decides the weights assigned to incoming information. Fast methods over directed graphs share one thing in common: they are all based on push-sum-type techniques. This results in a nonlinear algorithm, since an independent iteration is required to asymptotically learn either the right or the left eigenvector corresponding to the eigenvalue 1 [49], which entails additional computation and communication among the agents.
This work focuses on the case where communication between nodes is directed and time-varying. We analyze the convergence rate with a decreasing step-size for the non-smooth function. First, consider the case where is smooth. The decentralized primal methods in [32]–[35] cannot achieve an exact solution unless a decaying step-size is applied. In [36], a decentralized method based on the alternating direction method of multipliers (ADMM) improves the convergence rate. Gradient tracking methods were applied in [37]–[38] to simplify the implementation, and dual-domain and accelerated dual gradient descent methods were utilized in [39]. When is non-smooth, a proximal gradient method is useful, but the increasing number of inner loops makes it computationally expensive [40]. If the non-smooth function is the indicator function of zero, a good convergence rate is attained [41], [42]. For the proximal decentralized algorithm, asymptotic linear convergence holds even when all the functions are piecewise linear-quadratic (PLQ) [47]; with a sufficiently large number of iterations, a good convergence rate can be obtained.
I-B Contribution
In this paper, we develop a modified version of the primal-dual (PD) strategy [43], [44] over a distributed network, on a directed graph with adaptive coefficient weights. The purpose of this work is to establish new convergence rates on directed graphs.
The contributions of this paper are summarized as follows:
- • To achieve both good convergence and good performance, we optimize the adaptive combination coefficients, over a directed graph, to reduce the communication load and the energy consumption of each agent.
- • Consequently, we derive new bounds on the step-sizes which result in a lower squared error (see Theorem 1).
- • Simulation results also show that the proposed algorithm achieves a faster convergence rate.
The rest of the paper is organized as follows. In Section II, we present the problem formulation, the saddle-point reformulation, and the proposed primal-dual diffusion strategy, including the implementation of proximal adapt-then-combine (ATC) for the dual-diffusion algorithm and the diffusion tracking method. In Section III, we present the proposed proximal dual diffusion method, i.e., PD-dMVC, and derive the theorem that gives the new step-sizes. Section IV discusses fast convergence, and Section V compares the convergence properties and performance of the algorithms. In Section VI, we provide simulation results to evaluate the performance of the proposed algorithm. We conclude the paper in Section VII.
I-C Notation and Symbols
Throughout the paper, all vectors are column vectors, with the exception of the regression vector denoted by , which is a row vector. denotes the identity matrix of size and denotes the vector with all entries equal to one. Lowercase letters denote vectors and uppercase letters denote matrices. and imply that is a positive definite matrix ( and are both symmetric matrices of the same dimension). denotes a column vector structured by stacking the elements on top of each other. , for a vector and a square matrix . The subdifferential of a function at is the set of all its subgradients. The proximal operator of a function with step-size is . The conjugate of a function is defined as . A differentiable function is -smooth and -strongly convex if and , respectively, for any and .
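The proximal operator defined in the notation above can be illustrated with a concrete choice of function. Taking the l1-norm (an assumption for illustration; the paper keeps generic) gives the familiar soft-thresholding map:

```python
import numpy as np

# Hedged illustration of the proximal operator from the notation section:
# prox_{mu*g}(x) = argmin_u { g(u) + (1/(2*mu)) * ||u - x||^2 }.
# For g = ||.||_1 this minimizer has the closed form below (soft-thresholding).

def prox_l1(x, mu):
    """Entrywise soft-thresholding: shrink |x| by mu, clip at zero."""
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)
```

Entries with magnitude below the step-size are set exactly to zero, which is why this operator promotes sparsity in the compressed-sensing experiment later in the paper.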
II Problem formulation
II-A Problem Reformulation
II-A1 Primal dual decentralized algorithms
According to some studies, adapt-then-combine (ATC) methods, such as the primal-dual recursion, have larger step-size stability ranges than non-ATC methods. The primal-dual framework was the first primal-dual interpretation of ATC gradient tracking methods [ref-ref]. When the step-size is kept constant, diffusion methods solve approximations of the optimization problem and converge to biased solutions. To converge to an exact solution, primal-based methods require a decaying step-size, which, however, slows down the convergence rate. As a result, primal-dual methods are proposed to solve the problem without bias and converge to exact minimizers while communication is limited. Formulations come in two kinds: with and without a penalty term. The benefit of a penalty term is that it improves the convergence rate of a decentralized algorithm when the aggregate cost is well-conditioned and strongly convex even if the individual costs are not. An equivalent approach is to reformulate the problem over a network and seek the unknown parameters of interest. Indeed, minimizing a sum of local cost functions through a general algorithmic framework is a central topic in decentralized optimization. The saddle-point solution structure comprises primal-descent, dual-ascent, and combination equations, which cooperate as follows: a gradient-descent step, followed by a gradient-ascent step, followed by a combination step.
In this section, we consider saddle-point problems that reformulate (1). A decentralized algorithm cannot be obtained by directly solving the first equivalent saddle-point problem; as a result, further reformulation is required to derive a decentralized solution. The network quantities are introduced as follows:
(2)
(3)
(4)
Then, the problem (1) can be written as follows:
(5)
Assumption 1
Remark 1
Because the objective function in (5) is strongly convex, the global solution is unique.
Problem (5) is equivalent to the following saddle-point reformulation under strong duality and Assumption 1.
(6) |
where is the conjugate function of and is the dual variable. Because the dual variable couples all agents, requiring a fusion center to compute the dual update, solving (6) directly does not lead to a decentralized algorithm. As a result, additional reformulations are required: a local copy of the dual variable at agent , i.e., , is introduced to solve (6) in a decentralized manner. The following network quantities are introduced:
(7) |
If we define as a symmetric matrix, we can write:
(8) |
The saddle-point type-2 problem is written as:
(9) |
Problem (9) can be solved in a decentralized manner by using an optimal point .
Lemma 1
(Optimal Point) Subject to Assumption 1 and (8), a primal-dual pair and a fixed point are optimal if, and only if, they satisfy the optimality conditions for saddle-point 1 and saddle-point 2 below, respectively.
(10)
(11)
(12)
(13)
(14)
II-A2 Decentralized Strategy
Let and take arbitrary values and .
(15)
(16)
(17)
(18)
where and is a left-stochastic or right-stochastic combination matrix.
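The primal-descent / dual-ascent / combine structure of (15)-(18) can be sketched, under assumptions, as a single iteration. The quadratic local costs, the specific coupling of the dual variable, and all symbol names are illustrative; the paper's actual recursions include additional proximal and tracking terms.

```python
import numpy as np

# Structural sketch (not the paper's exact recursion) of one primal-descent /
# dual-ascent / combine step: each agent takes a gradient step on its smooth
# cost, a local dual-ascent step, and then combines dual iterates with a
# stochastic matrix A over the graph. Costs J_k(w) = 0.5*||w - b_k||^2 are an
# illustrative assumption.

def primal_dual_step(w, y, b, A, mu_w, mu_y):
    """w, y, b: (N, M) stacked agent iterates; A: (N, N) stochastic matrix."""
    grad = w - b                      # local gradients of the quadratic costs
    psi = w - mu_w * (grad + y)       # primal descent, with dual coupling via y
    y_half = y + mu_y * psi           # local dual ascent
    y_new = A @ y_half                # combine dual iterates over the graph
    return psi, y_new
```

In the proposed strategy, A would be the time-varying adaptive combination matrix described later, rather than a fixed matrix.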
II-A3 Implementation of proximal ATC for dual-diffusion algorithm
II-A4 Diffusion Tracking Method
II-A5 Adaptive Combination Matrix
We consider the following definitions and assumptions of the combination matrix:
(26) |
where is zero if there is no edge between agents and . Matrix is assumed to be a primitive, symmetric, left/right-stochastic matrix; , which is also a primitive, symmetric, left/right-stochastic matrix; . If the eigenvalues of belong to (-1,1], many choices for exist.
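One standard choice satisfying these combination-matrix conditions on undirected links is the Metropolis rule, which the simulation section uses to compute combination coefficients. A sketch under the assumption of a symmetric 0/1 adjacency matrix without self-loops:

```python
import numpy as np

# Hedged sketch of the Metropolis rule: off-diagonal weights are the inverse
# of one plus the larger of the two endpoint degrees, and the self-weight
# absorbs the remainder so every row sums to one. On a symmetric adjacency
# this yields a symmetric doubly-stochastic matrix.

def metropolis_weights(adj):
    """adj: symmetric 0/1 adjacency matrix (no self-loops)."""
    adj = np.asarray(adj)
    N = adj.shape[0]
    deg = adj.sum(axis=1)
    A = np.zeros((N, N))
    for k in range(N):
        for l in range(N):
            if adj[k, l]:
                A[k, l] = 1.0 / (1 + max(deg[k], deg[l]))
        A[k, k] = 1.0 - A[k].sum()   # self-weight completes the row to 1
    return A
```

On directed graphs this symmetry is unavailable, which is precisely why the paper replaces such static rules with adaptive, time-varying coefficient weights.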
Assumption 2
(Combination Matrix) Matrix satisfies condition (8), and then the condition below holds:
II-A6 Theorem (Convergence-rate)
Theorem 1
Subject to Assumptions 1 and 2, if is full row rank and the step-sizes and are strictly positive and satisfy
(27)
(28)
it holds for all and some that:
(29) |
where
See Appendix A.
Remark 2
is the smallest eigenvalue of , denotes the smallest non-zero singular value of , and is the largest singular value of .
Remark 3
According to Theorem 1, the PD algorithm converges fast for non-smooth if has full row rank.
II-A7 Data Model
- 1. If we assume that the cost function is -smooth, then the following inequality is established,
(30)
- 2.
III Proximal Decentralized Algorithm
To manage the non-smooth term , it is defined as
(32) |
III-A Proximal Dual Diffusion Method
The following recursion is suggested based on (39): for , and each agent , if , then
(33)
(34)
(35)
(36)
(37)
TABLE II: Total time and number of iterations of the algorithms.
Algorithm | Total time (seconds) | Number of iterations
EXTRA ([22]) | |
Aug-DGM ([27]) | |
DIGing ([38]) | |
NIDS ([28]) | |
Exact Diffusion ([29]) | |
Primal-Dual | |
Primal-Dual Diffusion | |
The desired variance can be estimated iteratively, with the factor parameter estimated by agent by running the following smoothing filter:
(38) |
where {}, which is needed to run the recursion at agent , is computed by agent at iteration . To remove the need to transmit from agent to its neighbors, the estimation of is designed into the neighbors of agent . This gives the advantage of not sharing the value and overcomes the difficulty of agent accessing in the ATC diffusion implementation. The mechanism works as follows: upon receiving from its neighbor , we replace by , since for . Note that the iterates at the various agents approach an optimal point. With this substitution, agent can now locally estimate the variance of its neighbor by running a smoothing filter of the following form:
(39) |
The above expression provides part of an adaptive construction for the combination weights . In addition, the adaptive weights are factors used to evaluate .
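The smoothing-filter construction above can be sketched as follows. The inverse-variance weighting rule, the smoothing constant, and all names are assumptions patterned on relative-variance adaptive combiners, not the paper's exact equations (38)-(39).

```python
import numpy as np

# Hedged sketch of adaptive combination weights: agent k smooths a variance
# proxy for each neighbor l using the deviation ||psi_l - w_k||^2, then
# assigns combination weights inversely proportional to those proxies, so
# neighbors whose iterates track agent k's estimate closely get more weight.

def adaptive_weights(gamma2, psi, w_k, neighbors, nu=0.1):
    """Update variance proxies gamma2 (dict l -> float); return weights."""
    for l in neighbors:
        err = np.linalg.norm(psi[l] - w_k) ** 2
        gamma2[l] = (1 - nu) * gamma2[l] + nu * err   # smoothing filter
    inv = {l: 1.0 / gamma2[l] for l in neighbors}
    total = sum(inv.values())
    return {l: inv[l] / total for l in neighbors}     # weights sum to one
```

Because the proxies are updated every iteration, the resulting combination matrix is time-varying, matching the setting analyzed in Theorem 1.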
The proposed Proximal-Dual diffusion algorithm along with its adaptive coefficient weights are described in Algorithm 1 and Algorithm 2, respectively.
IV Fast Convergence
If , then belongs to the range space of . As a result, always remains in the range space of . The iterates converge to this point . The error quantities are as follows:
(40)
(41)
(42)
(43)
The expressions in (33)-(37) produce the following error recursions:
(44)
(45)
(46)
(47)
The network mean-square deviation is:
(48) |
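The network mean-square deviation in (48) averages each agent's squared error relative to the global minimizer; a direct sketch (assuming the iterates are stacked row-wise) is:

```python
import numpy as np

# Hedged sketch of the network MSD: the average over agents of the squared
# deviation of each agent's iterate from the global solution w_star.

def network_msd(W, w_star):
    """W: (N, M) stacked agent iterates; w_star: (M,) global solution."""
    return np.mean(np.sum((W - w_star) ** 2, axis=1))
```

This is the quantity plotted (typically in dB) when comparing the algorithms in the simulation section.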
V Convergence properties and performance comparison of algorithms
Table I compares the convergence properties of distributed algorithms, in terms of support for prox operators, convergence rate, and step-size upper bound, for the proposed primal-dual diffusion (PDD) and primal-dual (PD) algorithms against the NIDS, EXTRA, DIGing-ATC, Aug-DGM, and Exact Diffusion algorithms. As can be seen, the proposed primal-dual diffusion algorithm has the same step-size upper bound as the primal-dual algorithm, while the convergence rates of both PDD and PD are higher than those of NIDS, EXTRA, DIGing-ATC, Aug-DGM, and Exact Diffusion. As shown in Table I, PDD and PD require fewer parameters than the DIGing, EXTRA, and Aug-DGM algorithms, so their upper bounds are simpler. Table II compares the performance of the algorithms over directed and undirected graphs in terms of total time and number of iterations. All algorithms were run for the same number of iterations; as shown, the total time differs across algorithms, and the PDD algorithm outperforms the others.
VI Simulation Results
For distributed parameter estimation, simulations for a network with N = 20 nodes are presented. Figure 1 (a) depicts the directed network topology used in all simulations. The Metropolis rule is used to compute the combination coefficients.
VI-A Decentralized compressed sensing
Consider a decentralized compressed sensing problem involving a number of network agents. Each agent has its own measurement via , where is an unknown sparse signal and is an i.i.d. Gaussian noise vector. We consider as a non-smooth function and, to simplify the problem, set the value to 1. To estimate the sparse vector , we use the decentralized algorithm.
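The measurement model of this experiment can be sketched as follows. The dimensions, sparsity level, noise scale, and variable names are assumptions for illustration; the paper specifies only N = 20 agents.

```python
import numpy as np

# Hedged sketch of the decentralized compressed sensing data model:
# y_k = U_k @ w0 + v_k, with a sparse ground truth w0 shared by all agents
# and i.i.d. Gaussian noise v_k. All dimensions below are assumptions.

rng = np.random.default_rng(0)
N, M, S, m_k = 20, 50, 5, 10        # agents, signal dim, sparsity, rows/agent

w0 = np.zeros(M)
support = rng.choice(M, size=S, replace=False)
w0[support] = rng.standard_normal(S)                    # sparse ground truth

U = [rng.standard_normal((m_k, M)) for _ in range(N)]   # local sensing matrices
y = [Uk @ w0 + 0.01 * rng.standard_normal(m_k) for Uk in U]  # noisy measurements
```

Each agent sees only its own pair (U_k, y_k), so recovering w0 requires cooperation over the directed graph.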
VI-A1 The case without non-smooth function
Consider the decentralized problem of solving for an unknown signal in the absence of a non-smooth function.
VI-A2 The case with non-smooth function
A non-smooth function was considered to solve a decentralized compressed sensing problem.
The comparison of the simulation results for the proposed primal-dual diffusion (PDD) and primal-dual (PD) algorithms with those of the NIDS-adaptive, NIDS, EXTRA, and DIGing-ATC algorithms is shown in Fig. 1 (a). Note that in this scenario, we ignored the non-smooth function. It is seen that the proposed PDD outperforms the other algorithms in terms of squared error.
In Fig. 1 (b), we compare the performance of the proposed PDD algorithm with those of the EXTRA and PD algorithms in the presence of a non-smooth function. As can be seen, the proposed PDD outperforms the EXTRA and PD algorithms.
VII Conclusion
We have studied the distributed multi-agent sharing optimization problem over a directed graph, with a composite objective function consisting of a smooth function plus a non-smooth function shared by all agents in the network. To solve the problem, several reformulations were applied. A new upper bound on the step-sizes was obtained, from which linear convergence can be achieved under the strong convexity assumption. Furthermore, we proposed adaptive coefficient weights that are time-varying. We showed that the proposed algorithm achieves linear convergence to the global minimizer with the adaptive coefficient weights and the new step-size bounds.
Appendix A Proof of Theorem 1
To solve the recursions (44)-(47), we first need to state some assumptions, and then square both sides of them to reach the optimal step-sizes of the PD algorithm and the convergence-rate equation:
(49)
(50)
(51)
(52)
Based on the per-agent formulation, the equations are reformulated as follows:
(53)
(54)
(55)
(56)
(57)
(58)
(59)
(60)
(61)
(62)
Now, squaring (54), (57), (59), and (62),
(63)
(64)
(65)
(66)
Adding and to both sides of (66):
(67) |
(68) |
Then, substituting (68) into (64) yields,
(69) |
Subtracting from both sides of (69) yields,
(70) |
Solving ():
(71) |
Similarly to (68), we have,
(72) |
Substituting (72) into () yields,
(73) |
(74) |
Multiplying (74) by ,
(75) |
To solve (), data models No. 6 and No. 7 need to be utilized.
(76) |
Relying on data model No. 6,
(77) |
and data model No.7,
(78) |
(79) |
then,
(80) |
For positive step-sizes, , , we have . In addition, ; consequently, (75) finally yields:
(81) |
Assumption 3
If and lie in the range space of , then,
(82) |
where is the minimum non-zero singular value of , and is assumed to be the smallest eigenvalue of , which is positive-definite, and .
(83) |
Remark 4
Squaring both sides of (62) introduces the equation below,
(84) |
References
- [1] S. Boyd, N. Parikh, and E. Chu, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” in Found. Trends Mach. Lear, vol. 3, no. 1, pp. 1–22, 2011.
- [2] S. Sundhar Ram, A. Nedic, and V. V. Veeravalli, “A new class of distributed optimization algorithms: Application to regression of distributed data,” in Optimization Methods and Software, vol. 27, no. 1, pp. 71–88, 2012.
- [3] J. Chen, Z. J. Towfic, and A. H. Sayed, “Dictionary learning over distributed models,” in IEEE Trans. Signal Process., vol. 63, no. 4, pp. 1001–1016, 2015.
- [4] D. Hallac, J. Leskovec, and S. Boyd, “Network LASSO: Clustering and optimization in large graphs,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 387–396, 2015.
- [5] T.-H. Chang, A. Nedic, and A. Scaglione, “Distributed constrained optimization by consensus-based primal-dual perturbation method,” in IEEE Trans. Autom. Control, vol. 59, no. 6, pp. 1524–1538, 2014.
- [6] D. P. Palomar and M. Chiang, “A tutorial on decomposition methods for network utility maximization,” in IEEE Journal on Selected Areas in Communications, vol. 24, no. 8, pp. 1439–1451, 2006.
- [7] C. Xi, Q. Wu, and U. A. Khan, “On the distributed optimization over directed networks,” in Neurocomputing, vol. 267, pp. 508–515, 2017.
- [8] K. I. Tsianos, S. Lawlor, and M. Rabbat, “Push-sum distributed dual averaging for convex optimization,” in 2012 IEEE 51st Conference on Decision and Control (CDC), pp. 5453–5458, 2012.
- [9] A. Nedić, A. Olshevsky, “Distributed optimization over time-varying directed graphs,” in IEEE Transactions on Automatic Control, vol. 60, no. 3, pp. 601–615, 2014.
- [10] D. Kempe, A. Dobra, and J. Gehrke “Gossip-based computation of aggregate information,” in 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings. pp. 482–491, 2003.
- [11] K. Cai, H. Ishii “Average consensus on general strongly connected digraphs,” in Automatica, vol. 48, no. 11, pp. 2750–2761, 2012.
- [12] C. Xi, U. A. Khan “Distributed subgradient projection algorithm over directed graphs,” in IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3986–3992, 2016.
- [13] A. Olshevsky “Efficient information aggregation strategies for distributed control and signal processing,” in arXiv preprint arXiv:1009.6036, 2010.
- [14] S. S. Ram, V. V. Veeravalli, and A. Nedić, “Distributed non-autonomous power control through distributed convex optimization,” in IEEE INFOCOM, pp. 3001–3005, 2009.
- [15] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, and others “Distributed optimization and statistical learning via the alternating direction method of multipliers,” in Foundations and Trends® in Machine learning, vol. 3, no. 1, pp. 1–122, 2011.
- [16] V. Cevher, S. Becker, and M. Schmidt “Convex Optimization for Big Data: Scalable, randomized, and parallel algorithms for big data analytics,” in IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 32–43, 2014.
- [17] D. P. Bertsekas “Incremental proximal methods for large scale convex optimization,” in Mathematical programming, vol. 129, no. 2, pp. 163–195, 2011.
- [18] A. Nedic, D. P. Bertsekas “Incremental Subgradient Methods for Nondifferentiable Optimization,” in SIAM Journal on Optimization, vol. 12, no. 1, pp. 109–138, 2001.
- [19] S. Ram, A. Nedić, V. Veeravalli “Incremental stochastic subgradient algorithms for convex optimization,” in SIAM Journal on Optimization, vol. 20, no. 2, pp. 691–717, 2009.
- [20] A. Nedić, D. Bertsekas, “Convergence rate of incremental subgradient algorithms,” in SIAM Journal on Optimization, pp. 223–264, 2001.
- [21] A. Nedić, A. Ozdaglar “Distributed subgradient methods for multi-agent optimization,” in IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
- [22] W. Shi, Q. Ling, G. Wu, and W. Yin “Extra: An exact first-order algorithm for decentralized consensus optimization,” in SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
- [23] M. Zhu, S. Martínez “Discrete-time dynamic average consensus,” in Automatica, vol. 46, no. 2, pp. 322–329, 2010.
- [24] C. Xi and U. A. Khan, “DEXTRA: A Fast Algorithm for Optimization Over Directed Graphs,” in IEEE Transactions on Automatic Control, vol. 62, no. 10, pp. 4980–4993, 2017.
- [25] C. Xi, R. Xin, and U. A. Khan, “ADD-OPT: Accelerated distributed directed optimization,” in IEEE Transactions on Automatic Control, vol. 63, no. 5, pp. 1329–1339, 2017.
- [26] C. Xi, V. S. Mai, R. Xin, E. H. Abed, and U. A. Khan, “Linear convergence in optimization over directed graphs with row-stochastic matrices,” in IEEE Transactions on Automatic Control, vol. 63, no. 10, pp. 3558–3565, 2018.
- [27] J. Xu, S. Zhu, Y. Soh, and L. Xie “Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes,” in Proceedings of the 54th IEEE Conference on Decision and Control (CDC), pp. 2055–2060, 2015.
- [28] L. Zhi, S. Wei, and Y. Ming “A Decentralized Proximal-Gradient Method With Network Independent Step-Sizes and Separated Convergence Rates,” in IEEE Transactions on Signal Processing, vol. 67, no. 17, pp. 4494–4506, 2019.
- [29] K. Yuan, B. Ying, X. Zhao, and A. H. Sayed “Exact diffusion for distributed optimization and learning—Part I: Algorithm development,” in IEEE Transactions on Signal Processing, vol. 67, no. 3, pp. 708–723, 2018.
- [30] A. H. Sayed, “Adaptation, Learning and Optimization over networks,” in Foundations and Trends in Machine Learning, vol. 7, no. 4-5, pp. 311–801, 2014, 10.1109.
- [31] S. Haykin, “Cognitive Dynamic Systems: Perception-action Cycle, Radar and Radio,”Cambridge University Press, New York, NY, USA. , 2012.
- [32] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control,, vol. 54, no. 1, pp. 48-61, 2009.
- [33] J. Chen and A. H. Sayed, “Distributed Pareto optimization via diffusion strategies,” IEEE J. Sel. Topics Signal Process., vol. 7, no. 2, pp. 205-220, April. 2013.
- [34] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM Journal on optimization., vol. 26, no. 3, pp. 1835-1854, 2016.
- [35] J. C. Duchi, A. Agarwal, and M. J. Wainwright, “Dual Averaging for distributed optimization: Convergence analysis and network scaling,” IEEE Transactions on Automatic Control,, vol. 57, no. 3, pp. 592-606, 2012.
- [36] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in decentralized consensus optimization,” IEEE Trans. Signal Process.,, vol. 62, no. 7, pp. 1750-1761, 2014.
- [37] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Transactions on Control of Network Systems,, vol. 5, no. 3, pp. 1245-1260, 2018.
- [38] A. Nedic, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization,, vol. 27, no. 4, pp. 2597-2633, 2017.
- [39] K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulie “Optimal algorithms for smooth and strongly convex distributed optimization in the network,” in International Conference on Machine Learning (ICML). Stockholm, Sweden, 2017, pp. 3027-3036,.
- [40] A. I. Chen, and A. Ozdaglar, “A fast distributed optimal-gradient method,” in Annual Allerton Conference on communication , Control, and Computing. Monticello, IL, USA, Oct. 2012, pp. 601-608.
- [41] G. M. Baudet, “Asynchronous iterative methods for multiprocessors,” Journal of the ACM(JACM), vol. 25, no. 2, pp. 226-244, 1978.
- [42] D. Bertsekas, “Distributed dynamic programming,” IEEE Transactions on Automatic Control, vol. 27, no. 3, pp. 610-616, 1982.
- [43] S. A. Alghunaim, M. Yan, and A. H. Sayed, “A Multi-Agent Primal-Dual Strategy for Composite Optimization over Distributed Features,” IEEE 28th European Signal Processing Conference (EUSIPCO), pp. 2095–2099, 2021.
- [44] S. A. Alghunaim, Q. Lyu, M. Yan, and A. H. Sayed, “A Multi-Agent Primal-Dual Strategy for Composite Optimization over Distributed Features,” IEEE Transactions on Signal Processing, vol. 69, pp. 5568–5579, 2021.
- [45] S. Boyd and L. Vandenberghe, “Convex Optimization.” Cambridge, U.K.: Cambridge Univ. Press, 2004.
- [46] D. Bertsekas, “Convex Analysis and optimization.” Singapore: Athena Scientific, 2003.
- [47] P. Latafat, N. M. Freris, and P. Patrinos, “A new randomized block-coordinate primal-dual proximal algorithm for distributed optimization,” IEEE Transactions on Automatic Control , vol. 64, no. 10, pp. 4050-4065, Oct. 2019.
- [48] R. Xin and U. A. Khan, “A linear algorithm for optimization over directed graphs with geometric convergence,” IEEE Control Systems Letters, vol. 2, no. 3, pp. 315–320, 2018.
- [49] D. Kempe, A. Dobra, and J. Gehrke, “Gossip-based computation of aggregate information,” in 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings., pp. 482–491, 2003.
- [50] Q. Qiu and H. Su, “Finite-Time Output Synchronization for Output-Coupled Reaction-Diffusion Neural Networks With Directed Topology,” in IEEE Transactions on Network Science and Engineering, vol. 9, no. 3, pp. 1386–1394, 2022.
- [51] P. Wan et al., “Optimal Control for Positive and Negative Information Diffusion Based on Game Theory in Online Social Networks,” in IEEE Transactions on Network Science and Engineering, vol. 10, no. 1, pp. 426–440, 2023.
Sajad Zandi completed his Master of Science degree in Electrical Engineering from Malayer University, Iran, in 2017. Since May 2024, he has been a PhD student at the School of Mechanical and Mechatronics Engineering, University of Technology Sydney (UTS), Australia. His research focuses on adaptive and statistical signal processing, cooperative learning, multi-agent networking, distributed adaptive estimation, distributed active noise control, information-theoretic data, signal processing for communications, and adaptive optimization. |
Mehdi Korki received the bachelor degree in electrical engineering from Shiraz University, Shiraz, Iran, in 2001 and the master and Ph.D. degrees in electrical engineering from Swinburne University of Technology, Melbourne, Australia, in 2012 and 2016, respectively. He is currently a Lecturer at the School of Science, Computing and Engineering Technologies, Swinburne University of Technology, Melbourne, Australia. His research interests include statistical signal processing, data privacy, wireless sensor networks and distributed signal processing. |