
Federated Multi-task Graph Learning

Published: 11 June 2022

Abstract

Distributed processing and analysis of large-scale graph data remain challenging because of the high-level discrepancy among graphs. This study investigates a novel subproblem: distributed multi-task learning on graphs, which jointly learns multiple analysis tasks from decentralized graphs. We propose a federated multi-task graph learning (FMTGL) framework to solve the problem within a privacy-preserving and scalable scheme. Its core is an innovative data-fusion mechanism and a low-latency distributed optimization method. The former captures multi-source data relatedness and generates universal task representations for local task analysis. The latter enables quick updates of our framework through gradient sparsification and tree-based aggregation. As a theoretical result, the proposed optimization method has a convergence rate that interpolates between \(\mathcal {O}(1/T)\) and \(\mathcal {O}(1/\sqrt {T})\), up to logarithmic terms. Unlike previous studies, our work analyzes the convergence behavior under adaptive stepsize selection and a non-convex assumption. Experimental results on three graph datasets verify the effectiveness and scalability of FMTGL.

A Appendix

A.2 Parameter Study

We investigate the model configuration on the DBLP dataset in this section. The first two hyperparameters are the number of D2TGNN layers \(L\) and the node representation dimension of D2TGNN \(d\). We summarize the results in Table 6, which indicates that the configuration \(L=3,d=256\) is a good setting: FMTGL performs well under this configuration while remaining lightweight.
d     Task                MAP(%)                  MF1(%)
                          L=1    L=3    L=5      L=1    L=3    L=5
128   AI                  88.99  88.91  89.34    83.81  84.42  84.10
128   System              91.10  90.97  91.16    85.93  85.99  86.48
128   Theory              90.44  90.45  91.10    88.31  87.92  88.92
128   Interdisciplinary   91.23  90.94  91.20    85.26  85.38  85.60
256   AI                  90.60  90.49  90.49    85.01  85.09  85.17
256   System              92.56  92.72  92.67    87.41  87.62  87.34
256   Theory              91.72  91.91  91.61    89.24  89.28  89.05
256   Interdisciplinary   92.54  92.27  92.81    86.49  86.16  86.75
512   AI                  89.66  89.81  90.03    84.63  84.47  84.35
512   System              92.00  91.71  91.85    87.09  87.05  87.05
512   Theory              91.28  91.09  91.36    88.73  88.50  88.76
512   Interdisciplinary   91.70  92.09  91.88    86.22  86.24  85.95
Table 6. Results for Selecting the Number of D2TGNN Layers \(L\) and the Representation Dimension \(d\)
We then discuss the selection of the number of support vectors \(h\) used to construct the fusion space and the reduction dimension \(c\) of the node representation used to compute the adjacency correction term. The results are summarized in Table 7, and we find that our method is robust to the selection of \(h\) and \(c\). Moreover, the results indicate that the optimal configuration is \(h=8,c=10\).
h     Task                MAP(%)                  MF1(%)
                          c=5    c=10   c=20     c=5    c=10   c=20
4     AI                  90.02  90.19  89.84    84.82  85.00  84.60
4     System              91.96  92.05  92.19    86.97  87.31  87.31
4     Theory              91.63  91.44  91.31    89.01  88.79  89.13
4     Interdisciplinary   92.01  91.95  91.78    86.03  86.35  85.92
8     AI                  90.06  89.82  89.83    84.60  84.52  84.71
8     System              91.92  92.33  91.90    87.07  87.33  87.28
8     Theory              91.21  91.37  91.45    88.65  88.71  89.12
8     Interdisciplinary   91.88  92.20  91.83    86.14  86.44  86.04
16    AI                  90.12  89.96  89.44    84.64  84.86  84.01
16    System              91.92  91.71  91.77    86.95  86.56  86.61
16    Theory              91.52  91.62  91.30    89.03  89.04  88.88
16    Interdisciplinary   91.42  92.10  91.76    86.10  86.36  85.84
Table 7. Results for Selecting the Reduction Dimension \(c\) and the Number of Support Vectors \(h\)

A.3 Theoretical Analysis

A.3.1 Proof of Lemma 1.

As \(f(\mathrm{W})=\frac{1}{m}\Sigma _{i=1}^ml_i(\mathrm{w}_i)\), we have:
\begin{equation} \begin{array}{ll} \displaystyle ||\nabla f(\mathrm{W}_1)-\nabla f(\mathrm{W}_2)||^2 &=\ \frac{1}{m^2}\Bigg \Vert \sum _{i=1}^m\nabla l_i(\mathrm{w}_i^1)-\nabla l_i(\mathrm{w}_i^2)\Bigg \Vert ^2\\ &\displaystyle \le \ \frac{1}{m}\sum _{i=1}^m||\nabla l_i(\mathrm{w}_i^1)-\nabla l_i(\mathrm{w}_i^2)||^2\\ &\displaystyle \le \ \frac{L^2}{m}\sum _{i=1}^m||\mathrm{w}_i^1-\mathrm{w}_i^2||^2=\frac{L^2}{m}||\mathrm{W}_1-\mathrm{W}_2||^2. \end{array} \end{equation}
(23)
Thus, we derive Lemma 1.
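The only nontrivial step above is \(||\frac{1}{m}\Sigma _{i=1}^m v_i||^2\le \frac{1}{m}\Sigma _{i=1}^m||v_i||^2\), i.e., convexity of the squared norm. As an illustrative numerical check (our own sketch, not part of the proof):

```python
import numpy as np

# Illustrative check of ||(1/m) sum_i v_i||^2 <= (1/m) sum_i ||v_i||^2,
# the step that turns (1/m^2)||sum_i (grad l_i(w_i^1) - grad l_i(w_i^2))||^2
# into the averaged per-task bound used in Equation (23).
rng = np.random.default_rng(0)
m, d = 8, 100
V = rng.normal(size=(m, d))            # stand-ins for per-task gradient differences
lhs = np.linalg.norm(V.sum(axis=0) / m) ** 2
rhs = np.mean(np.linalg.norm(V, axis=1) ** 2)
assert lhs <= rhs                      # holds for any choice of vectors
```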

A.3.2 Proof of Lemma 2.

Since \(f(\mathrm{W})=\frac{1}{m}\Sigma _{i=1}^ml_i(\mathrm{w}_i)\), we have:
\begin{equation} \begin{array}{ll} \displaystyle ||\nabla f(\mathrm{W})||^2&=\ \Bigg \Vert \frac{1}{m}\sum _{i=1}^m\nabla l_i(w_i)\Bigg \Vert ^2\\ &=\ \displaystyle \Bigg \Vert \frac{1}{m}\sum _{i=1}^m\Bigg (\frac{\partial l_i}{\partial w}(w^i),\dots ,\frac{\partial l_i}{\partial w^i}(w),\dots \Bigg)\Bigg \Vert ^2\\ \displaystyle &=\ \frac{1}{m^2}\Bigg \Vert \Bigg (\sum _{i=1}^m\frac{\partial l_i}{\partial w}(w^i),\dots ,\frac{\partial l_i}{\partial w^i}(w),\dots \Bigg)\Bigg \Vert ^2\\ \displaystyle &=\ \frac{1}{m^2}\Bigg \Vert \sum _{i=1}^m\frac{\partial l_i}{\partial w}(w^i)\Bigg \Vert ^2+\frac{1}{m^2}\sum _{i=1}^m\Bigg \Vert \frac{\partial l_i}{\partial w^i}(w)\Bigg \Vert ^2\\ \displaystyle &=\ ||\nabla f(w;w^1,\dots ,w^m)||^2+\sum _{i=1}^m\Bigg \Vert \frac{1}{m}\nabla l_i(w^i;w)\Bigg \Vert ^2. \end{array} \end{equation}
(24)

A.3.3 Proof of Lemma 3.

According to H(1), we have:
\begin{equation} \begin{array}{ll} & ||\nabla l_i((\mathrm{w},\mathrm{w}_x^i))-\nabla l_i((\mathrm{w},\mathrm{w}_y^i))||^2\\ &\le \ L^2||(\mathrm{w},\mathrm{w}_x^i)-(\mathrm{w},\mathrm{w}_y^i)||^2\\ &=\ L^2||\mathrm{w}_x^i-\mathrm{w}_y^i||^2. \end{array} \end{equation}
(25)
Thus, we derive the first inequality. Since the second equation of Lemma 3 is trivial, we prove the third inequality here:
\begin{equation} \begin{aligned}&||\nabla l_i (\mathrm{w},\mathrm{w}^i)-G^i(\mathrm{w},\mathrm{w}^i,\xi)||^2\\ &=\ ||\nabla l_i(\mathrm{w};\mathrm{w}^i)-G^i(\mathrm{w},\xi ;\mathrm{w}^i)||^2 +||\nabla l_i(\mathrm{w}^i;\mathrm{w})-G^i(\mathrm{w}^i,\xi ;\mathrm{w})||^2\\ &\ge \ ||\nabla l_i(\mathrm{w}^i;\mathrm{w})-G^i(\mathrm{w}^i,\xi ;\mathrm{w})||^2. \end{aligned} \end{equation}
(26)
Because \(e^{\frac{x^2}{\sigma ^2}}\) is an increasing function, we derive the third inequality with the expectation operator and H(3).

A.3.4 Corollary of Hypothesis 4.

With H(4), we further derive the following lemma.
Lemma 4.
\(\forall x^i\in \mathrm{R}^d,i=1,\dots ,m\), and \(0\lt k\le d\), it holds that:
\begin{equation} \Bigg \Vert \sum _{i=1}^m x^i-gTopK_{i=1}^mx^i\Bigg \Vert ^2\le \left(1-\frac{k}{d}\right)\Bigg \Vert \sum _{i=1}^m x^i\Bigg \Vert ^2. \end{equation}
(27)
Proof. We have \(\mathrm{E}_{\xi }||RandomK(x,\xi)||^2=\frac{k}{d}||x||^2\); according to H(4), we obtain Equation (27).
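As an illustrative check of Lemma 4 (our own sketch, not the authors' code): keeping the \(k\) largest-magnitude coordinates of the summed vector loses at most a \(1-\frac{k}{d}\) fraction of its squared norm.

```python
import numpy as np

# Illustrative check of Equation (27): the global top-k of the summed vector
# retains at least a k/d fraction of its squared norm.
rng = np.random.default_rng(0)
m, d, k = 8, 100, 10
s = rng.normal(size=(m, d)).sum(axis=0)   # sum_i x^i
gtopk = np.zeros(d)
idx = np.argsort(np.abs(s))[-k:]          # indices of the k largest |entries|
gtopk[idx] = s[idx]                       # gTopK applied to the sum
assert np.linalg.norm(s - gtopk) ** 2 <= (1 - k / d) * np.linalg.norm(s) ** 2
```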

A.3.5 Proof of Theorem 2.

First, we define an auxiliary random variable \(x_t\) at iteration \(t\):
\begin{equation} x_{t+1}=x_t-\alpha _tG_t(\mathrm{w}_t,\xi _t), \end{equation}
(28)
where \(G_t(\mathrm{w}_t,\xi _t)=\frac{1}{m}\Sigma _{i=1}^mG^i_t(\mathrm{w}_t,\xi _t^i)\) and \(x_0=0\). The difference between \(x_t\) and the model parameter \(\mathrm{w}_t\) is defined as \(\epsilon _t = \mathrm{w}_t-x_t\). According to Algorithm 1, we can verify the following equation by induction:
\begin{equation} \epsilon _t = \frac{1}{m}\sum _{i=1}^m\epsilon _t^i. \end{equation}
(29)
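Algorithm 1 is not reproduced in this appendix. As an illustrative sketch of the update it describes (our own NumPy code, assuming gTopK keeps the \(k\) globally largest-magnitude coordinates of the aggregated vector; all names are ours):

```python
import numpy as np

def gtopk_sgd_step(w, grads, residuals, alpha, k):
    """One sparsified step (sketch): apply only the k globally largest
    coordinates of the aggregated corrected update; feed the rest back
    into the per-worker residuals eps_t^i."""
    m, d = grads.shape
    v = alpha * grads + residuals            # per-worker corrected updates
    s = v.sum(axis=0)                        # aggregation (tree-based in FMTGL)
    mask = np.zeros(d, dtype=bool)
    mask[np.argsort(np.abs(s))[-k:]] = True  # global top-k coordinate selection
    w_new = w - (s * mask) / m               # sparsified averaged update
    new_residuals = v * ~mask                # unsent coordinates stay local
    return w_new, new_residuals
```

With this bookkeeping, \(w_{t+1}-x_{t+1}\) equals the average of the returned per-worker residuals, which is exactly Equation (29).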
The following lemma gives an estimate about the second moment of \(\epsilon _t\):
Lemma 5.
For any iteration \(t\),
\begin{equation} \mathrm{E}[||\epsilon _t||^2]\le \left(1+\frac{1}{\eta }\right)\gamma \mathrm{E}[||\alpha _tG_t(\mathrm{w}_t,\xi _t)||^2]+\gamma (1+\eta)\mathrm{E}[||\epsilon _{t-1}||^2], \end{equation}
(30)
where \(\gamma =1-\frac{k}{d},0\lt k\le d\). \(\eta \gt 0\) is an arbitrarily selected constant.
Lemma 6.
If \(x\ge 0\) and \(x\le C(A+Bx)^{\epsilon +1/2}\), then,
\begin{equation} x\le \max \big (\big [C(2B)^{1/2+\epsilon }\big ]^{\frac{1}{1/2-\epsilon }},C(2A)^{1/2+\epsilon }\big). \end{equation}
(31)
Our analysis is mainly based on Lemma 7. However, there is a large gap between Theorem 2 and Lemma 7; in Lemmas 8–10, we derive several sharper estimates to bridge it.
Lemma 7.
Assuming H(1), H(2), then for any \(T\ge 1\), the iterates of MTgTop-k S-SGD satisfy the following estimate:
\begin{equation} \begin{aligned}\frac{3}{4}\mathrm{E}\left[\sum _{t=1}^T\alpha _t||\nabla f(\mathrm{w}_t)||^2\right] \le f(w_1)-f^*+\sum _{t=1}^T\alpha _t\tilde{L}^2\mathrm{E}[||\epsilon _t||^2]+\frac{\tilde{L}}{2}\mathrm{E}\big [||\alpha _tG_t(\mathrm{w}_t,\xi _t)||^2\big ]. \end{aligned} \end{equation}
(32)
Lemma 8.
Assuming H(3), for any iteration \(T\), we have the following estimate:
\begin{equation} \mathrm{E}\left[\max _{1\le t\le T}\Bigg \Vert \frac{1}{m}\sum _{i=1}^m\nabla l_i(\mathrm{w}_t)-G_t(\mathrm{w}_t,\xi _t)\Bigg \Vert ^2\right]\le \sigma ^2\left(\frac{1}{m}+\ln T\right). \end{equation}
(33)
Lemma 9.
Let \(\alpha _t\) be defined as in Equation (10); then the following bounds hold for the gradients up to iteration \(T\):
\begin{equation} \begin{aligned}\mathrm{E}\left[\sum _{t=1}^T\alpha _t^2||G_t(\mathrm{w}_t,\xi _t)||^2\right] &\le \frac{\alpha ^2}{2\epsilon \beta ^{2\epsilon }}+2(\alpha _1-\alpha _{T+1})\mathrm{E}[\max _{1\le t\le T}\alpha _t||G_t(\mathrm{w}_t,\xi _t)||^2]\\ \mathrm{E}\left[\sum _{t=1}^T\alpha _t^3||G_t(\mathrm{w}_t,\xi _t)||^2\right] &\le \frac{2\alpha ^3}{(6\epsilon +1)\beta ^{3\epsilon +1/2}}+(\alpha _1-\alpha _{T+1})\frac{\alpha ^2}{2\epsilon \beta ^{2\epsilon }}\\ &\quad \ +\ (\alpha _1^2-\alpha _{T+1}^2)\mathrm{E}[\max _{1\le t\le T}\alpha _t||G_t(\mathrm{w}_t,\xi _t)||^2]. \end{aligned} \end{equation}
(34)
For simplicity, we denote \(\mathrm{E}[\Sigma _{t=1}^T\alpha _t^2||G_t(\mathrm{w}_t,\xi _t)||^2]\) by \(Q^2_T\) and \(\mathrm{E}[\Sigma _{t=1}^T\alpha _t^3||G_t(\mathrm{w}_t,\xi _t)||^2]\) by \(Q^3_T\). Another essential bound for our analysis is given by the following lemma.
Lemma 10.
For any iteration \(T\):
\begin{equation} \mathrm{E}\left[\sum _{t=1}^T\alpha _t||\epsilon _t||^2\right]\le \frac{1}{\eta }Q_T^3\sum _{t=1}^{T-1}((1+\eta)\gamma)^t. \end{equation}
(35)
As before, \(\gamma = 1-\frac{k}{d}\) with \(0\lt k\le d\), and \(\eta \gt 0\). We can further simplify the r.h.s.: define \(\gamma (1+\eta)=\tau\) and select \(\eta\) such that \(\tau \lt 1\); then:
\begin{equation} \mathrm{E}\left[\sum _{t=1}^T\alpha _t||\epsilon _t||^2\right]\le \frac{\tau Q_T^3}{\eta (1-\tau)}. \end{equation}
(36)
Proof. According to Lemma 7:
\begin{equation} \begin{aligned}\frac{3}{4} \mathrm{E}\Bigg [\sum _{t=1}^T\alpha _t||\nabla f(\mathrm{w}_t)||^2\Bigg ]\le f(w_1)-f^*+\sum _{t=1}^T\alpha _t\tilde{L}^2\mathrm{E}[||\epsilon _t||^2] +\frac{\tilde{L}}{2}\mathrm{E}[||\alpha _tG_t(\mathrm{w}_t,\xi _t)||^2]. \end{aligned} \end{equation}
(37)
Using Lemma 10, it holds that:
\begin{equation} \frac{3}{4}\mathrm{E}\Bigg [\sum _{t=1}^T\alpha _t||\nabla f(\mathrm{w}_t)||^2\Bigg ]\le f(w_1)-f^*+\frac{\tau \tilde{L}^2Q_T^3}{\eta (1-\tau)}+\frac{\tilde{L}Q_T^2}{2}. \end{equation}
(38)
Expanding the r.h.s. with Lemma 9, we derive:
\begin{equation} \begin{aligned} \frac{\tau \tilde{L}^2Q_T^3}{\eta (1-\tau)}+\frac{\tilde{L}Q_T^2}{2}\le & \left(\frac{\tau \alpha _1^2\tilde{L}^2}{\eta (1-\tau)}+\alpha _1\tilde{L}\right)\mathrm{E}[\max _{1\le t \le T}\alpha _t||G_t(\mathrm{w}_t,\xi _t)||^2] +K_1. \end{aligned} \end{equation}
(39)
Further, according to Lemma 8:
\begin{equation} \begin{aligned}\mathrm{E}\left[\max _{1\le t\le T}\alpha _t||G_t(\mathrm{w}_t,\xi _t)||^2\right]&\le \mathrm{E}[\max _{1\le t\le T}\alpha _t(2||G_t(\mathrm{w}_t,\xi _t)-\nabla f(\mathrm{w}_t)||^2 +2||\nabla f(\mathrm{w}_t)||^2)]\\ &\le \alpha _1 \sigma ^2\left(\frac{1}{m}+\ln T\right) +\mathrm{E}\left[\max _{1\le t\le T}\alpha _t||\nabla f(\mathrm{w}_t)||^2\right]\\ &\le \alpha _1 \sigma ^2\left(\frac{1}{m}+\ln T\right) +\mathrm{E}\left[\sum _{t=1}^T\alpha _t||\nabla f(\mathrm{w}_t)||^2\right]. \end{aligned} \end{equation}
(40)
Combining Equations (38), (39), and (40), we obtain:
\begin{equation} \begin{aligned} &\frac{3}{4}\mathrm{E}\Bigg [\sum _{t=1}^T\alpha _t||\nabla f(\mathrm{w}_t)||^2\Bigg ]\le f(w_1)-f^*+ \Bigg (\frac{\tau \alpha _1^2\tilde{L}^2}{\eta (1-\tau)}+\alpha _1\tilde{L}\Bigg)\Bigg (\alpha _1\sigma ^2\Bigg (\frac{1}{m}+\ln T\Bigg)+\\ &\mathrm{E}\left[\sum _{t=1}^T\alpha _t||\nabla f(\mathrm{w}_t)||^2\right]\Bigg)+K_1. \end{aligned} \end{equation}
(41)
where \(K_1\) equals \(\frac{2\tau \tilde{L}^2\alpha ^3}{\eta (1-\tau)(6\epsilon +1)\beta ^{3\epsilon +1/2}}+\frac{\tau \alpha _1\alpha ^2\tilde{L}^2}{2\eta (1-\tau)\epsilon \beta ^{2\epsilon }}+\frac{\alpha ^2\tilde{L}}{4\epsilon \beta ^{2\epsilon }}\) and \(\alpha _1 = \frac{\alpha }{\beta ^{\epsilon +1/2}}\). We rearrange Equation (41) by moving \(\mathrm{E}[\Sigma _{t=1}^T\alpha _t||\nabla f(\mathrm{w}_t)||^2]\) to the l.h.s. and let \(\kappa\) denote the r.h.s. of the result:
\begin{equation} \mathrm{E}\Bigg [\sum _{t=1}^T\alpha _t||\nabla f(\mathrm{w}_t)||^2\Bigg ]\le \kappa . \end{equation}
(42)
For simplicity, we define \(\Delta = \Sigma _{t=1}^T||\nabla f(\mathrm{w}_t)||^2\). According to Hölder's inequality, we can lower-bound the l.h.s. of Equation (42):
\begin{equation} \begin{aligned}\mathrm{E}\Bigg [\sum _{t=1}^T\alpha _t||\nabla f(\mathrm{w}_t)||^2\Bigg ]&\ge \mathrm{E}[\alpha _T\Delta ]\\ &\ge \frac{(\mathrm{E}[\Delta ^{1/2-\epsilon }])^{\frac{1}{1/2-\epsilon }}}{(\mathrm{E}[(\frac{1}{\alpha _T})^{\frac{1/2-\epsilon }{1/2+\epsilon }}])^{\frac{1/2+\epsilon }{1/2-\epsilon }}}. \end{aligned} \end{equation}
(43)
For \(\frac{1}{\alpha _T}\), it holds that:
\begin{equation} \begin{aligned}\frac{1}{\alpha _T} &= \frac{1}{\alpha }\Bigg (\beta +\sum _{t=1}^{T-1}||G_t(\mathrm{w}_t)||^2\Bigg)^{1/2+\epsilon }\\ &\le \frac{1}{\alpha }\Bigg (\beta +2\sum _{t=1}^{T-1}(||\nabla f(\mathrm{w}_t)-G_t(\mathrm{w}_t)||^2+||\nabla f(\mathrm{w}_t)||^2)\Bigg)^{1/2+\epsilon }. \end{aligned} \end{equation}
(44)
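Equation (44) shows that the stepsize of Equation (10) has an AdaGrad-Norm form, \(\alpha _t=\alpha /(\beta +\Sigma _{i=1}^{t-1}||G_i(\mathrm{w}_i,\xi _i)||^2)^{1/2+\epsilon }\). A minimal sketch of this schedule under that reading (our own illustration, not the authors' code):

```python
import numpy as np

def adaptive_stepsizes(grad_norms_sq, alpha=0.1, beta=1.0, eps=0.1):
    """Stepsize schedule implied by Equation (44):
    alpha_t = alpha / (beta + sum_{i<t} ||G_i||^2)^(1/2 + eps)."""
    acc = beta + np.concatenate(([0.0], np.cumsum(grad_norms_sq)[:-1]))
    return alpha / acc ** (0.5 + eps)

# The schedule is non-increasing, the monotonicity used around Equation (64).
steps = adaptive_stepsizes(np.random.default_rng(0).exponential(size=50))
assert np.all(np.diff(steps) <= 0)
```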
Combining Equations (42), (43), and (44) with H(3), we derive:
\begin{equation} \begin{aligned}(\mathrm{E}[\Delta ^{1/2-\epsilon }])^{\frac{1}{1/2-\epsilon }}&\le \kappa \Bigg (\mathrm{E}\Bigg [(\frac{1}{\alpha _T})^{\frac{1/2-\epsilon }{1/2+\epsilon }}\Bigg ]\Bigg)^{\frac{1/2+\epsilon }{1/2-\epsilon }} \le \kappa \Bigg (\mathrm{E}\Bigg [\Bigg (\beta +2\sum _{t=1}^{T-1}||\nabla f(w_t)-G(w_t)||^2\Bigg)^{1/2-\epsilon }\Bigg ]\\ &\quad \ +\ 2\mathrm{E}\Bigg [\Bigg (\sum _{t=1}^{T-1}||\nabla f(w_t)||^2\Bigg)^{1/2-\epsilon }\Bigg ]\Bigg)^{\frac{1/2+\epsilon }{1/2-\epsilon }}\\ &\le \kappa ((\beta +2T\sigma ^2)^{1/2-\epsilon }+2\mathrm{E}[\Delta ^{1/2-\epsilon }])^{\frac{1/2+\epsilon }{1/2-\epsilon }}. \end{aligned} \end{equation}
(45)
Noticing the following lower bound for \(\mathrm{E}[\Delta ^{1/2-\epsilon }]\):
\begin{equation} T^{1/2-\epsilon }\mathrm{E}[\min _{1\le t\le T}||\nabla f(\mathrm{w}_t)||^{1-2\epsilon }]\le \mathrm{E}[\Delta ^{1/2-\epsilon }], \end{equation}
(46)
and applying Lemma 6, we derive Theorem 2.

A.3.6 Proof of Lemma 5.

Note that \(\epsilon _t = \mathrm{w}_t-x_t\):
\begin{equation} \begin{aligned}\mathrm{E}[||w_{t+1}-x_{t+1}||^2] &=\ \mathrm{E}\Bigg [\Bigg \Vert \frac{1}{m}\sum _{i}^m(\alpha _tG_t^i(\mathrm{w}_t,\xi _t)+\epsilon _t^i)+\mathrm{w}_t-x_t-\epsilon _t\\ &\quad \ -\ \frac{1}{m}gTopK_{i=1}^m(\alpha _tG_t^i(\mathrm{w}_t,\xi _t)+\epsilon _t^i)\Bigg \Vert ^2\Bigg ]\\ &=\ \mathrm{E}\Bigg [\Bigg \Vert \frac{1}{m}\sum _{i}^m(\alpha _tG_t^i(\mathrm{w}_t,\xi _t)+\epsilon _t^i)\\ &\quad \ -\ \frac{1}{m}gTopK_{i=1}^m(\alpha _tG_t^i(\mathrm{w}_t,\xi _t)+\epsilon _t^i)\Bigg \Vert ^2\Bigg ]\\ &\le \ \gamma \mathrm{E}\Bigg [\Bigg \Vert \frac{1}{m}\sum _{i}^m(\alpha _tG_t^i(\mathrm{w}_t,\xi _t)+\epsilon _t^i)\Bigg \Vert ^2\Bigg ]\\ &=\ \gamma \mathrm{E}\Bigg [\Bigg \Vert \alpha _tG_t(\mathrm{w}_t,\xi _t)+\mathrm{w}_t-x_t\Bigg \Vert ^2\Bigg ]\\ &\le \ \gamma \left(1+\frac{1}{\eta }\right)\mathrm{E}[||\alpha _tG_t(\mathrm{w}_t,\xi _t)||^2]+\gamma (1+\eta)\mathrm{E}[||\mathrm{w}_t-x_t||^2], \end{aligned} \end{equation}
(47)
and substituting \(\epsilon _{t+1}\) for \(w_{t+1}-x_{t+1}\), we get the result.

A.3.7 Proof of Lemma 6.

If \(A\le Bx\), then \(x\le C(2Bx)^{\frac{1}{2}+\epsilon }\); moving \(x\) to the l.h.s., we have \(x\le [C(2B)^{\frac{1}{2}+\epsilon }]^{\frac{1}{1/2-\epsilon }}\).
If \(A\gt Bx\), then \(x\lt C(2A)^{\frac{1}{2}+\epsilon }\). The proof is finished by taking the maximum of the two estimates.
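As an illustrative numerical check of Lemma 6 (our own sketch, not part of the proof), one can scan the feasible set of the hypothesis and compare its maximum with the claimed bound:

```python
import numpy as np

# Check of Lemma 6: the largest x satisfying x <= C(A + Bx)^(1/2+eps) stays
# below max([C(2B)^(1/2+eps)]^(1/(1/2-eps)), C(2A)^(1/2+eps)).
C, A, B, eps = 1.3, 2.0, 0.7, 0.1
x = np.linspace(0.0, 1e3, 2_000_000)
feasible = x[x <= C * (A + B * x) ** (0.5 + eps)]
bound = max((C * (2 * B) ** (0.5 + eps)) ** (1 / (0.5 - eps)),
            C * (2 * A) ** (0.5 + eps))
assert feasible.max() <= bound
```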

A.3.8 Proof of Lemma 7.

Under assumption H(1), we have
\begin{equation} f(x_{t+1})\le f(x_t)+\nabla f(x_t)^\mathrm{T}(x_{t+1}-x_t) +\frac{\tilde{L}}{2}||x_{t+1}-x_t||^2, \end{equation}
(48)
since \(x_{t+1}-x_t=-\alpha _tG_t(\mathrm{w}_t)\), we have
\begin{equation} \begin{aligned}&f(x_{t+1})\le f(x_t) +\alpha _t \nabla f(x_t)^\mathrm{T}(\nabla f(\mathrm{w}_t)-G_t(\mathrm{w}_t,\xi _t)) -\alpha _t\nabla f(x_t)^\mathrm{T}\nabla f(\mathrm{w}_t)+\frac{\tilde{L}}{2}||\alpha _t G_t(\mathrm{w}_t)||^2, \end{aligned} \end{equation}
(49)
according to assumption H(2), we have
\begin{equation} \mathrm{E}[\alpha _t \nabla f(x_t)^\mathrm{T}(\nabla f(\mathrm{w}_t)-G_t(\mathrm{w}_t,\xi _t))]=0, \end{equation}
(50)
besides, it holds that:
\begin{equation} \begin{aligned}-\alpha _t\nabla f(x_t)^\mathrm{T}\nabla f(\mathrm{w}_t) &= -\frac{\alpha _t}{2}||\nabla f(x_t)||^2-\frac{\alpha _t}{2}||\nabla f(\mathrm{w}_t)||^2 +\frac{\alpha _t}{2}||\nabla f(x_t)-\nabla f(\mathrm{w}_t)||^2. \end{aligned} \end{equation}
(51)
According to H(1):
\begin{equation} \begin{aligned}-\alpha _t\nabla f(x_t)^\mathrm{T}\nabla f(\mathrm{w}_t) & \le -\frac{\alpha _t}{2}||\nabla f(x_t)||^2-\frac{\alpha _t}{2}||\nabla f(\mathrm{w}_t)||^2 +\frac{\alpha _t\tilde{L}^2}{2}||\mathrm{w}_t-x_t||^2. \end{aligned} \end{equation}
(52)
Taking expectations in Equation (49) and applying Equations (50) and (52), we obtain:
\begin{equation} \begin{aligned}\mathrm{E}\Bigg [\frac{\alpha _t}{2}(||\nabla f(x_t)||^2+\tilde{L}^2||\mathrm{w}_t-x_t||^2)\Bigg ]&\le \mathrm{E}\Bigg [f(x_t)-f(x_{t+1}) -\frac{\alpha _t}{2}||\nabla f(\mathrm{w}_t)||^2+\alpha _t\tilde{L}^2||\mathrm{w}_t-x_t||^2\\ &\quad \ +\ \frac{\tilde{L}}{2}||\alpha _tG_t(\mathrm{w}_t,\xi _t)||^2\Bigg ]. \end{aligned} \end{equation}
(53)
Furthermore, we apply H(1) again and derive the following inequality:
\begin{equation} \frac{1}{2}||\nabla f(\mathrm{w}_t)||^2\le 2||\nabla f(x_t)||^2+2\tilde{L}^2||\mathrm{w}_t-x_t||^2. \end{equation}
(54)
After simplification, we have
\begin{equation} \begin{aligned}\frac{3}{4}\mathrm{E}[\alpha _t||\nabla f(\mathrm{w}_t)||^2]&\le \mathrm{E}[f(x_t)]-\mathrm{E}[f(x_{t+1})]+ \mathrm{E}[\alpha _t\tilde{L}^2||\mathrm{w}_t-x_t||^2]+\frac{\tilde{L}}{2}\mathrm{E}[||\alpha _tG_t(\mathrm{w}_t,\xi _t)||^2], \end{aligned} \end{equation}
(55)
Summing the above inequality over \(t=1,\ldots ,T\), we then prove Lemma 7 using the fact that \(f^*\le f(x)\) for all \(x\).

A.3.9 Proof of Lemma 8.

According to Jensen's inequality, we have
\begin{equation} \begin{aligned}&\exp \Bigg (\frac{\mathrm{E}[\max _{1\le t\le T}||\frac{1}{m}\sum _i^m\nabla l_i(\mathrm{w}_t)-G_t(\mathrm{w}_t,\xi _t)||^2]}{\sigma ^2}\Bigg)\\ &\quad \ \le \ \mathrm{E}\Bigg [\exp \Bigg (\frac{\max _{1\le t\le T}\frac{1}{m^2}\sum _i^m||\nabla l_i(\mathrm{w}_t)-G_t(\mathrm{w}_t,\xi _t)||^2}{\sigma ^2}\Bigg)\Bigg ]. \end{aligned} \end{equation}
(56)
Since \(\exp (\cdot)\) is an increasing function, the r.h.s. equals:
\begin{equation} \mathrm{E}\Bigg [\max _{1\le t\le T}\exp \Bigg (\frac{\sum _i^m||\nabla l_i(\mathrm{w}_t)-G_t(\mathrm{w}_t,\xi _t)||^2}{m^2\sigma ^2}\Bigg)\Bigg ]. \end{equation}
(57)
Applying Jensen's inequality again, we find that Equation (57) is less than or equal to:
\begin{equation} \mathrm{E}\Bigg [\max _{1\le t\le T}\frac{1}{m}\sum _i^m\exp \Bigg (\frac{||\nabla l_i(\mathrm{w}_t)-G_t(\mathrm{w}_t,\xi _t)||^2}{m\sigma ^2}\Bigg)\Bigg ]. \end{equation}
(58)
Equation (58) is less than or equal to:
\begin{equation} \begin{aligned} &\frac{1}{m}\sum _i^m\sum _t^T\mathrm{E}\Bigg [\exp \Bigg (\frac{||\nabla l_i(\mathrm{w}_t)-G_t(\mathrm{w}_t,\xi _{t})||^2}{m\sigma ^2}\Bigg)\Bigg ]\\ &\quad \ =\ \frac{1}{m}\sum _i^m\sum _{t=1}^T\mathrm{E}\Bigg [\mathrm{E}_i\Bigg [\exp \Bigg (\frac{||\nabla l_i(\mathrm{w}_t)-G_t(\mathrm{w}_t,\xi _{t})||^2}{m\sigma ^2}\Bigg)\Bigg ]\Bigg ]. \end{aligned} \end{equation}
(59)
Note that \(x^{\frac{1}{m}}\) is a concave function of \(x\); applying Jensen's inequality again, we find that Equation (59) is less than or equal to:
\begin{equation} \frac{1}{m}\sum _i^m\sum _{t=1}^T\mathrm{E}\Bigg [\Bigg (\mathrm{E}_i\Bigg [\exp \Bigg (\frac{||\nabla l_i(\mathrm{w}_t)-G_t(\mathrm{w}_t,\xi _t)||^2}{\sigma ^2}\Bigg)\Bigg ]\Bigg)^{\frac{1}{m}}\Bigg ]. \end{equation}
(60)
According to H(3), Equation (60) is less than or equal to \(Te^{\frac{1}{m}}\); taking logarithms in Equation (56) then yields the bound \(\sigma ^2(\frac{1}{m}+\ln T)\), which finishes the proof.

A.3.10 Proof of Lemma 9.

To prove Lemma 9, we first introduce another lemma:
Lemma 11.
Let \(a_0\gt 0, a_i\ge 0,i=1,\ldots ,T\) and \(\beta \gt 1\). Then,
\begin{equation} \sum _{t=1}^T\frac{a_t}{(a_0+\sum _{i=1}^{t}a_i)^\beta }\le \frac{1}{(\beta -1)a_0^{\beta -1}}. \end{equation}
(61)
We now turn to prove Lemma 11. Assuming that \(f:[0,+\infty)\rightarrow [0,+\infty)\) is a non-increasing function, we have:
\begin{equation} \sum _{t=1}^Ta_tf(a_0+\sum _{i=1}^t a_i)\le \int ^{\Sigma _{t=0}^Ta_t}_{a_0}f(x)dx. \end{equation}
(62)
Let \(s_t=\Sigma _{i=0}^ta_i\), and we obtain the following inequality:
\begin{equation} a_if(s_i)=\int _{s_{i-1}}^{s_i} f(s_i)dx\le \int _{s_{i-1}}^{s_i} f(x)dx. \end{equation}
(63)
Summing over \(i\) from 1 to \(T\), we obtain Equation (62). The proof of Lemma 11 is finished by applying Equation (62) with \(f(x)=x^{-\beta }\).
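As an illustrative numerical check of Lemma 11 (not part of the proof), the bound holds for an arbitrary nonnegative sequence:

```python
import numpy as np

# Check: sum_t a_t / (a_0 + sum_{i<=t} a_i)^beta <= 1 / ((beta-1) * a_0^(beta-1)).
rng = np.random.default_rng(0)
a0, beta = 0.5, 1.5
a = rng.exponential(size=1000)         # arbitrary nonnegative sequence a_1..a_T
s = a0 + np.cumsum(a)                  # a_0 + sum_{i=1}^t a_i
lhs = np.sum(a / s ** beta)
rhs = 1.0 / ((beta - 1) * a0 ** (beta - 1))
assert lhs <= rhs
```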
Proof. We now prove the first estimate of Lemma 9:
\begin{equation} \begin{aligned} \mathrm{E}\Bigg [\sum _{t=1}^T\alpha _t^2||G_t(\mathrm{w}_t,\xi _t)||^2\Bigg ]&=\mathrm{E}\Bigg [\sum _{t=1}^T\alpha _{t+1}^2||G_t(\mathrm{w}_t,\xi _t)||^2\\ &\quad +\ \sum _{t=1}^T||G_t(\mathrm{w}_t,\xi _t)||^2(\alpha _t^2-\alpha _{t+1}^2)\Bigg ]. \end{aligned} \end{equation}
(64)
Notice that \((\alpha _t^2-\alpha _{t+1}^2) = (\alpha _t+\alpha _{t+1})(\alpha _t-\alpha _{t+1})\) and that \(\lbrace \alpha _t\rbrace\) is decreasing in \(t\). Hence, Equation (64) is less than or equal to:
\begin{equation} \mathrm{E}\Bigg [\sum _{t=1}^T\alpha _{t+1}^2||G_t(\mathrm{w}_t,\xi _t)||^2+\sum _{t=1}^T2\alpha _t||G_t(\mathrm{w}_t,\xi _t)||^2(\alpha _t-\alpha _{t+1})\Bigg ]. \end{equation}
(65)
According to the definition of \(\alpha _t\),
\begin{equation} \sum _{t=1}^T\alpha _{t+1}^2||G_t(\mathrm{w}_t,\xi _t)||^2 = \sum _{t=1}^T\frac{\alpha ^2 ||G_t(\mathrm{w}_t,\xi _t)||^2}{(\beta +\sum _{i=1}^{t}||G_i(w_i,\xi _i)||^2)^{1+2\epsilon }}. \end{equation}
(66)
Applying Lemma 11, we find that Equation (66) is less than or equal to \(\frac{\alpha ^2}{2\epsilon \beta ^{2\epsilon }}\). Furthermore, we have:
\begin{equation} \begin{aligned}&\sum _{t=1}^T2\alpha _t||G_t(\mathrm{w}_t,\xi _t)||^2(\alpha _t-\alpha _{t+1})\le 2\max _{1\le t\le T}\alpha _t||G_t(\mathrm{w}_t,\xi _t)||^2\sum _{t=1}^T(\alpha _t-\alpha _{t+1}). \end{aligned} \end{equation}
(67)
Combining the above results, we obtain the first estimate. For the second estimate:
\begin{equation} \begin{aligned}\sum _{t=1}^T\alpha _t^3||G_t(\mathrm{w}_t,\xi _t)||^2&=\sum _{t=1}^T\alpha _{t+1}^3||G_t(\mathrm{w}_t,\xi _{t})||^2 +\sum _{t=1}^T(\alpha _t^3-\alpha _{t+1}^3)||G_t(\mathrm{w}_t,\xi _{t})||^2. \end{aligned} \end{equation}
(68)
Similarly, according to the definition of \(\alpha _t\):
\begin{equation} \begin{aligned}\sum _{t=1}^T\alpha _{t+1}^3||G_t(\mathrm{w}_t,\xi _{t})||^2 = \sum _{t=1}^T\frac{\alpha ^3 ||G_t(\mathrm{w}_t,\xi _{t})||^2}{(\beta +\sum _{i=1}^{t}||G_i(w_i,\xi _{i})||^2)^{3/2+3\epsilon }}. \end{aligned} \end{equation}
(69)
Applying Lemma 11 again, we find that Equation (69) is less than or equal to \(\frac{2\alpha ^3}{(6\epsilon +1)\beta ^{3\epsilon +1/2}}\). Besides, we have:
\begin{equation} \begin{aligned}\sum _{t=1}^T(\alpha _t^3-\alpha _{t+1}^3)||G_t(\mathrm{w}_t,\xi _{t})||^2 &= \sum _{t=1}^T(\alpha _t-\alpha _{t+1})(\alpha _t^2+\alpha _t\alpha _{t+1}+\alpha _{t+1}^2)||G_t(\mathrm{w}_t,\xi _{t})||^2\\ &=\ \sum _{t=1}^T(\alpha _t-\alpha _{t+1})(\alpha _t^2+\alpha _t\alpha _{t+1})||G_t(\mathrm{w}_t,\xi _{t})||^2\\ &\quad \ +\ (\alpha _t-\alpha _{t+1})\alpha _{t+1}^2||G_t(\mathrm{w}_t,\xi _{t})||^2\\ &\le \ \sum _{t=1}^T\alpha _t(\alpha _t^2-\alpha _{t+1}^2)||G_t(\mathrm{w}_t,\xi _{t})||^2\\ &\quad \ +\ \max _{1\le t\le T}(\alpha _t-\alpha _{t+1})\sum _{t=1}^T\alpha _{t+1}^2||G_t(\mathrm{w}_t,\xi _{t})||^2\\ &\le \ \max _{1\le t\le T}\alpha _t||G_t(\mathrm{w}_t,\xi _{t})||^2(\alpha _1^2-\alpha _{T+1}^2)+\max _{1\le t\le T}(\alpha _t-\alpha _{t+1})\frac{\alpha ^2}{2\epsilon \beta ^{2\epsilon }}. \end{aligned} \end{equation}
(70)
Combining the above results, we obtain the second estimate of Lemma 9.

A.3.11 Proof of Lemma 10.

Let \(S_T=\Sigma _{t=1}^T\alpha _t||\epsilon _t||^2\); according to Lemma 5:
\begin{equation} \begin{aligned}&\sum _{t=1}^T\alpha _t||\epsilon _t||^2\le \left(1+\frac{1}{\eta }\right)\gamma \sum _{t=1}^T\alpha _t^3||G_t(\mathrm{w}_t,\xi _t)||^2 +(1+\eta)\gamma \sum _{t=1}^T\alpha _{t-1}||\epsilon _{t-1}||^2. \end{aligned} \end{equation}
(71)
After substitution, we derive:
\begin{equation} S_T \le \Bigg (1+\frac{1}{\eta }\Bigg)\gamma Q_T^3+(1+\eta)\gamma S_{T-1}. \end{equation}
(72)
Using induction and noticing that \(S_1 = 0\):
\begin{equation} \begin{aligned}S_T&\le \frac{1}{\eta }\sum _{t=1}^{T-1}((1+\eta)\gamma)^tQ_{T-t+1}^3\\ &\le \frac{1}{\eta }\max _{2\le t\le T}Q_{t}^3\sum _{t=1}^{T-1}((1+\eta)\gamma)^t. \end{aligned} \end{equation}
(73)
\(Q_t^3\) is a non-decreasing function of \(t\); thus, \(\max _{2\le t\le T}Q_{t}^3=Q_T^3\). Finally, we derive the following estimate:
\begin{equation} S_T \le \frac{1}{\eta }Q_T^3\sum _{t=1}^{T-1}((1+\eta)\gamma)^t. \end{equation}
(74)

