
Auto-weighted Robust Federated Learning with Corrupted Data Sources

Published: 11 June 2022

Abstract

Federated learning provides a communication-efficient and privacy-preserving training process by enabling learning statistical models with massive participants without accessing their local data. Standard federated learning techniques that naively minimize an average loss function are vulnerable to data corruptions from outliers, systematic mislabeling, or even adversaries. In this article, we address this challenge by proposing Auto-weighted Robust Federated Learning (ARFL), a novel approach that jointly learns the global model and the weights of local updates to provide robustness against corrupted data sources. We prove a learning bound on the expected loss with respect to the predictor and the weights of clients, which guides the definition of the objective for robust federated learning. We present an objective that minimizes the weighted sum of empirical risk of clients with a regularization term, where the weights can be allocated by comparing the empirical risk of each client with the average empirical risk of the best \(p\) clients. This method can downweight the clients with significantly higher losses, thereby lowering their contributions to the global model. We show that this approach achieves robustness when the data of corrupted clients is distributed differently from the benign ones. To optimize the objective function, we propose a communication-efficient algorithm based on the blockwise minimization paradigm. We conduct extensive experiments on multiple benchmark datasets, including CIFAR-10, FEMNIST, and Shakespeare, considering different neural network models. The results show that our solution is robust against different scenarios, including label shuffling, label flipping, and noisy features, and outperforms the state-of-the-art methods in most scenarios.
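To make the weighting rule concrete, the following NumPy sketch illustrates the allocation and aggregation steps described above. It is a minimal illustration under stated assumptions, not the authors' released implementation: the function names are hypothetical, each client \(i\) is assumed to report its empirical loss \(\hat{\mathcal {L}}_i\) and sample count \(m_i\), the number of retained clients \(p\) is assumed to be chosen as in the paper (its selection rule is not reproduced here), and \(\lambda\) denotes the regularization coefficient.

import numpy as np

def allocate_weights(losses, sizes, lam, p):
    # Compare each client's empirical loss with the size-weighted average loss
    # of the best p clients (plus a lambda-dependent slack) and clip negative
    # weights to zero, so that high-loss clients stop contributing.
    losses = np.asarray(losses, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    best = np.argsort(losses)[:p]                      # the p lowest-loss clients
    eta = (sizes[best] @ losses[best] + lam) / sizes[best].sum()
    alpha = np.maximum(sizes * (eta - losses) / lam, 0.0)
    # When p satisfies the condition in the paper, the weights already sum to
    # one; the normalization below is only a safeguard for arbitrary p.
    return alpha / alpha.sum()

def aggregate(client_params, alpha):
    # Global model parameters as the alpha-weighted average of client parameters.
    return np.tensordot(alpha, np.stack(client_params), axes=1)

For instance, with \(m_i = 100\) for all clients, three benign clients at loss 0.3, one corrupted client at loss 2.0, \(p = 3\), and \(\lambda = 100\), the corrupted client receives weight zero and the benign clients share the weight equally.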
Appendices

A Proof of Theorem 1

Proof.
Write:
\begin{equation} \mathcal {L}_{\mathcal {D}_{\mathbf {\alpha }}}\left(h\right) \le \hat{\mathcal {L}}_{\mathcal {D}_{\mathbf {\alpha }}}\left(h\right) + \sup _{f\in \mathcal {H}}\left(\mathcal {L}_{\mathcal {D}_{\mathbf {\alpha }}}\left(f\right)- \hat{\mathcal {L}}_{\mathcal {D}_{\mathbf {\alpha }}}\left(f\right)\right) . \end{equation}
(12)
To link the second term to its expectation, we prove the following:
Lemma 1.
Define the function \(\phi :\left(\mathcal {X}\times \mathcal {Y}\right)^m \rightarrow \mathbb {R}\) by:
\[\phi \left(\lbrace x_{1,1}, y_{1,1}\rbrace , \ldots , \lbrace x_{N, m_N}, y_{N, m_N}\rbrace \right) = \sup _{f\in \mathcal {H}}\left(\mathcal {L}_{\mathcal {D}_{\mathbf {\alpha }}}\left(f\right)- \hat{\mathcal {L}}_{\mathcal {D}_{\mathbf {\alpha }}}\left(f\right)\right).\]
Denote for brevity \(z_{i,j} = \lbrace x_{i,j}, y_{i,j}\rbrace\). Then, for any \(i \in \lbrace 1, 2, \ldots , N\rbrace , j \in \lbrace 1, 2, \ldots , m_i\rbrace\):
\begin{equation} \begin{split} \sup _{z_{1,1}, \ldots , z_{N, m_N}, z_{i,j}^{^{\prime }}} |\phi (z_{1,1},\ldots , z_{i,j}, \ldots , z_{N, m_N}) - \phi (z_{1,1}, \ldots , z_{i,j}^{^{\prime }}, \ldots , z_{N, m_N})| \le \frac{\alpha _i}{m_i}\mathcal {M} \end{split} . \end{equation}
(13)
Proof.
Fix any \(i, j\) and any \(z_{1,1}, \ldots , z_{N, m_N}, z_{i,j}^{^{\prime }}\). Denote the \(\alpha\)-weighted empirical average of the loss with respect to the sample \(z_{1,1}, \ldots , z_{i,j}^{^{\prime }}, \ldots , z_{N, m_N}\) by \(\mathcal {L}_{\mathcal {D}_{\mathbf {\alpha }}}^{^{\prime }}\). Then, we have that:
\begin{align*} |\phi (\ldots , z_{i,j}, \ldots) - \phi (\ldots , z_{i,j}^{^{\prime }}, \ldots)| & = |\sup _{f\in \mathcal {H}}\left(\mathcal {L}_{\mathcal {D}_{\mathbf {\alpha }}}\left(f\right) - \hat{\mathcal {L}}_{\mathcal {D}_{\mathbf {\alpha }}}\left(f\right)\right) - \sup _{f\in \mathcal {H}} \left(\mathcal {L}_{\mathcal {D}_{\mathbf {\alpha }}}\left(f\right) - \hat{\mathcal {L}}_{\mathcal {D}_{\mathbf {\alpha }}}^{^{\prime }}\left(f\right)\right)| \\ & \le |\sup _{f\in \mathcal {H}}\left(\hat{\mathcal {L}}^{^{\prime }}_{\mathcal {D}_{\mathbf {\alpha }}}\left(f\right) - \hat{\mathcal {L}}_{\mathcal {D}_{\mathbf {\alpha }}}\left(f\right)\right)| \\ & = \frac{\alpha _i}{m_i}|\sup _{f\in \mathcal {H}}\left(\ell _f(z^{\prime }_{i,j}) - \ell _f(z_{i,j})\right)| \\ & \le \frac{\alpha _i}{m_i}\mathcal {M} . \end{align*}
Note that the inequality used above holds because the functions inside the supremum are bounded.□
Let \(S\) denote a random sample of size \(m\) drawn from the same distributions as those generating our data (i.e., \(m_i\) samples from \(\mathcal {D}_i\) for each \(i\)). Now, using Lemma 1, McDiarmid’s inequality gives:
\begin{equation*} \begin{split} \mathbb {P}\left(\phi (S) - \mathbb {E}(\phi (S)) \ge t\right) & \le \exp \left(-\frac{2t^2}{\sum _{i=1}^N\sum _{j=1}^{m_i}\frac{\alpha _i^2}{m_i^2}\mathcal {M}^2} \right) \\ & = \exp \left(-\frac{2t^2}{\mathcal {M}^2\sum _{i=1}^N \frac{\alpha _i^2}{m_i}}\right) . \end{split} \end{equation*}
For any \(\delta \gt 0\), setting the right-hand side above to be \(\delta /4\) and using Equation (12), we obtain that with probability at least \(1-\delta /4\):
\begin{equation} \begin{split} \mathcal {L}_{\mathcal {D}_{\mathbf {\alpha }}}\left(h\right) \le \hat{\mathcal {L}}_{\mathcal {D}_{\mathbf {\alpha }}}\left(h\right) & + \mathbb {E}_S\left(\sup _{f\in \mathcal {H}}\left(\mathcal {L}_{\mathcal {D}_{\mathbf {\alpha }}} (f) - \hat{\mathcal {L}}_{\mathcal {D}_{\mathbf {\alpha }}}(f)\right)\right) + \sqrt {\frac{\log \left(\frac{4}{\delta }\right)\mathcal {M}^2}{2}}\sqrt {\sum _{i=1}^N\frac{\alpha _i^2}{m_i}} . \end{split} \end{equation}
(14)
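Concretely, setting the right-hand side of the McDiarmid bound equal to \(\delta /4\) and solving for \(t\) gives the deviation term in Equation (14):
\begin{equation*} \exp \left(-\frac{2t^2}{\mathcal {M}^2\sum _{i=1}^N \frac{\alpha _i^2}{m_i}}\right) = \frac{\delta }{4} \quad \Longleftrightarrow \quad t = \sqrt {\frac{\log \left(\frac{4}{\delta }\right)\mathcal {M}^2}{2}}\sqrt {\sum _{i=1}^N\frac{\alpha _i^2}{m_i}} . \end{equation*}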
To deal with the expected loss inside the second term, introduce a ghost sample (denoted by \(S^{\prime }\)), drawn from the same distributions as our original sample (denoted by \(S\)). Denoting the weighted empirical loss with respect to the ghost sample by \(\hat{\mathcal {L}}_{\mathcal {D}_{\mathbf {\alpha }}}^{\prime }\), setting \(\beta _i = m_i/m\) for all \(i\), and using the convexity of the supremum, we obtain:
\begin{equation*} \begin{split} \mathbb {E}_S \left(\sup _{f\in \mathcal {H}}\left(\mathcal {L}_{\mathcal {D}_{\mathbf {\alpha }}} (f) - \hat{\mathcal {L}}_{\mathcal {D}_{\mathbf {\alpha }}}(f)\right)\right) & = \mathbb {E}_{S}\left(\sup _{f\in \mathcal {H}}\left(\mathbb {E}_{S^{\prime }}\left(\hat{\mathcal {L}}_{\mathcal {D}_{\mathbf {\alpha }}}^{\prime }(f)\right) - \hat{\mathcal {L}}_{\mathcal {D}_{\mathbf {\alpha }}}(f)\right)\right) \\ & \le \mathbb {E}_{S, S^{\prime }} \left(\sup _{f\in \mathcal {H}}\left(\hat{\mathcal {L}}_{\mathcal {D}_{\mathbf {\alpha }}}^{\prime }(f) - \hat{\mathcal {L}}_{\mathcal {D}_{\mathbf {\alpha }}}(f) \right)\right) \\ & = \mathbb {E}_{S, S^{\prime }}\left(\sup _{f\in \mathcal {H}}\left(\frac{1}{m}\sum _{i=1}^N\sum _{j=1}^{m_i}\frac{\alpha _i}{\beta _i}\left(\ell _f(z^{\prime }_{i,j}) - \ell _f(z_{i,j})\right)\right)\right) . \end{split} \end{equation*}
Introducing \(m\) independent Rademacher random variables \(\sigma _{i,j}\) and noting that \(\ell _f(z^{\prime }) - \ell _f(z)\) and \(\sigma (\ell _f(z^{\prime }) - \ell _f(z))\) have the same distribution whenever \(z\) and \(z^{\prime }\) have the same distribution, we obtain:
\begin{equation*} \begin{split} \mathbb {E}_S \left(\sup _{f\in \mathcal {H}}\left(\mathcal {L}_{\mathcal {D}_{\mathbf {\alpha }}} (f) - \hat{\mathcal {L}}_{\mathcal {D}_{\mathbf {\alpha }}}(f)\right)\right) & \le \mathbb {E}_{S, S^{\prime }, \sigma }\left(\sup _{f\in \mathcal {H}}\left(\frac{1}{m}\sum _{i=1}^N\sum _{j=1}^{m_i}\frac{\alpha _i}{\beta _i}\sigma _{i,j}\left(\ell _f(z^{\prime }_{i,j}) - \ell _f(z_{i,j})\right)\right)\right) \\ & \le \mathbb {E}_{S^{\prime }, \sigma }\left(\sup _{f\in \mathcal {H}}\left(\frac{1}{m}\sum _{i=1}^N\sum _{j=1}^{m_i}\frac{\alpha _{i}}{\beta _{i}}\sigma _{i,j}\ell _f(z^{\prime }_{i,j})\right)\right) + \mathbb {E}_{S, \sigma }\left(\sup _{f\in \mathcal {H}}\left(\frac{1}{m}\sum _{i=1}^N\sum _{j=1}^{m_i}\frac{\alpha _{i}}{\beta _{i}}(-\sigma _{i,j})\ell _f(z_{i,j})\right)\right) \\ & = 2\,\mathbb {E}_{S, \sigma }\left(\sup _{f\in \mathcal {H}}\left(\frac{1}{m}\sum _{i=1}^N\sum _{j=1}^{m_i}\frac{\alpha _{i}}{\beta _{i}}\sigma _{i,j}\ell _f(z_{i,j})\right)\right). \end{split} \end{equation*}
We can now link the last term to the empirical analog of the Rademacher complexity by using McDiarmid’s inequality (with an observation similar to Lemma 1). Putting this together, we obtain that for any \(\delta \gt 0\), with probability at least \(1 - \delta /2\):
\begin{equation} \begin{split} \mathcal {L}_{\mathcal {D}_{\mathbf {\alpha }}}\left(h\right) & \le \hat{\mathcal {L}}_{\mathcal {D}_{\mathbf {\alpha }}} \left(h\right) + 2\mathbb {E}_{\sigma }\left(\sup _{f\in \mathcal {H}}\left(\frac{1}{m}\sum _{i=1}^N\sum _{j=1}^{m_i}\frac{\alpha _{i}}{\beta _{i}}\sigma _{i,j}\ell _f(z_{i,j})\right)\right) + 3 \sqrt {\frac{\log \left(\frac{4}{\delta }\right)\mathcal {M}^2}{2}}\sqrt {\sum _{i=1}^N\frac{\alpha _i^2}{m_i}} . \end{split} \end{equation}
(15)
Finally, note that:
\begin{align*} \mathbb {E}_{\sigma } \left(\sup _{f\in \mathcal {H}}\left(\frac{1}{m}\sum _{i=1}^N\sum _{j=1}^{m_i}\frac{\alpha _{i}}{\beta _{i}}\sigma _{i,j}\ell _f(z_{i,j})\right)\right) & \le \mathbb {E}_{\sigma }\left(\sum _{i=1}^{N}\alpha _i\sup _{f\in \mathcal {H}}\left(\frac{1}{m_i}\sum _{j=1}^{m_i}\sigma _{i,j}\ell _f(z_{i,j})\right)\right) \\ & = \sum _{i=1}^N \alpha _i \mathbb {E}_{\sigma }\left(\sup _{f\in \mathcal {H}}\left(\frac{1}{m_i}\sum _{j=1}^{m_i}\sigma _{i,j}\ell _f(z_{i,j})\right)\right) \\ & = \sum _{i=1}^N \alpha _i \mathcal {R}_i \left(\mathcal {H}\right) . \end{align*}
A bound of the same form on \(\hat{\mathcal {L}}_{\mathcal {D}_{\mathbf {\alpha }}}(h) - \mathcal {L}_{\mathcal {D}_{\mathbf {\alpha }}}(h)\), holding with probability at least \(1 - \delta /2\), follows by a similar argument. The result then follows by applying the union bound.□
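Explicitly, combining Equation (15) with the bound above gives, with probability at least \(1-\delta /2\),
\begin{equation*} \mathcal {L}_{\mathcal {D}_{\mathbf {\alpha }}}\left(h\right) \le \hat{\mathcal {L}}_{\mathcal {D}_{\mathbf {\alpha }}}\left(h\right) + 2\sum _{i=1}^N \alpha _i \mathcal {R}_i\left(\mathcal {H}\right) + 3\sqrt {\frac{\log \left(\frac{4}{\delta }\right)\mathcal {M}^2}{2}}\sqrt {\sum _{i=1}^N\frac{\alpha _i^2}{m_i}} , \end{equation*}
and the two-sided statement holds with probability at least \(1-\delta \) after the union bound.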

B Proof of Theorem 2

Proof.
The Lagrangian function of Equation (6) is
\begin{equation} \mathbb {L} = \mathbf {\alpha }^\top {\hat{\mathcal {L}}}(\mathbf {w}) + \frac{\lambda }{2} || \mathbf {\alpha } \circ \mathbf {m}^{\circ - \frac{1}{2}} ||^2_2 - \mathbf {\alpha }^{\top } \mathbf {\beta } - \eta (\mathbf {\alpha }^{\top } \mathbf {1} - 1), \end{equation}
(16)
where \(\hat{\mathcal {L}}(\mathbf {w}) = [\hat{\mathcal {L}}_1(\mathbf {w}),\hat{\mathcal {L}}_2(\mathbf {w}),\ldots , \hat{\mathcal {L}}_N(\mathbf {w})]^\intercal\), \(\circ\) denotes the elementwise (Hadamard) power, so that \(\mathbf {m}^{\circ -\frac{1}{2}}\) has entries \(1/\sqrt{m_i}\), and \(\mathbf {\beta }\) and \(\eta\) are the Lagrange multipliers. Then, the following Karush-Kuhn-Tucker (KKT) conditions hold:
\begin{align} \partial _{\mathbf {\alpha }} \mathbb {L}(\mathbf {\alpha }, \mathbf {\beta }, \eta) &= 0 , \end{align}
(17)
\begin{align} \mathbf {\alpha }^\intercal \mathbf {1} - 1 &= 0, \end{align}
(18)
\begin{align} \mathbf {\alpha } &\ge 0, \end{align}
(19)
\begin{align} \mathbf {\beta } &\ge 0, \end{align}
(20)
\begin{align} \alpha _i \beta _i &= 0, \quad \forall i = 1, 2, \ldots , N. \end{align}
(21)
According to Equation (17), we have:
\begin{equation} \alpha _i = \frac{m_i(\beta _i + \eta - \hat{\mathcal {L}}_i(\mathbf {w}))}{\lambda }. \end{equation}
(22)
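In detail, since the regularizer equals \(\frac{\lambda }{2}\sum _{i=1}^N \frac{\alpha _i^2}{m_i}\), the stationarity condition (17) reads, for each coordinate \(i\),
\begin{equation*} \frac{\partial \mathbb {L}}{\partial \alpha _i} = \hat{\mathcal {L}}_i(\mathbf {w}) + \frac{\lambda \alpha _i}{m_i} - \beta _i - \eta = 0 , \end{equation*}
which rearranges to Equation (22).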
Since \(\beta _i \ge 0\), we discuss the following cases:
(1)
When \(\beta _i = 0\), we have \(\alpha _i = \frac{m_i(\eta - \hat{\mathcal {L}}_i(\mathbf {w}))}{\lambda } \ge 0\); since \(m_i, \lambda \gt 0\), this further implies \(\eta - \hat{\mathcal {L}}_i(\mathbf {w}) \ge 0\).
(2)
When \(\beta _i \gt 0\), from the condition \(\alpha _i \beta _i = 0\), we have \(\alpha _i = 0\).
Therefore, the optimal solution to Equation (6) is given by:
\begin{equation} \alpha _i(\mathbf {w}) = \left[\frac{m_i (\eta - \hat{\mathcal {L}}_i(\mathbf {w}))}{\lambda }\right]_{+}, \end{equation}
(23)
where \([\cdot ]_+ = \max (0, \cdot)\).
Since only clients with \(\alpha _i \gt 0\) contribute to the constraint \(\mathbf {\alpha }^\intercal \mathbf {1} = 1\), and assuming the clients are ordered so that the first \(p\) clients are those with nonzero weights, we have \(\sum _{i=1}^p \alpha _i = 1\). Summing Equation (23) over these clients yields:
\begin{equation} \eta = \frac{\sum _{i=1}^{p} m_i \hat{\mathcal {L}}_i(\mathbf {w}) + \lambda }{\sum _{i=1}^{p} m_i}. \end{equation}
(24)
From the condition \(\eta - \hat{\mathcal {L}}_i(\mathbf {w}) \ge 0\) we obtain Equations (7) and (8). Finally, plugging Equation (24) into Equation (23) yields Equation (9).□
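For reference, carrying out the final substitution of Equation (24) into Equation (23) gives the explicit weight of each client:
\begin{equation*} \alpha _i(\mathbf {w}) = \left[\frac{m_i}{\lambda }\left(\frac{\sum _{j=1}^{p} m_j \hat{\mathcal {L}}_j(\mathbf {w}) + \lambda }{\sum _{j=1}^{p} m_j} - \hat{\mathcal {L}}_i(\mathbf {w})\right)\right]_{+} , \end{equation*}
so a client whose empirical loss exceeds the regularized average loss of the best \(p\) clients receives weight zero.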

