Zeno: Distributed Stochastic Gradient Descent with Suspicion-based Fault-tolerance

Anonymous Authors¹
Abstract

We present Zeno, a technique to make distributed machine learning, particularly Stochastic Gradient Descent (SGD), tolerant to an arbitrary number of faulty workers. This generalizes previous results that assumed a majority of non-faulty nodes; we need to assume only one non-faulty worker. Our key idea is to suspect workers that are potentially defective. Since this is likely to lead to false positives, we use a ranking-based preference mechanism. We prove the convergence of SGD for non-convex problems under these scenarios. Experimental results show that Zeno outperforms existing approaches.

1. Introduction

In distributed machine learning, one of the hardest problems today is fault-tolerance. Faulty workers may take arbitrary actions or modify their portion of the data and/or models arbitrarily. In addition to deliberate adversarial attacks, it is also common for workers to suffer hardware or software failures, such as bit-flipping in the memory or the communication media. While fault-tolerance has been studied for distributed machine learning (Blanchard et al., 2017; Chen et al., 2017; Yin et al., 2018; Feng et al., 2014; Su & Vaidya, 2016a;b; Alistarh et al., 2018), much of the work on fault-tolerant machine learning makes strong assumptions. For instance, a common assumption is that no more than 50% of the workers are faulty (Blanchard et al., 2017; Chen et al., 2017; Yin et al., 2018; Su & Vaidya, 2016a; Alistarh et al., 2018).

We present Zeno, a new technique that generalizes the failure model so that we only require at least one non-faulty (good) worker. In particular, faulty gradients may pretend to be good by behaving similarly to the correct gradients in variance and magnitude, making them hard to distinguish. It is also possible that different groups of workers are faulty in different iterations, which means that we cannot simply identify a worker that is always faulty.

Figure 1. Parameter Server architecture. (Server and workers; 1: Pull, 2: Gradient Computation, 3: Push, 4: Aggregation.)

We focus on the problem of Stochastic Gradient Descent (SGD). We use the Parameter Server (PS) architecture (Li et al., 2014a;b) for distributed SGD. As illustrated in Figure 1, processes are composed of server nodes and worker nodes. In each SGD iteration, the workers pull the latest model from the servers, estimate the gradients using the locally sampled training data, then push the gradient estimators to the servers. The servers aggregate the gradient estimators and update the model using the aggregated gradients.

Our approach, in a nutshell, is the following. We treat each candidate gradient estimator as a suspect. We compute a score using a stochastic zero-order oracle. This ranking indicates how trustworthy the given worker is. Then, we take the average over the several candidates with the highest scores. This allows us to tolerate a large number of incorrect gradients. We prove that the convergence is as fast as fault-free SGD. The variance falls as the number of non-faulty workers increases.

To the best of our knowledge, this paper is the first to theoretically and empirically study cases where a majority of workers are faulty for non-convex problems. In summary, our contributions are:

• A new approach for SGD with fault-tolerance, which works with an arbitrarily large number of faulty nodes as long as there is at least one non-faulty node.

• Theoretically, the proposed algorithm converges as fast as distributed synchronous SGD without faulty workers, with the same asymptotic time complexity.

• Experimental results validating that 1) existing majority-based robust algorithms may fail even when the number of faulty workers is lower than the majority, and 2) Zeno gracefully handles such cases.

• The effectiveness of Zeno also extends to the case where the workers use disjoint local data to train the model, i.e., the local training data are not identically distributed across different workers. Theoretical and experimental analysis is also provided for this case.

¹Anonymous Institution, Anonymous City, Anonymous Region, Anonymous Country. Correspondence to: Anonymous Author <anon.email@domain.com>.

Preliminary work. Under review by the International Conference on Machine Learning (ICML). Do not distribute.
2. Related work

Many approaches for improving failure tolerance are based on robust statistics. For instance, Chen et al. (2017); Su & Vaidya (2016a;b) use the geometric median as the aggregation rule. Yin et al. (2018) establish statistical error rates for the marginal trimmed mean as the aggregation rule. Similar to these papers, our proposed algorithm also works under Byzantine settings.

There are also robust gradient aggregation rules that are not based on robust statistics. For example, Blanchard et al. (2017) propose Krum, which selects the candidate with the minimal local sum of Euclidean distances. DRACO (Chen et al., 2018) uses coding theory to ensure robustness.

Alistarh et al. (2018) propose a fault-tolerant SGD variant different from the robust aggregation rules. The algorithm utilizes historical information and achieves the optimal sample complexity. However, it requires an estimated upper bound on the variance of the stochastic gradients, which makes it less practical. Furthermore, no empirical results are provided.

In summary, the existing majority-based methods for synchronous SGD (Blanchard et al., 2017; Chen et al., 2017; Yin et al., 2018; Su & Vaidya, 2016a; Alistarh et al., 2018) assume that the non-faulty workers dominate the entire set of workers. Thus, such algorithms can trim the outliers from the candidates. However, in real-world failures or attacks, there is no guarantee that the number of faulty workers can be bounded from above.

3. Model

We consider the following optimization problem:

min_{x ∈ R^d} F(x),

where F(x) = E_{z∼D}[f(x; z)], z is sampled from some unknown distribution D, and d is the number of dimensions. We assume that there exists a minimizer of F(x), which is denoted by x^∗.

We solve this problem in a distributed manner with m workers. In each iteration, each worker samples n independent and identically distributed (i.i.d.) data points from the distribution D, and computes the gradient of the local empirical loss F_i(x) = (1/n) Σ_{j=1}^{n} f(x; z^{i,j}), ∀i ∈ [m], where z^{i,j} is the jth sampled data point on the ith worker. The servers collect and aggregate the gradients sent by the workers, and update the model as follows:

x^{t+1} = x^t − γ^t Aggr({g_i(x^t) : i ∈ [m]}),

where Aggr(·) is an aggregation rule (e.g., averaging), and

g_i(x^t) = ∗ if the ith worker is faulty, and g_i(x^t) = ∇F_i(x^t) otherwise,   (1)

where "∗" represents an arbitrary value.

Formally, we define the failure model in synchronous SGD as follows.

Definition 1. (Failure Model) In the tth iteration, let {v_i^t : i ∈ [m]} be i.i.d. random vectors in R^d, where v_i^t = ∇F_i(x^t). The set of correct vectors {v_i^t : i ∈ [m]} is partially replaced by faulty vectors, which results in {ṽ_i^t : i ∈ [m]}, where ṽ_i^t = g_i(x^t) as defined in Equation (1). In other words, a correct/non-faulty gradient is ∇F_i(x^t), while a faulty gradient, marked as "∗", is assigned an arbitrary value. We assume that q out of the m vectors are faulty, where q < m. Furthermore, the indices of the faulty workers can change across iterations.

We observe that, in the worst case, the failure model in Definition 1 is equivalent to the Byzantine failures introduced in Blanchard et al. (2017); Chen et al. (2017); Yin et al. (2018). In particular, if the failures are caused by attackers, the failure model includes the case where the attackers can collude.

To help understand the failure model in synchronous SGD, we illustrate a toy example in Figure 2.

The notation used in this paper is summarized in Table 1.

Table 1. Notation
  m       Number of workers
  n       Number of samples on each worker
  T       Number of epochs
  [m]     Set of integers {1, . . . , m}
  q       Number of faulty workers
  b       Trim parameter of Zeno
  γ       Learning rate
  ρ       Regularization weight of Zeno
  n_r     Batch size of Zeno
  ‖·‖     All norms in this paper are ℓ2-norms
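To make the update rule and the failure model concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes NumPy; `worker_grads` and `aggregate` are hypothetical stand-ins for the workers' gradient estimators and for the aggregation rule Aggr(·).

```python
import numpy as np

def simulate_iteration(x, worker_grads, faulty, aggregate, gamma=0.1, rng=None):
    """One synchronous iteration under the failure model of Definition 1.

    x            : current model, a d-dimensional NumPy array
    worker_grads : list of m callables; worker_grads[i](x) plays the role of
                   worker i's local gradient estimate (hypothetical helpers)
    faulty       : set of worker indices that are faulty in this iteration
                   (the set may change from iteration to iteration)
    aggregate    : aggregation rule Aggr(.), e.g. lambda G: G.mean(axis=0)
    """
    rng = rng or np.random.default_rng()
    tilde_v = []
    for i, grad_fn in enumerate(worker_grads):
        if i in faulty:
            # Equation (1): a faulty worker may push an arbitrary vector "*".
            tilde_v.append(rng.normal(size=x.shape) * 100.0)
        else:
            tilde_v.append(grad_fn(x))
    # Server update: x_{t+1} = x_t - gamma * Aggr({g_i(x_t) : i in [m]}).
    return x - gamma * aggregate(np.stack(tilde_v))
```

With `aggregate=lambda G: G.mean(axis=0)` this reduces to plain synchronous SGD; the robust rules discussed later (Median, Krum, Zeno) change only the aggregation step.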
We analyze the convergence of SGD with Zeno as the aggregation rule under our failure model. We start with the assumptions required by the convergence guarantees. The two basic assumptions are the smoothness of the loss function and the bounded variance of the (non-faulty) gradient estimators.

5.1. Assumptions

In this section, we highlight the necessary assumption for the stochastic descendant score, followed by the assumptions for the convergence guarantees.

Assumption 1. (Unbiased evaluation) We assume that the stochastic loss function f_r(x), evaluated in the stochastic descendant score in Definition 2, is an unbiased estimator of the global loss function F(x). In other words, E[f_r(x)] = F(x).

Note that we make no assumption on the Zeno batch size n_r or on the variance of f_r(x).

Assumption 2. (Bounded Taylor's Approximation) We assume that f(x; z) has L-smoothness and µ-lower-bounded Taylor's approximation (also called µ-weak convexity):

⟨∇f(x; z), y − x⟩ + (µ/2)‖y − x‖² ≤ f(y; z) − f(x; z) ≤ ⟨∇f(x; z), y − x⟩ + (L/2)‖y − x‖²,

where µ ≤ L and L > 0.

Note that Assumption 2 covers the case of non-convexity by taking µ < 0, non-strong convexity by taking µ = 0, and strong convexity by taking µ > 0.

Theorem 1. Let ṽ̄ = Zeno_b({ṽ_i : i ∈ [m]}). Taking γ ≤ 1/L, ρ = βγ²/2, and β > max(0, −µ), we have

E[F(x − γṽ̄)] − F(x) ≤ −(γ/2)‖∇F(x)‖² + γ(b − q + 1)(m − q)V/(m − b)² + (L + β)γ²G/2.

Corollary 1. Take γ = 1/(L√T), ρ = βγ²/2, and β > max(0, −µ). Using Zeno, with E[∇F_i(x^t)] = ∇F(x^t) for ∀t ∈ {0, . . . , T}, after T iterations, we have

(1/T) Σ_{t=0}^{T−1} E‖∇F(x^t)‖² ≤ O(1/√T) + O((b − q + 1)(m − q)/(m − b)²).

Now, we consider a more general case, where each worker has a disjoint (non-identically distributed) local dataset for training, which results in non-identically distributed gradient estimators. The server is still aware of the entire dataset. For example, in volunteer computing (Meeds et al., 2015; Miura & Harada, 2015), the server/coordinator can assign disjoint tasks/subsets of the training data to the workers, while the server holds the entire training dataset. In this scenario, we have the following convergence guarantee.

Corollary 2. Assume that

F(x) = (1/m) Σ_{i∈[m]} E[F_i(x)],   E[F_i(x)] ≠ E[F_j(x)],
for ∀i, j ∈ [m], i ≠ j. For the stochastic descendant score, we still have E[f_r(x)] = F(x). Assumptions 1, 2, and 3 still hold. Take γ = 1/(L√T), ρ = βγ²/2, and β > max(0, −µ). Using Zeno, after T iterations, we have

(1/T) Σ_{t=0}^{T−1} E‖∇F(x^t)‖² ≤ O(1/√T) + O(b/m) + O(b²(m − q)/(m²(m − b))).

These two corollaries tell us that when using Zeno as the aggregation rule, even if there are failures, the convergence rate can be as fast as that of fault-free distributed synchronous SGD. The variance decreases when the number of workers m increases, or when the estimated number of faulty workers b decreases.

Remark 2. There are two practical concerns for the proposed algorithm. First, by increasing the batch size of f_r(·) (n_r in Definition 2), the stochastic descendant score becomes potentially more stable. However, according to Theorem 1 and Corollaries 1 and 2, the convergence rate is independent of the variance of f_r. Thus, theoretically, we can use a single sample to evaluate the stochastic descendant score. Second, theoretically we need a larger ρ for non-convex problems. However, a larger ρ makes Zeno less sensitive to the descent of the loss function, which potentially increases the risk of aggregating harmful candidates. In practice, we can use a small ρ by assuming local convexity of the loss functions.

5.3. Implementation Details: Time Complexity

Unlike the majority-based aggregation rules, the time complexity of Zeno is not trivial to analyze. Note that the convergence rate is independent of the variance of f_r, which means that we can use a single sample (n_r = 1) to evaluate f_r and achieve the same convergence rate. Furthermore, in general, when evaluating the loss function on a single sample, the time complexity is roughly linear in the number of parameters d. Thus, informally, the time complexity of Zeno is O(dm) per iteration, which is the same as for the Mean and Median aggregation rules. Note that the time complexity of Krum is O(dm²).

6. Experiments

In this section, we evaluate the fault tolerance of the proposed algorithm. We summarize our results here:

• Compared to the baselines, Zeno shows better convergence, even when the faulty workers outnumber the non-faulty ones.

• Zeno is robust to the choices of the hyperparameters, including the Zeno batch size n_r, the weight ρ, and the number of trimmed elements b.

• Zeno also works when training with disjoint local data.

6.1. Datasets and Evaluation Metrics

We conduct experiments on the benchmark CIFAR-10 image classification dataset (Krizhevsky & Hinton, 2009), which is composed of 50k images for training and 10k images for testing. We use a convolutional neural network (CNN) with 4 convolutional layers followed by 1 fully connected layer. The detailed network architecture can be found in our submitted source code (which will also be released upon publication). In each experiment, we launch 20 worker processes. We repeat each experiment 10 times and take the average. We use top-1 accuracy on the testing set and the cross-entropy loss on the training set as the evaluation metrics.

6.1.1. Baselines

We use averaging without failures/attacks as the gold standard, referred to as Mean without failures. Note that this method is not affected by b or q. The baseline aggregation rules are Mean, Median, and Krum, as defined below.

Definition 4. (Median (Yin et al., 2018)) We define the marginal median aggregation rule Median(·) as med = Median({ṽ_i : i ∈ [m]}), where for any j ∈ [d], the jth dimension of med is med_j = median({(ṽ_1)_j, . . . , (ṽ_m)_j}), (ṽ_i)_j is the jth dimension of the vector ṽ_i, and median(·) is the one-dimensional median.

Definition 5. (Krum (Blanchard et al., 2017))

Krum_b({ṽ_i : i ∈ [m]}) = ṽ_k,   k = argmin_{i∈[m]} Σ_{i→j} ‖ṽ_i − ṽ_j‖²,

where i → j denotes that ṽ_j is among the m − b − 2 nearest neighbours of ṽ_i in {ṽ_i : i ∈ [m]}, measured by Euclidean distance.

Note that Krum requires 2b + 2 < m. Thus, b = 8 is the best we can take.
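To make Definitions 4 and 5 concrete, the following is a small NumPy sketch of the Median and Krum baselines; it is only an illustration of the definitions above, not the code used in the experiments.

```python
import numpy as np

def median_aggregate(V):
    """Marginal median (Definition 4): coordinate-wise median of the m candidates.
    V is an (m, d) array whose rows are the gradient estimates."""
    return np.median(V, axis=0)

def krum_aggregate(V, b):
    """Krum (Definition 5): return the candidate whose summed squared distance
    to its m - b - 2 nearest neighbours is smallest. Requires 2b + 2 < m."""
    m = V.shape[0]
    assert 2 * b + 2 < m, "Krum requires 2b + 2 < m"
    # Pairwise squared Euclidean distances between candidates.
    dists = np.sum((V[:, None, :] - V[None, :, :]) ** 2, axis=-1)
    scores = []
    for i in range(m):
        neighbours = np.sort(np.delete(dists[i], i))[: m - b - 2]
        scores.append(neighbours.sum())
    return V[int(np.argmin(scores))]
```

Both rules operate on the stacked candidate gradients; Krum's pairwise-distance computation over all m candidates is what gives it the O(dm²) cost noted in Section 5.3.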
Figure 4. Convergence on i.i.d. training data, without failures. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. ρ = 0.0005. γ = 0.1. Each epoch has 25 iterations. Zeno performs similarly to Mean. Panels: (a) top-1 accuracy on the testing set; (b) cross-entropy loss on the training set.
Figure 5. Convergence on i.i.d. training data, with label-flipping failures. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. ρ = 0.0005. γ = 0.1. Each epoch has 25 iterations. Zeno outperforms all the baselines, especially when q = 12. Panels: (a) top-1 accuracy on the testing set and (b) cross-entropy on the training set with q = 8; (c), (d) the same with q = 12.
6.2. No Failure

We first test convergence when there are no failures. In all the experiments, we take the learning rate γ = 0.1, worker batch size 100, Zeno batch size n_r = 4, and ρ = 0.0005. Each worker computes the gradients on i.i.d. samples. For both Krum and Zeno, we take b = 4. The result is shown in Figure 4. We can see that Zeno converges as fast as Mean. Krum converges slightly slower, but the convergence rate is acceptable.

6.3. Label-flipping Failure

In this section, we test fault tolerance to label-flipping failures. When such failures happen, the faulty workers compute their gradients on training data with "flipped" labels, i.e., any label ∈ {0, . . . , 9} is replaced by 9 − label. Such failures/attacks can be caused by data poisoning or software failures.

In all the experiments, we take the learning rate γ = 0.1, worker batch size 100, Zeno batch size n_r = 4, and ρ = 0.0005. Each non-faulty worker computes the gradients on i.i.d. samples.

The result is shown in Figure 5. As expected, Zeno can tolerate more than half faulty gradients. When q = 8, Zeno performs similarly to Krum. When q = 12, Zeno performs much better than the baselines. When there are faulty gradients, Zeno converges more slowly, but still has better convergence rates than the baselines.

6.4. Bit-flipping Failure

In this section, we test fault tolerance to a more severe kind of failure, in which the bits that control the signs of the floating-point numbers are flipped, e.g., due to a hardware failure. A faulty worker pushes the negative of the true gradient to the servers instead of the true gradient. To make the failure even worse, one of the faulty gradients is copied to and overwrites the other faulty gradients, which means that all the faulty gradients have the same values.

In all the experiments, we take the learning rate γ = 0.1, worker batch size 100, Zeno batch size n_r = 4, and ρ = 0.0005. Each non-faulty worker computes the gradients on i.i.d. samples.

The result is shown in Figure 6. As expected, Zeno can tolerate more than half faulty gradients. Surprisingly, Mean performs well when q = 8. We will discuss this phenomenon in Section 6.7. Zeno outperforms all the baselines. When q = 12, Zeno is the only one that avoids catastrophic divergence. Zeno converges more slowly, but still has better convergence than the baselines.
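The two failure types in Sections 6.3 and 6.4 can be simulated with a few lines. The sketch below is illustrative only (it assumes NumPy and integer CIFAR-10 labels in {0, . . . , 9}) and is not the experiment code.

```python
import numpy as np

def flip_labels(labels):
    """Label-flipping failure (Section 6.3): label -> 9 - label."""
    return 9 - labels

def bit_flip_gradients(grads, faulty):
    """Bit-flipping failure (Section 6.4): a faulty worker pushes the negated
    gradient; one faulty gradient is then copied over all the others, so every
    faulty worker reports the same value.

    grads  : (m, d) array of gradient estimates
    faulty : list of faulty worker indices (q of them)
    """
    grads = grads.copy()
    for i in faulty:
        grads[i] = -grads[i]              # flipped sign bits
    if faulty:
        grads[faulty] = grads[faulty[0]]  # copy-and-overwrite among faulty workers
    return grads
```

Feeding the perturbed gradients to any of the aggregation rules mirrors the setting evaluated in Figures 5 and 6.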
Figure 6. Convergence on i.i.d. training data, with bit-flipping failures. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. ρ = 0.0005. γ = 0.1. Each epoch has 25 iterations. Zeno outperforms all the baselines, especially when q = 12. Panels: (a) top-1 accuracy on the testing set and (b) cross-entropy on the training set with q = 8; (c), (d) the same with q = 12.
Figure 7. Convergence on disjoint (non-i.i.d.) training data, with label-flipping failures. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. ρ = 0.0005. γ = 0.05. Each epoch has 25 iterations. Zeno outperforms all the baselines, especially when q = 12. Panels: (a) top-1 accuracy on the testing set and (b) cross-entropy on the training set with q = 8; (c), (d) the same with q = 12.
6.5. Disjoint Local Training Data

In volunteer computing (Meeds et al., 2015; Miura & Harada, 2015), it is reasonable for the coordinator to assign disjoint tasks/datasets to different workers. As a result, ...
References

Alistarh, D., Allen-Zhu, Z., and Li, J. Byzantine stochastic gradient descent. arXiv preprint arXiv:1803.08917, 2018.

Blanchard, P., Guerraoui, R., Stainer, J., et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pp. 118–128, 2017.

Chen, L., Wang, H., Charles, Z., and Papailiopoulos, D. DRACO: Byzantine-resilient distributed training via redundant gradients. In International Conference on Machine Learning, pp. 902–911, 2018.

Chen, Y., Su, L., and Xu, J. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. POMACS, 1:44:1–44:25, 2017.

Feng, J., Xu, H., and Mannor, S. Distributed robust learning. arXiv preprint arXiv:1409.5937, 2014.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. 2009.

Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pp. 583–598, 2014a.

Li, M., Andersen, D. G., Smola, A. J., and Yu, K. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pp. 19–27, 2014b.

Meeds, E., Hendriks, R., al Faraby, S., Bruntink, M., and Welling, M. MLitB: machine learning in the browser. PeerJ Computer Science, 1, 2015.

Miura, K. and Harada, T. Implementation of a practical distributed calculation system with browsers and javascript, and application to distributed deep learning. CoRR, abs/1503.05743, 2015.

Su, L. and Vaidya, N. H. Fault-tolerant multi-agent optimization: Optimal iterative distributed algorithms. In PODC, 2016a.

Su, L. and Vaidya, N. H. Defending non-Bayesian learning against adversarial attacks. arXiv preprint arXiv:1606.08883, 2016b.

Xie, C., Koyejo, O., and Gupta, I. Phocas: dimensional Byzantine-resilient stochastic gradient descent. arXiv preprint arXiv:1805.09682, 2018.

Yin, D., Chen, Y., Ramchandran, K., and Bartlett, P. Byzantine-robust distributed learning: Towards optimal statistical rates. arXiv preprint arXiv:1803.01498, 2018.
Appendix

A. Proofs

A.1. Preliminaries

We use the following lemma to bound the aggregated vectors.

Lemma 1. (Bounded Score) Without loss of generality, we denote the m − q correct elements in {ṽ_i : i ∈ [m]} as {v_i : i ∈ [m − q]}. Sorting the correct vectors by the stochastic descendant score, we obtain {v_(i) : i ∈ [m − q]}. Then, we have the following inequality:

Score_{γ,ρ}(ṽ_(i), x) ≥ Score_{γ,ρ}(v_(i), x),   ∀i ∈ [m − q],

or, flipping the signs on both sides, equivalently

f_r(x − γṽ_(i)) − f_r(x) + ρ‖ṽ_(i)‖² ≤ f_r(x − γv_(i)) − f_r(x) + ρ‖v_(i)‖²,   ∀i ∈ [m − q].

Proof. We prove the lemma by contradiction. Assume that Score_{γ,ρ}(ṽ_(i), x) < Score_{γ,ρ}(v_(i), x). Then there are at least i correct vectors with greater scores than ṽ_(i). However, because ṽ_(i) is the ith element in {ṽ_(i) : i ∈ [m]}, there should be at most i − 1 vectors with greater scores than it, which yields a contradiction.
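Lemma 1 uses the stochastic descendant score in the form Score_{γ,ρ}(v, x) = f_r(x) − f_r(x − γv) − ρ‖v‖², and the proofs below use Zeno_b({ṽ_i : i ∈ [m]}) = (1/(m − b)) Σ_{i=1}^{m−b} ṽ_(i), the average of the m − b candidates with the highest scores. The following NumPy sketch illustrates the rule under exactly those two formulas; `loss_fn` is a hypothetical stand-in for f_r evaluated on a small Zeno mini-batch of n_r samples, and this is not the authors' implementation.

```python
import numpy as np

def zeno_aggregate(x, candidates, loss_fn, gamma, rho, b):
    """Sketch of Zeno_b: average the (m - b) candidates with the highest
    stochastic descendant score, as used in Lemma 1 and Theorem 1.

    x          : current model (d-dimensional NumPy array)
    candidates : list of m gradient estimates {v_i}
    loss_fn    : f_r(.), a stochastic evaluation of the loss on a Zeno
                 mini-batch of n_r samples (hypothetical helper); in practice
                 the same mini-batch would be reused for every candidate
    gamma, rho : learning rate and regularization weight of the score
    b          : trim parameter (number of candidates excluded)
    """
    fx = loss_fn(x)
    scores = []
    for v in candidates:
        # Score_{gamma,rho}(v, x) = f_r(x) - f_r(x - gamma*v) - rho*||v||^2
        scores.append(fx - loss_fn(x - gamma * v) - rho * np.dot(v, v))
    m = len(candidates)
    top = np.argsort(scores)[::-1][: m - b]  # indices of the m - b highest scores
    return np.mean([candidates[i] for i in top], axis=0)
```

Setting n_r = 1 inside `loss_fn` corresponds to the single-sample evaluation discussed in Section 5.3, giving an O(dm) cost per iteration.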
A.2. Convergence guarantees

For general non-strongly convex functions and non-convex functions, we provide the following convergence guarantees.

Theorem 1. For ∀x ∈ R^d, denote

ṽ_i = ∗ if the ith worker is Byzantine, and ṽ_i = ∇F_i(x) otherwise,

where i ∈ [m], and ṽ̄ = Zeno_b({ṽ_i : i ∈ [m]}). Taking γ ≤ 1/L and ρ = βγ²/2, where β = 0 if µ ≥ 0 and β ≥ |µ| otherwise, we have

E[F(x − γṽ̄)] − F(x) ≤ −(γ/2)‖∇F(x)‖² + γ(b − q + 1)(m − q)V/(m − b)² + (L + β)γ²G/2.

Proof. Without loss of generality, we denote the m − q correct elements in {ṽ_i : i ∈ [m]} as {v_i : i ∈ [m − q]}, where E[v_i] = ∇F(x). Sorting the correct vectors by the stochastic descendant score, we obtain {v_(i) : i ∈ [m − q]}. We also sort the ṽ_i by the stochastic descendant score and obtain {ṽ_(i) : i ∈ [m]}.

According to the definition, ṽ̄ = Zeno_b({ṽ_i : i ∈ [m]}) = (1/(m − b)) Σ_{i=1}^{m−b} ṽ_(i). Furthermore, we denote v̄ = (1/(m − b)) Σ_{i=1}^{m−b} v_(i).

Using Assumption 2, we have

f_r(x − γṽ_(i)) ≥ f_r(x − γṽ̄) + ⟨∇f_r(x − γṽ̄), γ(ṽ̄ − ṽ_(i))⟩ + (µγ²/2)‖ṽ̄ − ṽ_(i)‖²,

for ∀i ∈ [m − b].
Now, taking the expectation with respect to the ṽ_(i) on both sides and using E‖v_(i)‖² ≤ G, we have

E[F(x − γṽ̄)] − F(x)
≤ −(γ/2)‖∇F(x)‖² + (γ/2)E‖∇F(x) − v̄‖² + ((L + β)γ²/(2(m − b))) Σ_{i=1}^{m−b} E‖v_(i)‖²
≤ −(γ/2)‖∇F(x)‖² + (γ/2)E‖∇F(x) − v̄‖² + (L + β)γ²G/2.

Now we just need to bound E‖∇F(x) − v̄‖². For convenience, we denote g = ∇F(x). Note that for an arbitrary subset S ⊆ [m − q] with |S| = m − b, we have the following bound:

E‖(1/(m − b)) Σ_{i∈S} (v_i − g)‖²
= E‖(1/(m − b)) [Σ_{i∈[m−q]} (v_i − g) − Σ_{i∉S} (v_i − g)]‖²
≤ 2E‖(1/(m − b)) Σ_{i∈[m−q]} (v_i − g)‖² + 2E‖(1/(m − b)) Σ_{i∉S} (v_i − g)‖²
= (2(m − q)²/(m − b)²) E‖(1/(m − q)) Σ_{i∈[m−q]} (v_i − g)‖² + (2(b − q)²/(m − b)²) E‖(1/(b − q)) Σ_{i∉S} (v_i − g)‖²
≤ (2(m − q)²/(m − b)²) (V/(m − q)) + (2(b − q)²/(m − b)²) E[Σ_{i∈[m−q]} ‖v_i − g‖²]/(b − q)
≤ (2(m − q)²/(m − b)²) (V/(m − q)) + (2(b − q)²/(m − b)²) ((m − q)V/(b − q))
= 2(b − q + 1)(m − q)V/(m − b)².

Putting all the ingredients together, we obtain the desired result:

E[F(x − γṽ̄)] − F(x) ≤ −(γ/2)‖∇F(x)‖² + γ(b − q + 1)(m − q)V/(m − b)² + (L + β)γ²G/2.

Corollary 1. Take γ = 1/(L√T) and ρ = βγ²/2, where β is the same as in Theorem 1. Using Zeno, after T iterations, we have

(1/T) Σ_{t=0}^{T−1} E‖∇F(x^t)‖²
≤ [2L(F(x^0) − F(x^∗)) + (L + β)G/L] (1/√T) + 2(b − q + 1)(m − q)V/(m − b)²
= O(1/√T) + O((b − q + 1)(m − q)/(m − b)²).

Proof. Taking x = x^t and x^{t+1} = x − γ Zeno_b({ṽ_i : i ∈ [m]}), and using Theorem 1, we have

E[F(x^{t+1}) − F(x^t)] ≤ −(γ/2)‖∇F(x^t)‖² + γ(b − q + 1)(m − q)V/(m − b)² + (L + β)γ²G/2.
Now we just need to bound E‖∇F(x) − v̄‖². We define S_1 = {∇F_i(x) : i ∈ [m]} \ {v_(i) : i ∈ [m − b]}. Note that |S_1| = b.

E‖∇F(x) − v̄‖²
= E‖(1/m) Σ_{i∈[m]} E[∇F_i(x)] − (1/(m − b)) Σ_{i=1}^{m−b} v_(i)‖²
≤ 2E‖(1/m) Σ_{i∈[m]} E[∇F_i(x)] − (1/m) Σ_{i=1}^{m−b} v_(i)‖² + 2E‖(1/m) Σ_{i=1}^{m−b} v_(i) − (1/(m − b)) Σ_{i=1}^{m−b} v_(i)‖²
≤ 4E‖(1/m) Σ_{i∈[m]} E[∇F_i(x)] − (1/m) Σ_{i=1}^{m} ∇F_i(x)‖² + 4E‖(1/m) Σ_{v∈S_1} v‖² + 2E‖(1/m) Σ_{i=1}^{m−b} v_(i) − (1/(m − b)) Σ_{i=1}^{m−b} v_(i)‖²
≤ 4V/m + (4b²/m²) E‖(1/b) Σ_{v∈S_1} v‖² + 2E‖(1/m) Σ_{i=1}^{m−b} v_(i) − (1/(m − b)) Σ_{i=1}^{m−b} v_(i)‖²
≤ 4V/m + (4b²/m²)(mG/b) + 2(1/m − 1/(m − b))² E‖Σ_{i=1}^{m−b} v_(i)‖²
≤ 4V/m + 4bG/m + 2(b/(m(m − b)))² (m − b)(m − q)G
≤ 4V/m + 4bG/m + 2b²(m − q)G/(m²(m − b)).

Thus, we have

E[F(x − γṽ̄)] − F(x) ≤ −(γ/2)‖∇F(x)‖² + (γ/2)[4V/m + 4bG/m + 2b²(m − q)G/(m²(m − b))] + (L + β)γ²G/2.

Following the same procedure as in Corollary 1, and taking γ = 1/(L√T), we have

(1/T) Σ_{t=0}^{T−1} E‖∇F(x^t)‖²
≤ 2L(F(x^0) − F(x^∗))/√T + 4V/m + 4bG/m + 2b²(m − q)G/(m²(m − b)) + (L + β)G/(L√T)
= O(1/√T) + O(b/m) + O(b²(m − q)/(m²(m − b))).
B. Additional Experiments
Figure 9. Convergence on non-i.i.d. training data, without failures. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. ρ = 0.0005. Learning rate γ = 0.05. Each epoch has 25 iterations. Panels: (a) top-1 accuracy on the testing set; (b) cross-entropy on the training set.
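The non-i.i.d. runs in Figures 9–11 (and Section 6.5) give each worker a disjoint local dataset. The paper does not spell out the partitioning code; the sketch below shows one straightforward way to produce such a split (contiguous index sharding), purely as an illustration.

```python
import numpy as np

def disjoint_shards(num_samples, num_workers):
    """Split sample indices into disjoint, near-equal shards, one per worker.

    Returns a list of index arrays; shard i is worker i's local dataset.
    With CIFAR-10 this splits the 50k training images into num_workers
    disjoint subsets (e.g., 2,500 images each for 20 workers).
    """
    return np.array_split(np.arange(num_samples), num_workers)

# Example: 20 workers over the 50k CIFAR-10 training images.
shards = disjoint_shards(50_000, 20)
assert sum(len(s) for s in shards) == 50_000
```

Each worker then draws its mini-batches only from its own shard, while the server, which holds the entire training set, can still evaluate f_r with respect to the global distribution, as assumed in Corollary 2.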
Figure 10. Convergence on non-i.i.d. training data, with label-flipping failures. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. ρ = 0.0005. Learning rate γ = 0.05. Each epoch has 25 iterations. Panels: (a) top-1 accuracy on the testing set and (b) cross-entropy on the training set with q = 8; (c), (d) the same with q = 12.
Figure 11. Convergence on non-i.i.d. training data, with bit-flipping failures. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. ρ = 0.0005. Learning rate γ = 0.05. Each epoch has 25 iterations. Panels: (a) top-1 accuracy on the testing set and (b) cross-entropy on the training set with q = 8; (c), (d) the same with q = 12.