Straggler Mitigation in Distributed Matrix Multiplication: Fundamental Limits and Optimal Coding

Yu, Qian; Maddah-Ali, Mohammad Ali; Avestimehr, A. Salman

doi:10.1109/TIT.2019.2963864

arXiv:1801.07487 (cs)

[Submitted on 23 Jan 2018 (v1), last revised 9 Apr 2020 (this version, v5)]

Abstract: We consider the problem of massive matrix multiplication, which underlies many data analytic applications, in a large-scale distributed system comprising a group of worker nodes. We target the stragglers' delay performance bottleneck, which is due to the unpredictable latency in waiting for the slowest nodes (or stragglers) to finish their tasks. We propose a novel coding strategy, named \emph{entangled polynomial code}, for designing the intermediate computations at the worker nodes in order to minimize the recovery threshold (i.e., the number of workers that we need to wait for in order to compute the final output). We demonstrate the optimality of the entangled polynomial code in several cases, and show that it provides an orderwise improvement over the conventional schemes for straggler mitigation. Furthermore, we characterize the optimal recovery threshold among all linear coding strategies within a factor of $2$ using \emph{bilinear complexity}, by developing an improved version of the entangled polynomial code. In particular, while evaluating bilinear complexity is a well-known challenging problem, we show that the optimal recovery threshold for linear coding strategies can be approximated within a factor of $2$ of this fundamental quantity. Moreover, the improved version of the entangled polynomial code enables a further, orderwise reduction in the recovery threshold compared to its basic version. Finally, we show that the techniques developed in this paper can also be extended to several other problems, such as coded convolution and fault-tolerant computing, leading to tight characterizations.
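To make the notion of a recovery threshold concrete, the following is a minimal sketch of a polynomial-coded computation of $C = A^{\top} B$. Several things here are assumptions not stated in the abstract: the block partition parameters `p`, `m`, `n`, the encoding exponents, and the threshold `p*m*n + p - 1` follow the basic entangled polynomial construction as recalled from the full paper, and the sketch uses real-valued arithmetic with a Vandermonde solve rather than a finite field. The function name, block size `blk`, and evaluation points are hypothetical choices made for illustration only.

```python
import numpy as np

def entangled_polynomial_demo(p=2, m=2, n=2, num_workers=12, blk=2, seed=0):
    """Recover C = A^T B from any p*m*n + p - 1 of num_workers coded results."""
    rng = np.random.default_rng(seed)
    # A is split into p x m blocks, B into p x n blocks (each blk x blk).
    A = rng.standard_normal((p * blk, m * blk))
    B = rng.standard_normal((p * blk, n * blk))

    def block(M, j, k):
        return M[j * blk:(j + 1) * blk, k * blk:(k + 1) * blk]

    threshold = p * m * n + p - 1  # assumed recovery threshold of the basic code
    assert num_workers >= threshold
    xs = np.linspace(-1.0, 1.0, num_workers)  # one distinct point per worker

    # Each worker multiplies one coded block of A by one coded block of B.
    results = []
    for x in xs:
        A_enc = sum(block(A, j, k) * x ** (j + k * p)
                    for j in range(p) for k in range(m))
        B_enc = sum(block(B, j, k) * x ** (p - 1 - j + k * p * m)
                    for j in range(p) for k in range(n))
        results.append(A_enc.T @ B_enc)

    # Pretend only `threshold` workers finished; the rest are stragglers.
    done = np.sort(rng.choice(num_workers, size=threshold, replace=False))

    # Each result is a degree-(threshold - 1) polynomial evaluated at xs[i],
    # so the coefficients follow from a Vandermonde linear system.
    V = np.vander(xs[done], N=threshold, increasing=True)
    Y = np.stack([results[i].ravel() for i in done])
    coeffs = np.linalg.solve(V, Y)  # row d holds the coefficient of x^d

    # Block (k, k2) of C is the coefficient of x^(p - 1 + k*p + k2*p*m).
    C = np.block([[coeffs[p - 1 + k * p + k2 * p * m].reshape(blk, blk)
                   for k2 in range(n)] for k in range(m)])
    return A, B, C, threshold
```

Calling `entangled_polynomial_demo()` returns the inputs, the decoded product, and the number of workers that were waited for; the decoded `C` can be compared against `A.T @ B`. Real-valued Vandermonde systems become ill-conditioned as the threshold grows, which is one reason a practical implementation would likely work over a finite field instead.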

Subjects:	Information Theory (cs.IT); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:1801.07487 [cs.IT]
	(or arXiv:1801.07487v5 [cs.IT] for this version)
	https://doi.org/10.48550/arXiv.1801.07487
Journal reference:	Published in: IEEE Transactions on Information Theory (Jan. 2020)
Related DOI:	https://doi.org/10.1109/TIT.2019.2963864

Submission history

From: Qian Yu
[v1] Tue, 23 Jan 2018 11:34:59 UTC (489 KB)
[v2] Sat, 5 May 2018 08:39:27 UTC (479 KB)
[v3] Wed, 18 Dec 2019 12:50:21 UTC (390 KB)
[v4] Wed, 22 Jan 2020 10:45:56 UTC (389 KB)
[v5] Thu, 9 Apr 2020 17:50:03 UTC (388 KB)
