Exascale Deep Learning for Scientific Inverse Problems

Laanait, Nouamane; Romero, Joshua; Yin, Junqi; Young, M. Todd; Treichler, Sean; Starchenko, Vitalii; Borisevich, Albina; Sergeev, Alex; Matheson, Michael

arXiv:1909.11150 (cs)

[Submitted on 24 Sep 2019]

Abstract: We introduce novel communication strategies for synchronous distributed deep learning, consisting of decentralized orchestration of gradient reduction and computational graph-aware grouping of gradient tensors. These techniques produce an optimal overlap between computation and communication and result in near-linear scaling (0.93) of distributed training up to 27,600 NVIDIA V100 GPUs on the Summit supercomputer. We demonstrate our gradient reduction techniques by training a fully convolutional neural network to approximate the solution of a longstanding scientific inverse problem in materials imaging. Efficient distributed training on a 0.5 PB dataset produces a model capable of atomically accurate reconstruction of materials and, in the process, reaches a peak performance of 2.15(4) EFLOPS$_{16}$.
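To make the idea of grouping gradient tensors concrete, the following is a minimal sketch of one generic way such grouping can work: gradients are packed, in the order the backward pass produces them, into groups bounded by a byte budget, so that each group could be reduced with a single collective while later layers are still computing. This is an illustration under stated assumptions, not the authors' implementation; the names `Grad`, `group_gradients`, and `max_group_bytes`, as well as the toy layer names and sizes, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Grad:
    """Hypothetical stand-in for a gradient tensor: a name and its size in bytes."""
    name: str
    nbytes: int


def group_gradients(grads_in_backward_order: List[Grad],
                    max_group_bytes: int) -> List[List[Grad]]:
    """Greedily pack gradients into groups of at most max_group_bytes.

    The input order is assumed to be the order in which the backward pass
    produces the gradients, so each group becomes ready as a unit.
    A single gradient larger than the budget gets a group of its own.
    """
    groups: List[List[Grad]] = []
    current: List[Grad] = []
    current_bytes = 0
    for g in grads_in_backward_order:
        # Close the current group if adding this gradient would exceed the budget.
        if current and current_bytes + g.nbytes > max_group_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(g)
        current_bytes += g.nbytes
    if current:
        groups.append(current)
    return groups


# Toy example: the backward pass visits layers from last to first.
grads = [
    Grad("conv3.w", 400),
    Grad("conv3.b", 8),
    Grad("conv2.w", 300),
    Grad("conv2.b", 8),
    Grad("conv1.w", 900),
]
groups = group_gradients(grads, max_group_bytes=512)
group_names = [[g.name for g in grp] for grp in groups]
print(group_names)
```

In a real training loop, each group would typically be handed to an all-reduce as soon as its last gradient is available. The byte budget is an assumed tuning knob that trades fewer, larger messages against how early communication can start.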

Comments:	13 pages, 9 figures. Under review by the Systems and Machine Learning (SysML) Conference (SysML '20)
Subjects:	Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Distributed, Parallel, and Cluster Computing (cs.DC); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
Cite as:	arXiv:1909.11150 [cs.LG]
	(or arXiv:1909.11150v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1909.11150

Submission history

From: Nouamane Laanait
[v1] Tue, 24 Sep 2019 19:40:59 UTC (8,214 KB)
