Boosting Distributed Full-graph GNN Training with Asynchronous One-bit Communication

Zhang, Meng; Hu, Qinghao; Sun, Peng; Wen, Yonggang; Zhang, Tianwei

arXiv:2303.01277 (cs)

[Submitted on 2 Mar 2023]

Abstract: Training Graph Neural Networks (GNNs) on large graphs is challenging because their high memory demand conflicts with limited GPU memory. Recently, distributed full-graph GNN training has been widely adopted to tackle this problem. However, the substantial inter-GPU communication overhead can cause severe throughput degradation. Existing communication compression techniques mainly target traditional DNN training, whose bottleneck lies in synchronizing gradients and parameters. We find that they do not work well in distributed GNN training, where the bottleneck is the layer-wise communication of features during the forward pass and of feature gradients during the backward pass. To this end, we propose Sylvie, an efficient distributed GNN training framework that applies a one-bit quantization technique to GNNs and further pipelines the reduced communication with computation to greatly shrink the overhead while maintaining model quality. In detail, Sylvie provides a lightweight Low-bit Module that quantizes the sent data and dequantizes the received data back to full-precision values in each layer. Additionally, we propose a Bounded Staleness Adaptor to control the introduced staleness and achieve further performance improvement. We conduct a theoretical convergence analysis and extensive experiments on various models and datasets to demonstrate that Sylvie can boost training throughput by up to 28.1x.
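To make the quantize-then-dequantize idea concrete, the following is a minimal sketch of one generic form of one-bit quantization. It is not Sylvie's Low-bit Module, whose exact scheme the abstract does not specify. The function names `quantize_one_bit` and `dequantize_one_bit`, the use of the mean absolute value as the scale, and the bit-packing layout are all assumptions made for illustration.

```python
from typing import List, Tuple


def quantize_one_bit(values: List[float]) -> Tuple[bytes, float, int]:
    """Keep only the sign of each value (one bit) plus a single float scale."""
    n = len(values)
    # Assumed scale: mean absolute value of the vector (a common choice
    # for sign-based quantization, not confirmed by the abstract).
    scale = sum(abs(v) for v in values) / n if n else 0.0
    packed = bytearray((n + 7) // 8)  # 8 values per byte
    for i, v in enumerate(values):
        if v >= 0:
            packed[i // 8] |= 1 << (i % 8)  # bit set means non-negative
    return bytes(packed), scale, n


def dequantize_one_bit(packed: bytes, scale: float, n: int) -> List[float]:
    """Expand each bit back to +scale or -scale as a full-precision value."""
    out = []
    for i in range(n):
        bit = (packed[i // 8] >> (i % 8)) & 1
        out.append(scale if bit else -scale)
    return out


# Hypothetical feature vector that one worker would send to another.
features = [0.5, -1.5, 2.0, -1.0]
packed, scale, n = quantize_one_bit(features)
restored = dequantize_one_bit(packed, scale, n)
```

In this sketch the sender transmits one bit per value plus one scale, instead of 32 bits per value, and the receiver rebuilds an approximation whose entries all have magnitude `scale`.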

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2303.01277 [cs.DC]
	(or arXiv:2303.01277v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2303.01277

Submission history

From: Meng Zhang
[v1] Thu, 2 Mar 2023 14:02:39 UTC (549 KB)
