Distributed Graph Neural Network Training: A Survey

Shao, Yingxia; Li, Hongzheng; Gu, Xizhi; Yin, Hongbo; Li, Yawen; Miao, Xupeng; Zhang, Wentao; Cui, Bin; Chen, Lei

Published in

Association for Computing Machinery (ACM), ACM Computing Surveys, 56(8), p. 1-39, 2024

DOI: 10.1145/3648358

Abstract

Graph neural networks (GNNs) are deep learning models that are trained on graphs and have been successfully applied in various domains. Despite their effectiveness, GNNs remain difficult to scale efficiently to large graphs. Distributed computing is a promising remedy for training large-scale GNNs, since it provides abundant computing resources. However, the dependencies imposed by graph structure make efficient distributed GNN training difficult: it suffers from massive communication and workload imbalance. In recent years, many efforts have been made on distributed GNN training, and an array of training algorithms and systems have been proposed. Yet a systematic review of the optimization techniques for the distributed execution of GNN training is still lacking. In this survey, we analyze three major challenges in distributed GNN training: massive feature communication, the loss of model accuracy, and workload imbalance. We then introduce a new taxonomy of the optimization techniques in distributed GNN training that address these challenges. The taxonomy classifies existing techniques into four categories: GNN data partition, GNN batch generation, GNN execution model, and GNN communication protocol. We carefully discuss the techniques in each category. In the conclusion, we summarize existing distributed GNN systems for multi-GPU (graphics processing unit) machines, GPU clusters, and CPU (central processing unit) clusters, respectively, and discuss future directions for distributed GNN training.
