8 June 2024
Network communication optimization of RCCL communication library in Multi-NIC systems
Shuaiming He, Wei Wan, Junhong Li
Proceedings Volume 13171, Third International Conference on Algorithms, Microchips, and Network Applications (AMNA 2024); 131711Q (2024) https://doi.org/10.1117/12.3031956
Event: 3rd International Conference on Algorithms, Microchips and Network Applications (AMNA 2024), 2024, Jinan, China
Abstract
With the widespread adoption of deep learning frameworks, large-scale computing and GPU programming are receiving increased attention. For upper-layer applications that rely on GPUs for computation and communication, such as TensorFlow and PyTorch, the efficiency of the underlying communication library largely determines overall framework performance. One such library is RCCL (ROCm Communication Collectives Library), the GPU communication library provided by the ROCm (Radeon Open Compute) platform, which supports various collective and point-to-point communication operations. Through analysis, we identified a problem in how RCCL initializes and uses its ring channel network on multi-NIC systems: certain network cards are never used for communication, which wastes system resources. To address this, we optimize at the code level, introducing data structures and algorithms that control which network cards are invoked. The goal is to adjust the multi-NIC usage strategy of the ring channel network without changing RCCL's original design. We evaluated the optimized library extensively on a large-scale GPU cluster, where it achieved significant improvements in communication performance. At a scale of 16 compute nodes and 64 GPUs, peak bandwidth increased from 5.28 GB/s to 7.78 GB/s (about 47%). In inter-node collective communication tests, the performance improvement reached up to 60%. The improved RCCL library thus provides better low-level communication performance for upper-layer applications on the ROCm platform.
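The abstract does not spell out the data structures or algorithm used, so the following is only a minimal sketch of the general idea of spreading ring channels across all available network cards. The function names, the NIC labels, and the "naive" policy that pins every channel to the first NIC are assumptions made for illustration; none of them are RCCL APIs or the authors' actual method.

```python
from typing import Dict, List


def assign_nics_round_robin(num_channels: int, nics: List[str]) -> Dict[int, str]:
    """Map each ring channel index to a NIC, cycling through all NICs.

    Hypothetical helper: illustrates one possible balancing policy only.
    """
    if not nics:
        raise ValueError("at least one NIC is required")
    return {ch: nics[ch % len(nics)] for ch in range(num_channels)}


def idle_nics(assignment: Dict[int, str], nics: List[str]) -> List[str]:
    """Return the NICs that no channel is assigned to, in their original order."""
    used = set(assignment.values())
    return [n for n in nics if n not in used]


# Hypothetical node with four NICs and four ring channels.
nics = ["nic0", "nic1", "nic2", "nic3"]

# Assumed problematic policy: every channel is pinned to the first NIC.
naive = {ch: nics[0] for ch in range(4)}

# Round-robin policy: channel i uses NIC i mod len(nics).
balanced = assign_nics_round_robin(4, nics)

print(idle_nics(naive, nics))     # ['nic1', 'nic2', 'nic3']
print(idle_nics(balanced, nics))  # []
```

Under these assumptions, the pinned policy leaves three of four NICs idle, while the round-robin mapping uses every NIC once the number of channels is at least the number of NICs.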
(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Shuaiming He, Wei Wan, and Junhong Li "Network communication optimization of RCCL communication library in Multi-NIC systems", Proc. SPIE 13171, Third International Conference on Algorithms, Microchips, and Network Applications (AMNA 2024), 131711Q (8 June 2024); https://doi.org/10.1117/12.3031956
KEYWORDS
Data communications

Education and training

Computing systems

Mathematical optimization

Network architectures

Systems modeling

Computer architecture
