Performance
Network Topology
RCCL All-Reduce Test
Running the test
sbatch /opt/tw/examples/libexec/rccl.sbatch
#!/bin/bash
#SBATCH --job-name=rccl_tests
#SBATCH --output=jid-%j.name-%x.log
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH --gpus-per-node=8
#SBATCH --time=01:00:00
#SBATCH --nodes=2
set -euxo pipefail
# Use 2 InfiniBand queue pairs per connection between ranks
export NCCL_IB_QPS_PER_CONNECTION=2
# Double buffer size for NCCL communications
export NCCL_BUFFSIZE=8388608
# Restrict UCX (used by MPI) to the eno0 interface so MPI does not use InfiniBand
export UCX_NET_DEVICES=eno0
srun /opt/rccl-tests/all_reduce_perf -b 512M -e 8G -f 2 -g 1
Script parameters
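Variable | Value | Purpose
NCCL_IB_QPS_PER_CONNECTION | 2 | Use 2 InfiniBand queue pairs per connection between ranks
NCCL_BUFFSIZE | 8388608 | Double the buffer size for NCCL communications (8 MiB)
UCX_NET_DEVICES | eno0 | Prevent MPI from using InfiniBand

Argument | Value | Description
-b | 512M | Minimum message size
-e | 8G | Maximum message size
-f | 2 | Multiplication factor between successive message sizes
-g | 1 | Number of GPUs per thread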
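The size sweep implied by -b, -e and -f can be previewed before submitting the job. The sketch below is illustrative only: the helper name sweep_sizes is hypothetical, and it assumes the M and G suffixes are binary (powers of 1024), which is how the test binary is understood to parse them.

```shell
#!/bin/bash
# Hypothetical helper: print each message size (in bytes) that a sweep
# from min_bytes to max_bytes with the given multiplication factor visits.
sweep_sizes() {
  local min_bytes="$1" max_bytes="$2" factor="$3"
  # A factor of 1 or less would never reach max_bytes, so reject it.
  if [ "$factor" -le 1 ]; then
    echo "factor must be greater than 1" >&2
    return 1
  fi
  local size="$min_bytes"
  while [ "$size" -le "$max_bytes" ]; do
    echo "$size"
    size=$(( size * factor ))
  done
}

# Assumed equivalent of: -b 512M -e 8G -f 2
sweep_sizes $(( 512 * 1024 * 1024 )) $(( 8 * 1024 * 1024 * 1024 )) 2
```

Under these assumptions the sweep covers five sizes: 512M, 1G, 2G, 4G and 8G.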
Reading the output
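The test reports two bandwidth figures per message size: algorithm bandwidth (algbw) and bus bandwidth (busbw). For all-reduce, the upstream test documentation relates them as busbw = algbw * 2 * (n - 1) / n, where n is the number of ranks. The sketch below reproduces that arithmetic; the function name allreduce_busbw is hypothetical, and the input of 100 GB/s is a made-up illustrative value, not a measurement from this cluster.

```shell
#!/bin/bash
# Hypothetical helper: convert all-reduce algorithm bandwidth to bus
# bandwidth using busbw = algbw * 2 * (n - 1) / n.
allreduce_busbw() {
  local algbw="$1" nranks="$2"
  awk -v a="$algbw" -v n="$nranks" 'BEGIN { printf "%.2f\n", a * 2 * (n - 1) / n }'
}

# The script above launches 2 nodes x 8 tasks per node = 16 ranks.
# 100 GB/s is an illustrative algbw, not a measured result.
allreduce_busbw 100 16   # prints 187.50
```

Because the factor 2 * (n - 1) / n approaches 2 as n grows, busbw is the figure to compare across runs with different rank counts.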
Results on a 4-node MI355X Cluster

