
Performance

Network Topology

TensorWave GPU clusters are designed around two complementary networking optimizations that directly affect multi-GPU workload performance.

Rail-optimized networking. Each GPU has a dedicated RDMA NIC, giving it an exclusive high-bandwidth path for collective communications. With 8 GPUs per node, there are 8 independent rails (rdma0-rdma7), one per GPU. This means RCCL ring and tree algorithms can saturate all available bandwidth simultaneously without GPUs competing for shared NIC resources.

Topology-aware scheduling. Nodes are organized into physical pods, each sharing a top-of-rack network fabric. Slurm's tree topology plugin uses a topology.conf that maps nodes to pods and pods to a spine, allowing the scheduler to preferentially allocate nodes within the same pod for a given job. For multi-node jobs, this reduces cross-switch hops and keeps the majority of collective traffic on the lower-latency intra-pod fabric.
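As a rough illustration of the pod-to-spine mapping described above, a topology.conf for Slurm's tree topology plugin might look like the following. The switch names, node names, and pod sizes are placeholders invented for this sketch and will differ on your cluster.

```text
# Hypothetical topology.conf: all names and node ranges are placeholders.
# Each pod is a leaf switch listing the nodes attached to it.
SwitchName=pod1 Nodes=node[01-08]
SwitchName=pod2 Nodes=node[09-16]
# The spine connects the pod switches.
SwitchName=spine Switches=pod[1-2]
```

With a layout like this, a job that fits within one pod can be placed entirely under a single leaf switch. If a job must stay within one pod, sbatch's --switches option can request a maximum switch count, though whether that is appropriate depends on how your cluster is configured.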


RCCL All-Reduce Test

RCCL (ROCm Collective Communications Library) tests are pre-installed on compute nodes at /opt/rccl-tests/. The all_reduce_perf benchmark measures collective communication bandwidth across GPUs and is useful for validating interconnect performance and identifying nodes with degraded network throughput. A sample sbatch job driving all_reduce_perf is provided at /opt/tw/examples/libexec/rccl.sbatch.


Running the test

Submit the job from the directory where you want the log written. The log file is named jid-<jobid>.name-rccl_tests.log:

sbatch /opt/tw/examples/libexec/rccl.sbatch

**rccl.sbatch:**

#!/bin/bash
#SBATCH --job-name=rccl_tests
#SBATCH --output=jid-%j.name-%x.log
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH --gpus-per-node=8
#SBATCH --time=01:00:00
#SBATCH --nodes=2

set -euxo pipefail

# Use 2 InfiniBand queue pairs per connection between ranks
export NCCL_IB_QPS_PER_CONNECTION=2

# Set the communication buffer to 8 MiB (double the 4 MiB default)
export NCCL_BUFFSIZE=8388608

# Prevent MPI from using InfiniBand
export UCX_NET_DEVICES=eno0

srun /opt/rccl-tests/all_reduce_perf -b 512M -e 8G -f 2 -g 1

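To run on more nodes, override the --nodes value at submission time. Options given on the sbatch command line take precedence over the #SBATCH directives in the script. For example, to run on 4 nodes:

sbatch --nodes=4 /opt/tw/examples/libexec/rccl.sbatch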


Script parameters

Environment variables

| Variable | Value | Purpose |
| --- | --- | --- |
| NCCL_IB_QPS_PER_CONNECTION | 2 | Increases InfiniBand queue pairs per connection, improving routing entropy and throughput. |
| NCCL_BUFFSIZE | 8388608 | Sets the RCCL communication buffer to 8 MiB. Larger buffers can improve performance at high message sizes. |
| UCX_NET_DEVICES | eno0 | Directs UCX control traffic over Ethernet, leaving InfiniBand dedicated to RCCL data traffic. |
| NCCL_IB_GID_INDEX | 1 or 3 | Specifies which GID index RCCL should use. The correct value depends on the NIC vendor of your cluster. This variable is not set in the script shown above. |

RCCL test arguments

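| Argument | Value | Description |
| --- | --- | --- |
| -b | 512M | Minimum message size |
| -e | 8G | Maximum message size |
| -f | 2 | Step factor (doubles each step: 512M, 1G, 2G, 4G, 8G) |
| -g | 1 | GPUs per process. The script launches 8 tasks per node, so each task drives one GPU. |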
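The sweep defined by -b, -e, and -f can be enumerated with a short shell sketch. It assumes the M and G suffixes are binary multiples (MiB and GiB). The list_sizes helper is hypothetical and not part of rccl-tests.

```shell
#!/bin/bash
# Enumerate the message sizes swept by: all_reduce_perf -b 512M -e 8G -f 2
# list_sizes is a hypothetical helper; sizes are tracked in MiB for readability.
list_sizes() {
  local size_mib=$1 max_mib=$2 factor=$3
  while [ "$size_mib" -le "$max_mib" ]; do
    if [ "$size_mib" -ge 1024 ]; then
      echo "$((size_mib / 1024))G"
    else
      echo "${size_mib}M"
    fi
    # Multiply by the step factor to get the next message size.
    size_mib=$((size_mib * factor))
  done
}

# -b 512M -e 8G -f 2
list_sizes 512 8192 2
```

This prints five sizes, one per line (512M, 1G, 2G, 4G, 8G), so the sample job reports five rows of results.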


Reading the output

A successful run completes without errors and shows increasing bus bandwidth as message size grows.

[Figure: Results on a 4-node MI355X cluster]

Key fields in the output:

  • **algbw** — algorithm bandwidth: message size divided by time. Reflects how quickly one collective operation completes.

  • **busbw** — bus bandwidth: algbw corrected for the number of ranks. Better reflects peak hardware utilization.

  • **#wrong** — should be 0. Any non-zero value indicates a data correctness error.

  • **Avg bus bandwidth** — average busbw across all message sizes. Useful as a single summary figure for comparison.

A healthy cluster shows busbw increasing steadily with message size and leveling off at a stable peak at larger sizes. Nodes with a degraded interconnect show lower busbw or fail to complete. Target busbw values depend on your cluster architecture.
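For all-reduce, the correction from algbw to busbw is a factor of 2(n-1)/n, where n is the number of ranks. This is the convention documented for nccl-tests, which rccl-tests is assumed to follow here. The sketch below applies it to the sample job's 16 ranks (2 nodes with 8 tasks each). The allreduce_busbw helper and the algbw value of 100 GB/s are hypothetical and chosen only to show the arithmetic.

```shell
#!/bin/bash
# Convert all-reduce algorithm bandwidth to bus bandwidth:
#   busbw = algbw * 2 * (n - 1) / n, with n the number of ranks.
# allreduce_busbw is a hypothetical helper, not part of rccl-tests.
allreduce_busbw() {
  local algbw=$1 ranks=$2
  awk -v a="$algbw" -v n="$ranks" 'BEGIN { printf "%.2f\n", a * 2 * (n - 1) / n }'
}

# 2 nodes x 8 tasks per node = 16 ranks; 100 GB/s is an illustrative algbw.
ranks=$((2 * 8))
allreduce_busbw 100 "$ranks"   # prints 187.50
```

The factor approaches 2 as the rank count grows, so busbw values are comparable across runs with different node counts while algbw values are not.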
