
# Performance

### Network Topology

TensorWave GPU clusters are designed around two complementary networking optimizations that directly affect multi-GPU workload performance.

**Rail-optimized networking.** Each GPU has a dedicated RDMA NIC, giving it an exclusive high-bandwidth path for collective communications. With 8 GPUs per node, there are 8 independent rails (`rdma0`-`rdma7`), one per GPU. This means RCCL ring and tree algorithms can saturate all available bandwidth simultaneously without GPUs competing for shared NIC resources.

**Topology-aware scheduling.** Nodes are organized into physical pods, each sharing a top-of-rack network fabric. Slurm's tree topology plugin uses a `topology.conf` that maps nodes to pods and pods to a spine, allowing the scheduler to preferentially allocate nodes within the same pod for a given job. For multi-node jobs, this reduces cross-switch hops and keeps the majority of collective traffic on the lower-latency intra-pod fabric.

***

### RCCL All-Reduce Test

RCCL (ROCm Collective Communications Library) tests are pre-installed on compute nodes at `/opt/rccl-tests/`. The `all_reduce_perf` benchmark measures collective communication bandwidth across GPUs and is useful for validating interconnect performance and identifying nodes with degraded network throughput. A sample sbatch job driving `all_reduce_perf` is provided at `/opt/tw/examples/libexec/rccl.sbatch`.

***

#### Running the test

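Submit the job from the directory where you want the log written. The script's `--output` pattern writes `jid-<jobid>.name-rccl_tests.log` to the job's working directory, which defaults to the directory you submit from: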

```bash
sbatch /opt/tw/examples/libexec/rccl.sbatch
```

**`rccl.sbatch`:**

```bash
#!/bin/bash
#SBATCH --job-name=rccl_tests
#SBATCH --output=jid-%j.name-%x.log
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH --gpus-per-node=8
#SBATCH --time=01:00:00
#SBATCH --nodes=2

set -euxo pipefail

# Use 2 InfiniBand queue pairs per connection between ranks
export NCCL_IB_QPS_PER_CONNECTION=2

# Double buffer size for NCCL communications
export NCCL_BUFFSIZE=8388608

# Prevent MPI from using InfiniBand
export UCX_NET_DEVICES=eno0

srun /opt/rccl-tests/all_reduce_perf -b 512M -e 8G -f 2 -g 1
```

To run on more nodes, override the `--nodes` value at submission time:

```bash
sbatch --nodes <nnodes> /opt/tw/examples/libexec/rccl.sbatch
```
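
To follow a run from the login node, one possible sketch is to capture the job ID at submission and derive the log name from the script's `--output=jid-%j.name-%x.log` pattern (`%j` is the job ID, `%x` the job name). The helper name `rccl_log_name` is hypothetical, and the sketch assumes the job name stays `rccl_tests` and that the log lands in the submission directory.

```bash
#!/bin/bash
# Hypothetical helper (not part of the TensorWave examples): map a job ID to
# the log file name produced by --output=jid-%j.name-%x.log.
rccl_log_name() {
  local jobid="${1%%;*}"   # sbatch --parsable may print "jobid;cluster"
  printf 'jid-%s.name-rccl_tests.log\n' "$jobid"
}

# Typical use on a login node (requires Slurm, so shown as comments):
#   jobid=$(sbatch --parsable --nodes 4 /opt/tw/examples/libexec/rccl.sbatch)
#   squeue -j "${jobid%%;*}"               # is the job pending or running?
#   tail -f "$(rccl_log_name "$jobid")"    # follow the output once the job starts
```

The log file only appears after the job starts, so `tail -f` may need to wait until `squeue` reports the job as running.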

***

#### Script parameters

**Environment variables**

| Variable                     | Value      | Purpose                                                                                                                |
| ---------------------------- | ---------- | ---------------------------------------------------------------------------------------------------------------------- |
| `NCCL_IB_QPS_PER_CONNECTION` | `2`        | Increases InfiniBand queue pairs per connection, improving routing entropy and throughput.                             |
| `NCCL_BUFFSIZE`              | `8388608`  | Sets the RCCL communication buffer to 8 MiB. Larger buffers can improve performance at large message sizes.            |
| `UCX_NET_DEVICES`            | `eno0`     | Directs UCX control traffic over Ethernet, leaving InfiniBand dedicated to RCCL data traffic.                          |
| `NCCL_IB_GID_INDEX`          | `1` or `3` | Selects the GID index RCCL uses; the correct value depends on your cluster's NIC vendor. Not set in the sample script. |

**RCCL test arguments**

| Argument | Value  | Description                                                            |
| -------- | ------ | ---------------------------------------------------------------------- |
| `-b`     | `512M` | Minimum message size                                                   |
| `-e`     | `8G`   | Maximum message size                                                   |
| `-f`     | `2`    | Step factor (doubles each step: 512M, 1G, 2G, ..., 8G)                 |
| `-g`     | `1`    | GPUs per process; with `--ntasks-per-node=8`, each task drives one GPU |
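
As a rough illustration of how `-b`, `-e`, and `-f` combine, the sketch below lists the message sizes the benchmark steps through. It assumes the `M` and `G` suffixes are powers of 1024, which matches the byte counts in the sample output later on this page. `rccl_sizes` is a hypothetical helper name, not part of rccl-tests.

```bash
#!/bin/bash
# Hypothetical helper: print each message size from min to max (in bytes),
# multiplying by the step factor each time. The factor must be greater than 1.
rccl_sizes() {
  local size=$1 max=$2 factor=$3
  while [ "$size" -le "$max" ]; do
    echo "$size"
    size=$((size * factor))
  done
}

# Equivalent of -b 512M -e 8G -f 2
rccl_sizes $((512 * 1024 ** 2)) $((8 * 1024 ** 3)) 2
```

This prints five sizes, from 536870912 to 8589934592 bytes, one per row of the benchmark's results table.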

***

#### Reading the output

A successful run completes without errors and shows bus bandwidth rising with message size until it levels off near a peak. The key fields are described after the sample output.

#### Results on a 4-node MI355X Cluster

```
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
   536870912     134217728     float     sum      -1   2945.0  182.30  353.20      0   2947.0  182.17  352.96      0
  1073741824     268435456     float     sum      -1   5452.6  196.92  381.54      0   5446.2  197.16  381.99      0
  2147483648     536870912     float     sum      -1    10806  198.72  385.03      0    10817  198.52  384.63      0
  4294967296    1073741824     float     sum      -1    21843  196.63  380.97      0    21846  196.60  380.91      0
  8589934592    2147483648     float     sum      -1    44986  190.95  369.96      0    44930  191.19  370.42      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 374.161
#
# Collective test concluded: all_reduce_perf
```

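* **`algbw`**: algorithm bandwidth, the message size divided by the elapsed time. Reflects how quickly one collective operation completes.
* **`busbw`**: bus bandwidth, `algbw` scaled by a factor that depends on the collective and the number of ranks. For all-reduce across `n` ranks the factor is `2(n-1)/n`. Better reflects hardware link utilization and is comparable across rank counts.
* **`#wrong`**: should be `0`. Any non-zero value indicates a data correctness error.
* **`Avg bus bandwidth`**: average `busbw` across all message sizes, covering both the out-of-place results (first `time`/`algbw`/`busbw`/`#wrong` group) and the in-place results (second group). Useful as a single summary figure for comparison.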

A healthy cluster shows `busbw` increasing steadily with message size and leveling off at a stable peak at larger sizes. A run that includes a node with a degraded interconnect shows lower `busbw` or fails to complete. Target `busbw` values depend on your cluster architecture.
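
The checks described above can be scripted against a finished log. The following is a minimal sketch, assuming the 13-column row layout shown in the sample output (the two `#wrong` values in columns 9 and 13) and the `# Avg bus bandwidth` summary line. The function name `check_rccl_log` is hypothetical and not part of rccl-tests or the TensorWave examples.

```bash
#!/bin/bash
# Hypothetical helper: summarize an all_reduce_perf log and fail if any
# row reports a correctness error or if no result rows are found.
check_rccl_log() {
  local log="$1"
  awk '
    # Result rows start with a numeric message size and have 13 fields.
    $1 ~ /^[0-9]+$/ && NF == 13 {
      rows++
      if ($9 != 0 || $13 != 0) bad++   # out-of-place and in-place #wrong
    }
    /^# Avg bus bandwidth/ { avg = $NF }
    END {
      printf "rows=%d wrong_rows=%d avg_busbw=%s\n", rows, bad, avg
      exit (bad > 0 || rows == 0)
    }
  ' "$log"
}

# Example (replace <jobid> with the ID printed by sbatch):
#   check_rccl_log "jid-<jobid>.name-rccl_tests.log"
```

The function returns a non-zero status when a row has a non-zero `#wrong` count or when the log contains no result rows, so it can gate a follow-up step in a validation script. Comparing `avg_busbw` against a threshold is left out because target values depend on the cluster.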


