
# Network Drivers

### How to get Docker images up and running on TensorWave clusters

Running multi-node containerized applications requires two steps: 1. enable host-network passthrough, and 2. make sure the proper network drivers exist in the image.

Enabling host network passthrough is straightforward. When starting a Docker container, add `--network host` to the command, for example `docker run --network host rocm/pytorch:latest`. When configuring Kubernetes, add `hostNetwork: true` to the pod spec. Here's an example of a pod with host networking enabled:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: host-network-example
spec:
  hostNetwork: true # <=== this enables host network
  containers:
  - name: pytorch
    image: tensorwavehq/pytorch:latest
```
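
One way to confirm that passthrough took effect is to compare the interface names visible on the host with those visible inside the container; with host networking the two sets should match. The following is a minimal sketch, not part of TensorWave's tooling: `normalize_ifaces` and `same_interfaces` are hypothetical helper names, and reading `/sys/class/net` assumes a Linux host and image.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for comparing two lists of network interface names.

# Normalize a whitespace-separated list: one name per line, sorted, no blanks.
normalize_ifaces() {
  printf '%s\n' "$1" | tr -s '[:space:]' '\n' | sed '/^$/d' | sort -u
}

# Succeed when both lists name the same set of interfaces.
same_interfaces() {
  [ "$(normalize_ifaces "$1")" = "$(normalize_ifaces "$2")" ]
}
```

A possible invocation is `same_interfaces "$(ls /sys/class/net)" "$(docker run --rm --network host rocm/pytorch:latest ls /sys/class/net)" && echo "host network visible"`. Without `--network host`, the container would typically list only `lo` and a virtual `eth0`, so the comparison would fail.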

If you're using Apptainer or Pyxis on TensorWave's Slurm platform, host networking is configured by default. However, you may still need to ensure the proper drivers are installed in your image. The sections below provide driver installation instructions for our two platforms.

> **Note** If you're using Apptainer in Slurm, you can use TensorWave's provided container device interface (CDI) spec to mount the worker pod's network software into the container. See the CDI section below for details.

### AINIC (MI355X)

The MI355X nodes use AMD Pollara NICs for the backend network. AMD distributes the NIC software as prebuilt binary packages. To install it, add the upstream apt source and install the appropriate packages with apt.

Below, we provide a sample Dockerfile that installs the AINIC drivers inside AMD's ROCm PyTorch image.

Copy the file to your local machine and run `docker build -f <dockerfile> -t rocm-pytorch-ainic .`. You can also add this section as a build stage in your own Dockerfile build pipeline: add it to the top of your Dockerfile and update the appropriate `FROM <image> AS <build-stage>` lines to fold it into the pipeline.

```dockerfile
FROM rocm/pytorch:latest AS rocm-pytorch-ainic
ENV DEBIAN_FRONTEND=noninteractive
WORKDIR /tmp

# Package repository and AINIC driver release to install.
ARG REPO_URL=https://repo.radeon.com
ARG DRIVERS_VERSION=1.117.5-a-56

# Heredoc RUN blocks require BuildKit.
RUN <<EOR
set -eux

# Read the Ubuntu codename (for example, jammy) from the base image.
UBUNTU_CODENAME=$(awk -F= '/^UBUNTU_CODENAME=/{gsub(/"/,"",$2); print $2}' /etc/os-release)

# Install the repository signing key.
mkdir --parents --mode=0755 /etc/apt/keyrings
wget ${REPO_URL}/rocm/rocm.gpg.key -O - | gpg --dearmor | tee /etc/apt/keyrings/rocm.gpg > /dev/null

# Register the AINIC apt source (deb822 format).
echo "Types: deb" > /etc/apt/sources.list.d/amdainic.sources
echo "URIs: ${REPO_URL}/amdainic/pensando/ubuntu/${DRIVERS_VERSION}" >> /etc/apt/sources.list.d/amdainic.sources
echo "Suites: ${UBUNTU_CODENAME}" >> /etc/apt/sources.list.d/amdainic.sources
echo "Components: main" >> /etc/apt/sources.list.d/amdainic.sources
echo "Signed-By: /etc/apt/keyrings/rocm.gpg" >> /etc/apt/sources.list.d/amdainic.sources

# Install the RDMA userspace stack and the ionic verbs provider.
apt-get update && apt-get install -y bc jq libibverbs-dev rdma-core ibverbs-utils libionic-dev libionic1 perftest libfmt-dev

EOR
```
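
The `echo` lines in the Dockerfile above assemble a deb822-style apt source from three inputs: the repository URL, the driver version, and the Ubuntu codename of the base image. As a rough sketch of the same logic outside a build, the stanza can be rendered to stdout and inspected before it is written anywhere. The function name `render_ainic_source` and its argument order are illustrative and not part of any AMD or TensorWave tool.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the apt source stanza that the Dockerfile writes.
# Arguments: <os-release file> <repo URL> <drivers version>
render_ainic_source() {
  local os_release=$1 repo_url=$2 drivers_version=$3
  local codename

  # Same extraction as the Dockerfile: strip optional quotes from the value.
  codename=$(awk -F= '/^UBUNTU_CODENAME=/{gsub(/"/,"",$2); print $2}' "$os_release")
  if [ -z "$codename" ]; then
    echo "UBUNTU_CODENAME not found in $os_release" >&2
    return 1
  fi

  cat <<EOF
Types: deb
URIs: ${repo_url}/amdainic/pensando/ubuntu/${drivers_version}
Suites: ${codename}
Components: main
Signed-By: /etc/apt/keyrings/rocm.gpg
EOF
}
```

Running `render_ainic_source /etc/os-release https://repo.radeon.com 1.117.5-a-56` inside the base image prints the stanza for that image's Ubuntu release, which makes it easy to spot a base image whose codename has no matching repository suite.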

### Broadcom (MI325X & MI300X)

TensorWave MI300X and MI325X clusters use Broadcom Ethernet cards for the backend network. These require building Broadcom's driver from source.

The following Dockerfile builds the Broadcom drivers into AMD's ROCm PyTorch image. It can be used from a Slurm login node. Copy the file to your login environment and run `docker build -f <dockerfile> -t rocm-pytorch-bnxt .`. If you don't have access to TensorWave's Slurm platform and need the libbnxt\_re tarfile, reach out to your cluster administrator.

```dockerfile
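FROM rocm/pytorch:latest AS rocm-pytorch-bnxt
ENV DEBIAN_FRONTEND=noninteractive
WORKDIR /tmp

# Broadcom userspace RDMA library release to build.
ARG BNXT_VER=233.0.152.2
ARG BNXT_NAME=libbnxt_re-${BNXT_VER}

# COPY sources are resolved relative to the build context.
COPY /opt/tw/drivers/${BNXT_NAME}.tar.gz /tmp/bnxt-drivers/

# Heredoc RUN blocks require BuildKit.
RUN <<EOR
set -eux
# Print the build environment for debugging.
export
# Install the build toolchain and RDMA development packages.
apt-get -qq update
apt-get -qq -y install --no-install-recommends \
    autoconf automake bison build-essential ethtool g++ hwloc \
    ibverbs-utils infiniband-diags initramfs-tools iproute2 \
    iputils-ping kmod libibverbs-dev libibumad-dev libncurses5-dev \
    librdmacm-dev libsysfs-dev libtool make net-tools pciutils \
    plocate strace sudo vim wget

# If dash is installed, reconfigure it so it is no longer the default /bin/sh.
# POSIX redirection is used because RUN executes this script with /bin/sh.
which dash > /dev/null 2>&1 && (\
    echo "dash dash/sh boolean false" | debconf-set-selections && \
     dpkg-reconfigure dash) || \
    echo "Skipping dash reconfigure (not applicable)"

# Unpack the driver source.
ls /tmp/bnxt-drivers
tar -xzf /tmp/bnxt-drivers/${BNXT_NAME}.tar.gz -C /tmp/bnxt-drivers/
# Rename the stock bnxt_re provider so it no longer shadows the new build.
mv /usr/lib/x86_64-linux-gnu/libibverbs/libbnxt_re-rdmav34.so /usr/lib/x86_64-linux-gnu/libibverbs/libbnxt_re-rdmav34.so.hidden
# Configure, build, and install the library.
cd /tmp/bnxt-drivers/${BNXT_NAME}/ && sh ./autogen.sh && ./configure
make -C /tmp/bnxt-drivers/${BNXT_NAME} clean all install
# Make the dynamic linker search /usr/local/lib.
echo '/usr/local/lib' > /etc/ld.so.conf.d/libbnxt_re.conf
ldconfig

EOR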
```
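
After the build, two things should hold inside the image: the distribution's stock `libbnxt_re-rdmav34.so` provider has been renamed out of the way, and a freshly built library is available under `/usr/local/lib`. The sketch below checks both. It assumes the build installs a file matching `libbnxt_re*` under `/usr/local/lib`, which the `ldconfig` entry in the Dockerfile suggests but which is not verified here, and `check_bnxt_install` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: sanity-check the result of the Broadcom driver build.
# Arguments: <libibverbs provider dir> <local lib dir>
check_bnxt_install() {
  local verbs_dir=$1 local_lib=$2
  local stock="$verbs_dir/libbnxt_re-rdmav34.so"

  # The Dockerfile renames the stock provider to *.hidden.
  if [ -e "$stock" ]; then
    echo "stock provider still active: $stock"
    return 1
  fi

  # Assumption: the rebuilt library lands in the local lib dir as libbnxt_re*.
  if ! ls "$local_lib"/libbnxt_re* > /dev/null 2>&1; then
    echo "no libbnxt_re library under $local_lib"
    return 1
  fi

  echo "ok"
}
```

Inside the built image, this could be called as `check_bnxt_install /usr/lib/x86_64-linux-gnu/libibverbs /usr/local/lib`. Taking both directories as arguments keeps the check usable on images that lay out their libraries differently.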

### Mounting Worker-Pod Network Software with CDI

> **Note**: This feature is in beta.

If you are using TensorWave's Managed Slurm, we provide a `/etc/cdi/tw.json` file to mount the userspace InfiniBand software into an image. This has been validated to work with Apptainer using images built from [rocm/pytorch with Ubuntu 22.04](https://hub.docker.com/r/rocm/pytorch/tags?name=ubuntu22.04). To enable it, pass the appropriate `--device` option when launching an Apptainer image.

Here's an example sbatch script that runs a PyTorch RCCL benchmark. The `--device` value is either `tw.amd.com/bnxt=bnxt` or `tw.amd.com/ainic=ainic`, depending on the cluster's NICs. There is also an example in the Slurm login environment at `/opt/tw/examples/libexec/rccl-torch-apptainer.sbatch`.

```bash
#!/bin/bash
#SBATCH --job-name=torch_all_reduce_apptainer_cdi
#SBATCH --output=jid-%j.name-%x.log
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=128
#SBATCH --nodes=4

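# Pick the CDI device: the presence of the libbnxt_re tarball is used to
# detect a Broadcom cluster; otherwise the AINIC device is assumed.
if [[ -e "/opt/tw/drivers/libbnxt_re-233.0.152.2.tar.gz" ]]; then
    CDI_DEVICE="tw.amd.com/bnxt=bnxt"
else
    CDI_DEVICE="tw.amd.com/ainic=ainic"
fi

# The node running this batch script acts as the rendezvous host.
# In the heredoc below, \$-escaped variables expand inside the container;
# unescaped ones are filled in by this script before the tasks launch.
MASTER_ADDR=$(cat /etc/hostname)
srun apptainer exec  \
  --device "$CDI_DEVICE" \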
  --bind /opt/tw/examples/bin \
  docker://rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.10.0 \
  bash -s <<EOF

LOCAL_ADDR=\$(hostname)
REMOTE_ADDR=${MASTER_ADDR}
IS_HOST=0
if [ "\$LOCAL_ADDR" == "${MASTER_ADDR}" ];then
  IS_HOST=1
  REMOTE_ADDR=localhost
fi

/opt/venv/bin/python -u -m torch.distributed.run \
 --nproc_per_node 8 \
 --nnodes ${SLURM_NNODES} \
 --rdzv_endpoint \${REMOTE_ADDR}:6000 \
 --rdzv_backend c10d \
 --max_restarts 0 \
 --rdzv_id=1 \
 --rdzv_conf=is_host=\$IS_HOST \
 --local_addr "\${LOCAL_ADDR}" \
 /opt/tw/examples/bin/rccl-bench.py --collective all_reduce

EOF
```
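
The script makes two decisions that are easy to get wrong when adapting it: which CDI device to request, and how each node addresses the rendezvous host (the host itself uses `localhost` and sets `is_host=1`, while every other node points at the master's hostname). The sketch below restates that logic as two standalone functions so it can be exercised without a cluster. The function names are hypothetical, and the glob on `libbnxt_re-*.tar.gz` is an assumption that generalizes the script's check for one specific tarball version.

```bash
#!/usr/bin/env bash
# Hypothetical helper: choose the CDI device from the contents of a drivers dir.
select_cdi_device() {
  local drivers_dir=$1
  # Assumption: any libbnxt_re tarball indicates a Broadcom cluster.
  if ls "$drivers_dir"/libbnxt_re-*.tar.gz > /dev/null 2>&1; then
    echo "tw.amd.com/bnxt=bnxt"
  else
    echo "tw.amd.com/ainic=ainic"
  fi
}

# Hypothetical helper: print "<is_host> <rendezvous address>" for one node.
rdzv_settings() {
  local local_addr=$1 master_addr=$2
  if [ "$local_addr" = "$master_addr" ]; then
    # The rendezvous host talks to itself over localhost.
    echo "1 localhost"
  else
    echo "0 $master_addr"
  fi
}
```

In a script, the second helper could be consumed with `read -r IS_HOST REMOTE_ADDR < <(rdzv_settings "$(hostname)" "$MASTER_ADDR")`, and the first with `CDI_DEVICE=$(select_cdi_device /opt/tw/drivers)`.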

