# Network Drivers

### How to get Docker images up and running on TensorWave clusters.

To run multi-node containerized applications, two things need to happen:

1. make sure host-network passthrough is enabled, and
2. make sure the proper drivers exist in the image.

Enabling host network passthrough is straightforward. When starting a Docker container, add `--network host` to the command, e.g. `docker run --network host rocm/pytorch:latest`. When configuring Kubernetes, add `hostNetwork: true` to the pod spec. Here's an example of a pod with a host network configured:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: host-network-example
spec:
  hostNetwork: true # <=== this enables host network
  containers:
    - name: pytorch
      image: tensorwavehq/pytorch:latest
```

If you're using Apptainer or Pyxis on TensorWave's Slurm platform, host networking is configured by default. However, you may still need to ensure the proper drivers are installed in your image. The sections below provide instructions for driver installation on our two platforms.

> **Note** If you're using Apptainer in Slurm, you can use TensorWave's provided container device interface (CDI) spec to mount the worker pod's network software into your container. See the CDI section below for details.

### AINIC (MI355X)

The MI355X nodes use AMD Pollara NICs for the backend network. AMD provides their NIC software as a binary. To install it, add the upstream apt source and apt-install the appropriate software.

Below, we provide a sample Dockerfile that installs the AINIC drivers inside AMD's ROCm PyTorch image. Copy the file to your local machine, and run `docker build -f <Dockerfile> -t rocm-pytorch-ainic .`.
You could also add this section as a build stage in your own Dockerfile build pipeline. Add it to the top of your Dockerfile, and update the appropriate `FROM <image> AS <stage>` lines to fold it into the build pipeline.

```dockerfile
FROM rocm/pytorch:latest AS rocm-pytorch-ainic

ENV DEBIAN_FRONTEND=noninteractive
WORKDIR /tmp

ARG REPO_URL=https://repo.radeon.com
ARG DRIVERS_VERSION=1.117.5-a-56

RUN <<EOR
set -eux
# Read the Ubuntu codename of the base image from /etc/os-release
UBUNTU_CODENAME=$(awk -F= '/^UBUNTU_CODENAME=/{gsub(/"/,"",$2); print $2}' /etc/os-release)

# Install the signing key for the upstream apt repository
mkdir --parents --mode=0755 /etc/apt/keyrings
wget ${REPO_URL}/rocm/rocm.gpg.key -O - | gpg --dearmor | tee /etc/apt/keyrings/rocm.gpg > /dev/null

# Register the AINIC apt source
echo "Types: deb" > /etc/apt/sources.list.d/amdainic.sources
echo "URIs: ${REPO_URL}/amdainic/pensando/ubuntu/${DRIVERS_VERSION}" >> /etc/apt/sources.list.d/amdainic.sources
echo "Suites: ${UBUNTU_CODENAME}" >> /etc/apt/sources.list.d/amdainic.sources
echo "Components: main" >> /etc/apt/sources.list.d/amdainic.sources
echo "Signed-By: /etc/apt/keyrings/rocm.gpg" >> /etc/apt/sources.list.d/amdainic.sources

# Install the userspace RDMA stack and the AINIC (ionic) libraries
apt-get update && apt-get install -y bc jq libibverbs-dev rdma-core ibverbs-utils libionic-dev libionic1 perftest libfmt-dev
EOR
```

### Broadcom (MI325X & MI300X)

TensorWave MI300X and MI325X clusters use Broadcom Ethernet cards for the backend network. These require building Broadcom's driver from source code.

The following Dockerfile builds the Broadcom drivers into AMD's ROCm PyTorch image. It can be used from a Slurm login node. Copy the file to your login environment and run `docker build -f <Dockerfile> -t rocm-pytorch-bnxt .`. If you don't have access to TensorWave's Slurm platform and need the libbnxt\_re tarfile, reach out to your cluster administrator.

```dockerfile
FROM rocm/pytorch:latest AS rocm-pytorch-bnxt

ENV DEBIAN_FRONTEND=noninteractive
WORKDIR /tmp

ARG BNXT_VER=233.0.152.2
ARG BNXT_NAME=libbnxt_re-${BNXT_VER}

COPY /opt/tw/drivers/${BNXT_NAME}.tar.gz /tmp/bnxt-drivers/

RUN <<EOR
set -eux
export
apt-get -qq update
apt-get -qq -y install --no-install-recommends \
    autoconf automake bison build-essential ethtool g++ hwloc \
    ibverbs-utils infiniband-diags initramfs-tools iproute2 \
    iputils-ping kmod libibverbs-dev libibumad-dev libncurses5-dev \
    librdmacm-dev libsysfs-dev libtool make net-tools pciutils \
    plocate strace sudo vim wget

# Reconfigure dash only if it is installed (POSIX redirection, since RUN uses /bin/sh)
which dash > /dev/null 2>&1 && (\
    echo "dash dash/sh boolean false" | debconf-set-selections && \
    dpkg-reconfigure dash) || \
    echo "Skipping dash reconfigure (not applicable)"

# Unpack the driver sources
ls /tmp/bnxt-drivers
tar -xzf /tmp/bnxt-drivers/${BNXT_NAME}.tar.gz -C /tmp/bnxt-drivers/

# Hide the distribution's bnxt_re provider so the freshly built one is used
mv /usr/lib/x86_64-linux-gnu/libibverbs/libbnxt_re-rdmav34.so /usr/lib/x86_64-linux-gnu/libibverbs/libbnxt_re-rdmav34.so.hidden

# Build and install the driver, then refresh the linker cache
cd /tmp/bnxt-drivers/${BNXT_NAME}/ && sh ./autogen.sh && ./configure
make -C /tmp/bnxt-drivers/${BNXT_NAME} clean all install
echo '/usr/local/lib' > /etc/ld.so.conf.d/libbnxt_re.conf
ldconfig
EOR
```

### Mounting Worker-Pod Network Software with CDI

> **Note**: This feature is in beta.

If you are using TensorWave's Managed Slurm, we provide a `/etc/cdi/tw.json` file to mount the userspace InfiniBand software into an image. This has been validated to work with Apptainer using images built off [rocm/pytorch with Ubuntu22.04](https://hub.docker.com/r/rocm/pytorch/tags?name=ubuntu22.04). To enable it, pass the appropriate `--device` option when launching an Apptainer image.

Here's an example sbatch script running a PyTorch RCCL benchmark. The `--device` value is either `tw.amd.com/bnxt=bnxt` or `tw.amd.com/ainic=ainic`, depending on the cluster's NICs. There is also an example in the Slurm login environment at `/opt/tw/examples/libexec/rccl-torch-apptainer.sbatch`.

```bash
#!/bin/bash
#SBATCH --job-name=torch_all_reduce_apptainer_cdi
#SBATCH --output=jid-%j.name-%x.log
#SBATCH --tasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=128
#SBATCH --nodes=4

# Pick the CDI device that matches the cluster's NICs
if [[ -e "/opt/tw/drivers/libbnxt_re-233.0.152.2.tar.gz" ]]; then
    CDI_DEVICE="tw.amd.com/bnxt=bnxt"
else
    CDI_DEVICE="tw.amd.com/ainic=ainic"
fi

MASTER_ADDR=$(cat /etc/hostname)

srun apptainer exec \
    --device "$CDI_DEVICE" \
    --bind /opt/tw/examples/bin \
    docker://rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.10.0 \
    bash -s <
# NOTE: the script is cut off here; the input that follows `bash -s <` is missing from this page.
```
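
The NIC check in the sbatch script above tests for one pinned driver tarball, so it needs editing whenever the driver version changes. As a minimal sketch, the same check could be wrapped in a small shell function that matches any `libbnxt_re-*.tar.gz` file. The function name `pick_cdi_device` and the glob pattern are hypothetical and not part of TensorWave's tooling; the sketch also assumes, as the script does, that a libbnxt\_re tarball under `/opt/tw/drivers` indicates a Broadcom cluster.

```shell
#!/bin/sh
# Hypothetical helper: choose the value for `apptainer exec --device`.
# Assumption (mirrors the sbatch script above): a libbnxt_re tarball in the
# drivers directory means Broadcom NICs; anything else is treated as AINIC.
pick_cdi_device() {
    drivers_dir="${1:-/opt/tw/drivers}"
    for tarball in "$drivers_dir"/libbnxt_re-*.tar.gz; do
        # An unmatched glob stays literal, so -e fails and we fall through.
        if [ -e "$tarball" ]; then
            echo "tw.amd.com/bnxt=bnxt"
            return 0
        fi
    done
    echo "tw.amd.com/ainic=ainic"
}

CDI_DEVICE=$(pick_cdi_device)
echo "Using CDI device: $CDI_DEVICE"
```

The optional argument overrides the drivers directory, which makes the function easy to try outside a cluster. On a machine without `/opt/tw/drivers` it falls back to the AINIC device name, so treat the result as a hint rather than a guarantee of the hardware present.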