Network Drivers

How to get Docker images up and running on TensorWave clusters.

Running multi-node containerized applications requires two things: 1. host-network passthrough must be enabled, and 2. the proper network drivers must exist in the image.

Enabling host network passthrough is straightforward. When starting a Docker container, add --network host to the command, e.g. docker run --network host rocm/pytorch:latest. When configuring Kubernetes, add hostNetwork: true to the pod spec. Here's an example of a pod configured with host networking:

apiVersion: v1
kind: Pod
metadata:
  name: host-network-example
spec:
  hostNetwork: true # <=== this enables host network
  containers:
  - name: pytorch
    image: tensorwavehq/pytorch:latest
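
One way to confirm that host networking took effect is to compare the network interfaces visible on the host with those visible inside the container. The following is a minimal shell sketch, assuming sysfs is mounted at /sys inside the container. The helper names list_ifaces and same_ifaces are hypothetical and are not part of any TensorWave tooling.

```shell
#!/usr/bin/env bash

# List the interface names found under a sysfs net directory, sorted.
# Defaults to /sys/class/net; the optional argument makes the helper testable.
list_ifaces() {
  ls -1 "${1:-/sys/class/net}" | LC_ALL=C sort
}

# Succeed only when both interface lists are non-empty and identical.
same_ifaces() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example (requires Docker, so it is left commented out):
#   host=$(list_ifaces)
#   ctr=$(docker run --rm --network host rocm/pytorch:latest \
#           ls -1 /sys/class/net | LC_ALL=C sort)
#   same_ifaces "$host" "$ctr" && echo "container sees the host's interfaces"
```

Without host networking, a Docker container typically sees only lo and a single virtual interface, so the two lists would differ. For a Kubernetes pod, the container-side list could be gathered with kubectl exec host-network-example -- ls -1 /sys/class/net instead.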

If you're using Apptainer or Pyxis on TensorWave's Slurm platform, host networking is configured by default. However, you may still need to ensure the proper drivers are installed in your image. The sections below provide driver installation instructions for each of our two platforms.

Note: If you're using Apptainer in Slurm, you can use TensorWave's provided container device interface (CDI) spec to mount the worker pod's network software into the container. See the CDI section below for details.

AINIC (MI355X)

The MI355X nodes use AMD Pollara NICs for the backend network. AMD distributes its NIC software as binary packages. To install it, add the upstream apt source and install the appropriate packages with apt.

Below, we provide a sample Dockerfile that installs the AINIC drivers inside AMD's ROCm PyTorch image.

Copy the file to your local machine and run:

docker build -f <dockerfile> -t rocm-pytorch-ainic .

You could also add this section as a build stage in your own Dockerfile build pipeline. Add it to the top of your Dockerfile, and update the appropriate FROM <image> AS <build-stage> lines to fold it into the build pipeline.

Broadcom (MI325X & MI300X)

TensorWave MI300X and MI325X clusters use Broadcom Ethernet cards for the backend network. These require building Broadcom's driver from source.

The following Dockerfile builds the Broadcom drivers into AMD's ROCm PyTorch image. It can be used from a Slurm login node. Copy the file to your login environment and run:

docker build -f <dockerfile> -t rocm-pytorch-bnxt .

If you don't have access to TensorWave's Slurm platform and need the libbnxt_re tarfile, reach out to your cluster administrator.

Mounting Worker-Pod Network Software with CDI

Note: This feature is in beta.

If you are using TensorWave's Managed Slurm, we provide a /etc/cdi/tw.json file that mounts the userspace InfiniBand software into an image. This has been validated with Apptainer using images built from rocm/pytorch on Ubuntu 22.04. To enable it, pass the appropriate --device option when launching an Apptainer image.

Here's an example sbatch script that runs a PyTorch RCCL benchmark. The --device value is either tw.amd.com/bnxt=bnxt or tw.amd.com/ainic=ainic, depending on the cluster's NICs. There is also an example in the Slurm login environment at /opt/tw/examples/libexec/rccl-torch-apptainer.sbatch.
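
As a rough illustration of how the --device value might be selected and passed, here is a minimal sbatch sketch. It is not the TensorWave-provided example. The node counts, the cdi_device_for helper, and the NIC_TYPE, IMAGE, and BENCH_SCRIPT variables are all assumptions made for illustration; only the two --device values come from the text above.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=rccl-sketch
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

# Map a NIC family name to the matching CDI device value.
# (Hypothetical helper; the two values are the ones listed in this section.)
cdi_device_for() {
  case "$1" in
    bnxt)  echo "tw.amd.com/bnxt=bnxt" ;;
    ainic) echo "tw.amd.com/ainic=ainic" ;;
    *)     echo "unknown NIC type: $1" >&2; return 1 ;;
  esac
}

# Launch only when running inside a Slurm job, so the file can be
# sourced elsewhere without side effects.
if [ -n "${SLURM_JOB_ID:-}" ]; then
  # NIC_TYPE, IMAGE and BENCH_SCRIPT are assumed to be set by the submitter.
  device=$(cdi_device_for "${NIC_TYPE:-bnxt}") || exit 1
  srun apptainer exec --device "$device" \
    "${IMAGE:?set IMAGE to your .sif image}" \
    python "${BENCH_SCRIPT:?set BENCH_SCRIPT to your benchmark script}"
fi
```

Under these assumptions, a submission on an MI355X cluster might look like NIC_TYPE=ainic IMAGE=my-image.sif BENCH_SCRIPT=bench.py sbatch sketch.sbatch, where the file names are placeholders.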