Running Jobs in Pyxis

TensorWave Slurm integrates Pyxis, a SPANK plugin for Slurm that lets you run containerized workloads directly within your jobs.

This integration lets you launch distributed AI or HPC jobs inside optimized ROCm containers while maintaining full GPU, RDMA, and filesystem performance.

Containers are the preferred way to run workloads in TensorWave Slurm. They ensure consistent environments across nodes, simplify dependency management, and let you reproduce results reliably.


Running Your First Containerized Job

In this example, you’ll run a multi-node RCCL performance test using Pyxis. This verifies that your containerized environment can access GPUs and RDMA interfaces, and that Slurm’s MPI/PMIx orchestration works across nodes.

  1. Create a new job script named rccl-pyxis.sbatch:

#!/bin/bash
#SBATCH --job-name=rccl_multi_node
#SBATCH --output=results/rccl_multi_node-%j.out
#SBATCH --error=results/rccl_multi_node-%j.out
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH --nodes=4   # default node count; overridden by sbatch --nodes at submit time

CONTAINER_IMAGE='tensorwavehq/pytorch-bnxt:rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2'

# RCCL / UCX network tuning for the RDMA fabric
export NCCL_IB_QPS_PER_CONNECTION=2
export NCCL_BUFFSIZE=8388608
export UCX_NET_DEVICES=eno0

# Reduce unnecessary log noise when running with Pyxis
export OMPI_MCA_btl=^openib
export PMIX_MCA_gds=hash
export UCX_WARN_UNUSED_ENV_VARS=n

srun --mpi=pmix \
  --container-writable \
  --container-name=rccl-pyxis-run \
  --container-image=${CONTAINER_IMAGE} \
  /usr/local/bin/rccl-tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1

  2. Submit the job to Slurm:

sbatch --nodes=<number-of-nodes> rccl-pyxis.sbatch
  3. Monitor progress:

squeue -u $USER

Once the job completes, its results appear under the results/ directory, with each job’s combined output and error log named using the Slurm job ID.
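As an end-to-end sketch of the workflow above (the node count and job ID are illustrative), note that Slurm does not create the output directory for you, so create results/ before submitting:

mkdir -p results                              # Slurm will not create the output directory itself
sbatch --nodes=4 rccl-pyxis.sbatch            # prints "Submitted batch job <jobid>"
squeue -u $USER                               # check that the job is pending or running
tail -f results/rccl_multi_node-123456.out    # follow the log; 123456 is an illustrative job ID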


Using a Pre-Staged SquashFS Image

Instead of pulling a container from a registry, you can point Slurm directly to a pre-staged SquashFS (.sqsh) image. This is often faster and preferred for large models or shared environments.

Example:

# You can set the image in the previous example to a local .sqsh file
CONTAINER_IMAGE='tensorwavehq+pytorch-bnxt+rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2.sqsh'

Using a local .sqsh file avoids repeated network pulls and ensures consistent environments across jobs.
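Pyxis is backed by enroot, so you can create the .sqsh file yourself by importing the image from the registry. A minimal sketch, assuming enroot is available on the node where you stage images:

# Import the Docker image; enroot writes a SquashFS file in the current directory,
# replacing '/' and ':' in the image reference with '+' (matching the filename above)
enroot import docker://tensorwavehq/pytorch-bnxt:rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2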


Pyxis Flags

Pyxis extends Slurm with several container-related flags that control how your job interacts with the container environment. Below are the most commonly used options:

--container-image
Specifies the container to run. Accepts Docker/OCI URLs or local .sqsh images.

--container-writable
Makes the container filesystem writable during execution. Useful for logs, checkpoints, or temporary files.

--container-mounts=/src:/dst[,/src2:/dst2]
Binds local or shared directories into the container. Multiple mounts can be separated by commas.

--container-workdir=/path
Sets the working directory inside the container (defaults to /).

--container-name=<name>
Assigns a name to the running container instance, useful for debugging or monitoring.
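These flags can be combined on a single srun. A hypothetical sketch that mounts shared directories, sets the working directory, and names the container (the mount paths, container name, and training script are placeholders, not part of the cluster setup):

srun --mpi=pmix \
  --container-image=${CONTAINER_IMAGE} \
  --container-mounts=/mnt/shared/datasets:/data,/mnt/shared/checkpoints:/ckpt \
  --container-workdir=/workspace \
  --container-writable \
  --container-name=training-run \
  python train.py --data-dir /data --checkpoint-dir /ckpt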


Learn More

For advanced configuration options and the full list of supported flags, see the official containers documentation from SchedMD: https://slurm.schedmd.com/containers.html
