Running Containerized Jobs with Pyxis

TensorWave Slurm integrates Pyxis, a container runtime plugin for Slurm that enables users to run containerized workloads directly within their jobs.

This integration lets you launch distributed AI or HPC jobs inside optimized ROCm containers while maintaining full GPU, RDMA, and filesystem performance. Containers provide environment isolation, ensure consistent environments across nodes, simplify dependency management, and let you reproduce results reliably.


Running Your First Containerized Job

In this example, you’ll run a PyTorch matmul test using Pyxis. Because the job runs on a single node, it mainly verifies that the containerized environment can access the node's GPUs; it does not exercise RDMA interfaces or multi-node MPI launches.

  1. Create a Python script to measure matmul performance. Copy the following code block into a file named torch_matmul.py.

  2. Create a new job script named torch-matmul-pyxis.sbatch:

    #!/bin/bash
    #SBATCH --job-name=torch_matmul
    #SBATCH --output=jid-%j.name-%x.log
    #SBATCH --gpus-per-node=8
    #SBATCH -N1
    
    # Script created in step 1.
    MATMUL_PY="$PWD/torch_matmul.py" 
    # pytorch-rocm image from Docker Hub, published by AMD
    CONTAINER_IMAGE='rocm/pytorch:rocm7.1.1_ubuntu22.04_py3.10_pytorch_release_2.9.1'
    CONTAINER_NAME="pytorch_matmul_test"
    
    # Download the image and instantiate the container
    srun --container-name=$CONTAINER_NAME --container-image=$CONTAINER_IMAGE true
    
    # Run the benchmark
    srun --container-writable \
      --container-name=$CONTAINER_NAME \
      --container-mounts="$MATMUL_PY:/root/torch_matmul.py" \
      /opt/venv/bin/python /root/torch_matmul.py
    
    # Save the image to disk for use later
    srun --container-name=$CONTAINER_NAME \
      --container-save=$PWD/torch-matmul.sqsh \
      true

  3. Submit the job. Here's an example run:

    $ sbatch torch-matmul-pyxis.sbatch
    Submitted batch job 90
    $ tail -f jid-90.name-torch_matmul.log
    pyxis: importing docker image: rocm/pytorch:rocm7.1.1_ubuntu22.04_py3.10_pytorch_release_2.9.1
    pyxis: imported docker image: rocm/pytorch:rocm7.1.1_ubuntu22.04_py3.10_pytorch_release_2.9.1
    Device: AMD Instinct MI325X
    n= 1024  144.42 TFLOPs
    n= 2048  466.27 TFLOPs
    n= 4096  640.01 TFLOPs
    n= 8192  763.36 TFLOPs
    pyxis: exported container pyxis_90_pytorch_matmul_test to /home/<user>/snpyxis/torch-matmul.sqsh
    ^C
    $ ls
    jid-90.name-torch_matmul.log  torch-matmul-pyxis.sbatch  torch-matmul.sqsh  torch_matmul.py
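Step 1 refers to a torch_matmul.py script that is not reproduced on this page. As a hypothetical stand-in (not the original script), the shell snippet below writes a minimal benchmark that times square matrix multiplications and reports throughput. The bfloat16 data type, the iteration counts, and the matrix sizes are assumptions chosen for illustration, so the numbers it prints will not necessarily match the example run above.

```shell
#!/bin/bash
# Write a stand-in torch_matmul.py into the current directory.
# NOTE: hypothetical sketch, not the script used for the example run above.
cat > torch_matmul.py <<'EOF'
import time

import torch


def bench(n, dtype=torch.bfloat16, iters=50):
    """Return achieved TFLOP/s for an n x n matmul (dtype and iters are assumptions)."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(5):            # warm-up, not timed
        torch.matmul(a, b)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize()      # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start
    return 2 * n**3 * iters / elapsed / 1e12   # an n x n matmul costs about 2*n^3 FLOPs


if __name__ == "__main__":
    # ROCm builds of PyTorch expose AMD GPUs through the "cuda" device type.
    print("Device:", torch.cuda.get_device_name(0))
    for n in (1024, 2048, 4096, 8192):
        print(f"n={n:5d}  {bench(n):.2f} TFLOPs")
EOF
echo "wrote $PWD/torch_matmul.py"
```

Run the snippet in the directory that holds torch-matmul-pyxis.sbatch, since the job script looks for the file at $PWD/torch_matmul.py.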

Key features of the torch-matmul-pyxis.sbatch file (line numbers refer to lines of the script, counting from #!/bin/bash):

  • Container images are pulled automatically from Docker Hub when --container-image names a registry image.

  • Named containers persist across srun invocations.

    • Line 14: the first srun creates the "pytorch_matmul_test" container.

    • Line 18: the second srun reuses the container by name to run the benchmark.

    • Line 23: the third srun saves the container to disk for later reuse.

  • Line 19: the --container-mounts flag bind-mounts host files and directories into the container. Here it makes torch_matmul.py available at /root/torch_matmul.py.

  • Line 24: the --container-save flag writes the "pytorch_matmul_test" container to a squashfs file.


Using a Pre-Staged SquashFS Image

Instead of pulling a container from a registry, you can point Slurm directly to a pre-staged SquashFS (.sqsh) image. This is often faster and preferred for large models or shared environments, as using a local .sqsh file avoids repeated network pulls and ensures consistent environments across jobs.

To use a squashfs file in the torch-matmul-pyxis.sbatch example, set CONTAINER_IMAGE to the path of the .sqsh file instead of a registry image name.

Squashfs files can be generated with the --container-save flag or with enroot. See the Managing Container Images with Enroot page for more info on generating .sqsh images.
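As a rough sketch of that change, the snippet below prefers a local .sqsh file when one exists and otherwise falls back to the registry image. The pick_image helper is hypothetical and not part of Pyxis or Slurm; the paths and image name are the ones from the example job above. It assumes the .sqsh file sits on a filesystem that the compute nodes can read.

```shell
#!/bin/bash
# Hypothetical helper (not part of Pyxis): print the local SquashFS path
# if the file exists, otherwise print the registry image reference.
pick_image() {
  local sqsh_path="$1" registry_ref="$2"
  if [ -f "$sqsh_path" ]; then
    printf '%s\n' "$sqsh_path"
  else
    printf '%s\n' "$registry_ref"
  fi
}

# File written by --container-save in the example job.
SQSH_IMAGE="$PWD/torch-matmul.sqsh"
REGISTRY_IMAGE='rocm/pytorch:rocm7.1.1_ubuntu22.04_py3.10_pytorch_release_2.9.1'

CONTAINER_IMAGE="$(pick_image "$SQSH_IMAGE" "$REGISTRY_IMAGE")"
echo "Using container image: $CONTAINER_IMAGE"

# The srun line from the job script stays the same:
#   srun --container-name=$CONTAINER_NAME --container-image=$CONTAINER_IMAGE true
```

Using an absolute path, as $PWD gives here, avoids any ambiguity between a file path and a registry image name.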


Pyxis Flags

Pyxis extends Slurm with several container-related flags that control how your job interacts with the container environment. Below are the most commonly used options:

Flag                                        Description
------------------------------------------  -----------
--container-image                           Specifies the container image to run. Accepts Docker/OCI image references or local .sqsh images.
--container-writable                        Makes the container filesystem writable during execution. Useful for logs, checkpoints, or temporary files.
--container-mounts=/src:/dst[,/src2:/dst2]  Bind-mounts local or shared files and directories into the container. Separate multiple mounts with commas.
--container-workdir=/path                   Sets the working directory inside the container (defaults to /).
--container-name=<name>                     Names the container so that later srun invocations can reuse it instead of importing the image again.
--container-save=PATH                       Saves the container state to a squashfs file on the remote host filesystem.
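To illustrate the comma-separated --container-mounts syntax, the sketch below joins several SRC:DST pairs into a single flag value. The join_mounts helper and the /data path are hypothetical; neither is part of Pyxis.

```shell
#!/bin/bash
# Hypothetical helper: join SRC:DST pairs with commas, the separator
# that --container-mounts expects between multiple mounts.
join_mounts() {
  local IFS=','
  printf '%s\n' "$*"   # "$*" joins the arguments with the first character of IFS
}

MOUNTS="$(join_mounts \
  "$PWD/torch_matmul.py:/root/torch_matmul.py" \
  "/data:/data")"      # /data is an assumed shared directory
echo "--container-mounts=$MOUNTS"

# The value could then be passed to a job step, for example:
#   srun --container-name=$CONTAINER_NAME --container-mounts="$MOUNTS" \
#     --container-workdir=/root /opt/venv/bin/python torch_matmul.py
```

Keeping the mounts in one variable makes it easier to reuse the same set across several srun steps in a job script.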


Learn More

For advanced configuration options and the full list of supported flags, see the official containers documentation from SchedMD: https://slurm.schedmd.com/containers.html
