Slurm Quickstart

Slurm provides a multi-tenant framework for managing compute resources and jobs that span large clusters.

Overview

TensorWave Slurm combines the power of Slurm, the industry-standard workload manager for HPC and AI, with the flexibility of a Kubernetes-native orchestration layer. This integration delivers a modern, multi-tenant environment that scales seamlessly across AMD Instinct GPU clusters—enabling teams to run distributed training, fine-tuning, and simulation workloads without managing the underlying infrastructure.

With TensorWave, you get the familiar Slurm interface running on top of a cloud-native control plane that provides automated scheduling, easy scaling, and container-based execution.


Why Slurm on Kubernetes?

Traditional Slurm deployments were designed for static on-prem clusters. TensorWave modernizes that model by running Slurm inside Kubernetes, unlocking:

  • Scalable compute pools — resize your Slurm cluster within your Kubernetes environment.

  • Container-native workflows — integrate directly with your existing Docker or Enroot environments.

  • Multi-tenant isolation — each user or team runs in a secure namespace with defined resource limits.

The result is a unified, cloud-native scheduling experience that bridges HPC scalability with Kubernetes reliability.


Quickstart Example

1. Connect to Your Login Node

Each Slurm environment provides a login node, your interactive entry point for running Slurm commands. Behind the scenes, this login node runs as a managed Kubernetes pod, exposing the same Slurm interface (srun, sinfo, sbatch) with the benefits of cloud-native orchestration. Connect over SSH:

ssh <username>@<slurm-login-endpoint>

Once connected, you’ll have access to all standard Slurm utilities and your team’s partitions. From here, you can submit jobs, monitor queues, and launch multi-node workloads just as you would on a traditional HPC cluster.
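
For example, a few everyday commands you might run from the login node (a minimal sketch; the script name is a placeholder):

# Show your queued and running jobs
squeue -u $USER

# Submit a batch script (placeholder file name) and note the job ID it prints
sbatch my_job.sbatch

# Cancel a job if needed
scancel <job-id>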


2. Inspect Available Resources

List available partitions and node states:

sinfo

Example output:

PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpuworker*     up   infinite   1024   idle compute-[0-1023]

Even though these nodes are dynamically managed by Kubernetes, the Slurm CLI is identical to what you would use on a traditional HPC cluster.
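
To drill into a single node, for instance to see its GPU (GRES) configuration and current allocation state, you can query it directly (a sketch using a node name from the example output above):

# Show detailed state for one compute node, including its generic resources (GPUs)
scontrol show node compute-0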


3. Launch a Multi-Node Job

To verify connectivity and RDMA functionality, run a distributed RCCL test across four nodes (32 GPUs total):

srun -N4 \
  --mpi=pmix \
  --ntasks-per-node=8 \
  --gpus-per-node=8 \
  --cpus-per-task=16 \
  /usr/local/bin/rccl-tests/build/all_reduce_perf -b 32 -e 8g -f 2 -g 1

Slurm automatically handles:

  • GPU and node allocation

  • Network interface binding

  • MPI coordination
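
For a quick sanity check that an allocation really spans four distinct nodes, you can run a lightweight command across the same layout before launching the full benchmark (a minimal sketch):

# Print the hostname of one task on each of the four allocated nodes
srun -N4 --ntasks-per-node=1 hostname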

Containers and More Info

Often, users will want to run their HPC payloads in containers. You can learn more about this in the Pyxis Quickstart section.


Enroot Containers

TensorWave Slurm uses Enroot as its lightweight, high-performance container runtime for HPC and AI workloads.

Unlike traditional container engines, Enroot runs entirely in user space with no privileged daemons or root access required, making it ideal for multi-tenant and secure compute environments.

Enroot executes standard Docker or OCI images as unprivileged user processes, unpacking each image into an isolated filesystem that can be shared across nodes. It preserves direct access to GPUs, high-speed interconnects, and local storage, ensuring your jobs are performant inside containers.

You don’t need to run Enroot commands directly; TensorWave Slurm handles that automatically through Pyxis, which integrates Enroot with familiar Slurm tools like srun and sbatch. Together, they allow you to launch containerized jobs using the same workflow you already know with the added benefits of portability and reproducibility.
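
For example, a containerized launch looks like a normal srun invocation with one extra flag (a minimal sketch using a small public image as a stand-in for your own):

# Run a single command inside a container that Pyxis pulls and unpacks via Enroot
srun --container-image=ubuntu:22.04 grep PRETTY_NAME /etc/os-release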


Optional: Importing an Image with Enroot

Although Pyxis automatically handles Enroot under the hood, you can manually import container images for debugging or pre-caching.

For example, to pull and unpack a PyTorch ROCm image locally:

# Import a container image into Enroot's SquashFS format
enroot import docker://tensorwavehq/pytorch-bnxt:rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2

# Start the container as an unprivileged user process
enroot start tensorwavehq+pytorch-bnxt+rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2.sqsh

This workflow downloads the image, converts it into an Enroot container bundle, and runs it as an unprivileged user process.

You’ll typically never need to do this when submitting jobs through Pyxis, but it’s a useful way to verify container contents or pre-stage larger images.
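
As a quick way to verify container contents, you can run a one-off command inside the imported image (a sketch; it assumes the image ships PyTorch, as its tag suggests):

# Spot-check the bundled PyTorch version inside the unpacked image
enroot start tensorwavehq+pytorch-bnxt+rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2.sqsh \
  python3 -c 'import torch; print(torch.__version__)'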

Running Jobs in Pyxis

TensorWave Slurm integrates Pyxis, a container runtime plugin for Slurm that enables users to run containerized workloads directly within their jobs.

This integration lets you launch distributed AI or HPC jobs inside optimized ROCm containers while maintaining full GPU, RDMA, and filesystem performance.

Containers are the preferred way to run workloads in TensorWave Slurm. They ensure consistent environments across nodes, simplify dependency management, and let you reproduce results reliably.


Running Your First Containerized Job

In this example, you’ll run a multi-node RCCL performance test using Pyxis. This verifies that your containerized environment can access GPUs, RDMA interfaces, and Slurm’s MPI orchestration.
  1. Create a new job script named rccl-pyxis.sbatch:

#!/bin/bash
#SBATCH --job-name=rccl_multi_node
#SBATCH --output=results/rccl_multi_node-%j.out
#SBATCH --error=results/rccl_multi_node-%j.out
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH -N4

CONTAINER_IMAGE='tensorwavehq/pytorch-bnxt:rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2'

export NCCL_IB_QPS_PER_CONNECTION=2
export NCCL_BUFFSIZE=8388608
export UCX_NET_DEVICES=eno0

# Minimize unnecessary logs when running with Pyxis
export OMPI_MCA_btl=^openib
export PMIX_MCA_gds=hash
export UCX_WARN_UNUSED_ENV_VARS=n

srun --mpi=pmix \
  --container-writable \
  --container-name=rccl-pyxis-run \
  --container-image=${CONTAINER_IMAGE} \
  /usr/local/bin/rccl-tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1

  2. Submit the job to Slurm:

sbatch --nodes=<number-of-nodes> rccl-pyxis.sbatch

  3. Monitor progress:

squeue -u $USER

Once complete, your results will appear under the results/ directory, with each job’s output and error logs named using the Slurm job ID.
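
To follow a run while it executes, you can tail its log directly (substitute the job ID reported by sbatch or squeue):

# Follow the combined stdout/stderr log for the job
tail -f results/rccl_multi_node-<job-id>.out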


Using a Pre-Staged SquashFS Image

Instead of pulling a container from a registry, you can point Slurm directly to a pre-staged SquashFS (.sqsh) image. This is often faster and preferred for large models or shared environments.

Example:

# You can set the image in the previous example to a local .sqsh file
CONTAINER_IMAGE='tensorwavehq+pytorch-bnxt+rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2.sqsh'

Using a local .sqsh file avoids repeated network pulls and ensures consistent environments across jobs.
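
To pre-stage such an image in the first place, you can import it once to shared storage and reference it by absolute path (a sketch; /shared/images is a placeholder for whatever filesystem your nodes mount):

# Pull the image one time and write it out as a SquashFS file on shared storage
enroot import -o /shared/images/pytorch-bnxt-rocm6.4.2.sqsh \
  docker://tensorwavehq/pytorch-bnxt:rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2

# Then point your job script at the local file
CONTAINER_IMAGE='/shared/images/pytorch-bnxt-rocm6.4.2.sqsh'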


Pyxis Flags

Pyxis extends Slurm with several container-related flags that control how your job interacts with the container environment. Below are the most commonly used options:

  • --container-image: Specifies the container to run. Accepts Docker/OCI URLs or local .sqsh images.

  • --container-writable: Makes the container filesystem writable during execution. Useful for logs, checkpoints, or temporary files.

  • --container-mounts=/src:/dst[,/src2:/dst2]: Binds local or shared directories into the container. Multiple mounts can be separated by commas.

  • --container-workdir=/path: Sets the working directory inside the container (defaults to /).

  • --container-name=<name>: Assigns a name to the running container instance, useful for debugging or monitoring.
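
Putting several of these flags together, a typical containerized launch might look like the following (a sketch; the image path, mount sources, and training script are placeholders):

srun -N1 --gpus-per-node=8 --mpi=pmix \
  --container-image=/shared/images/pytorch-bnxt-rocm6.4.2.sqsh \
  --container-mounts=$PWD:/workspace,/shared/datasets:/data \
  --container-workdir=/workspace \
  --container-writable \
  python3 train.py --data-dir /data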


Learn More

For advanced configuration options and the full list of supported flags, see the official containers documentation from SchedMD: https://slurm.schedmd.com/containers.html
