Slurm Quickstart

Slurm provides a multi-tenant framework for managing compute resources and jobs that span large clusters.

Overview

TensorWave Slurm combines the power of Slurm, the industry-standard workload manager for HPC and AI, with the flexibility of a Kubernetes-native orchestration layer. This integration delivers a modern, multi-tenant environment that scales seamlessly across AMD Instinct GPU clusters—enabling teams to run distributed training, fine-tuning, and simulation workloads without managing the underlying infrastructure.

With TensorWave, you get the familiar Slurm interface running on top of a cloud-native control plane that provides automated scheduling, easy scaling, and container-based execution.


Why Slurm on Kubernetes?

Traditional Slurm deployments were designed for static on-prem clusters. TensorWave modernizes that model by running Slurm inside Kubernetes, unlocking:

  • Scalable compute pools — resize your Slurm cluster within your Kubernetes environment.

  • Container-native workflows — integrate directly with your existing Docker or Enroot environments.

  • Multi-tenant isolation — each user or team runs in a secure namespace with defined resource limits.

The result is a unified, cloud-native scheduling experience that bridges HPC scalability with Kubernetes reliability.


Quickstart Example

1. Connect to Your Login Node

Each Slurm environment provides a login node: your interactive entry point for running Slurm commands. Behind the scenes, this login node runs as a managed Kubernetes pod, exposing the same Slurm interface (srun, sinfo, sbatch) with the benefits of cloud-native orchestration.

ssh <username>@<slurm-login-endpoint>

Once connected, you’ll have access to all standard Slurm utilities and your team’s partitions. From here, you can submit jobs, monitor queues, and launch multi-node workloads just as you would on a traditional HPC cluster.
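For example, to monitor the queue you can list your own jobs with standard Slurm tooling (no TensorWave-specific flags needed):

squeue -u $USER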


2. Inspect Available Resources

List available partitions and node states:

sinfo

Example output:

PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpuworker*     up   infinite   1024   idle compute-[0-1023]

Even though these nodes are dynamically managed by Kubernetes, the Slurm CLI remains identical to traditional HPC clusters.
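If you also want to see per-node GPU resources, sinfo's output format can be customized with standard format specifiers (the exact GRES strings shown will depend on how your cluster is configured):

sinfo -N -o "%N %G %T"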


3. Launch a Multi-Node Job

To verify connectivity and RDMA functionality, run a distributed RCCL test across four nodes (32 GPUs total):

srun -N4 \
--mpi=pmix \
--ntasks-per-node=8 \
--gpus-per-node=8 \
--cpus-per-task=16 \
/usr/local/bin/rccl-tests/build/all_reduce_perf -b 32 -e 8g -f 2 -g 1

Slurm automatically handles:

  • GPU and node allocation

  • Network interface binding

  • MPI coordination

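The same test can also be submitted as a batch job instead of an interactive srun. A minimal sbatch script mirroring the resources above might look like the following (the script and output file names are placeholders; adjust paths and partition settings for your cluster):

#!/bin/bash
#SBATCH --job-name=rccl-allreduce
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH --output=rccl-%j.out

# Launch the same RCCL all-reduce test across the allocated nodes.
srun --mpi=pmix /usr/local/bin/rccl-tests/build/all_reduce_perf -b 32 -e 8g -f 2 -g 1

Submit it with:

sbatch rccl_test.sbatch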
Containers and More Info

Often, users will want to run their HPC payloads in containers. You can learn more about this in the Pyxis Quickstart section.
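As a brief preview, Pyxis extends srun with container options such as --container-image. A hypothetical single-node invocation might look like the following (the image name and command are placeholders; see the Pyxis Quickstart for supported options):

srun -N1 --gpus-per-node=8 \
--container-image=rocm/pytorch:latest \
python3 train.py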
