Slurm Quickstart

Slurm provides a multi-tenant framework for managing compute resources and jobs that span large clusters.

Overview

TensorWave Slurm combines the power of Slurm, the industry-standard workload manager for HPC and AI, with the flexibility of a Kubernetes-native orchestration layer. This integration delivers a modern, multi-tenant environment that scales seamlessly across AMD Instinct GPU clusters—enabling teams to run distributed training, fine-tuning, and simulation workloads without managing the underlying infrastructure.

With TensorWave, you get the familiar Slurm interface running on top of a cloud-native control plane that provides automated scheduling, easy scaling, and container-based execution.


Why Slurm on Kubernetes?

Traditional Slurm deployments were designed for static on-prem clusters. TensorWave modernizes that model by running Slurm inside Kubernetes, unlocking:

  • Scalable compute pools — resize your Slurm cluster on demand within your Kubernetes environment.

  • Container-native workflows — integrate directly with your existing Docker or Enroot environments.

  • Multi-tenant isolation — each user or team runs in a secure namespace with defined resource limits.

The result is a unified, cloud-native scheduling experience that bridges HPC scalability with Kubernetes reliability.


Quickstart Example

1. Connect to Your Login Node

Each Slurm environment provides a login node, your interactive entry point for running Slurm commands. Behind the scenes, this login node runs as a managed Kubernetes pod, offering the same Slurm interface (srun, sinfo, sbatch) alongside the benefits of cloud-native orchestration.

ssh <username>@<slurm-login-endpoint>

Once connected, you’ll have access to all standard Slurm utilities and your team’s partitions. From here, you can submit jobs, monitor queues, and launch multi-node workloads just as you would on a traditional HPC cluster.
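As a sketch of a first job submission, a minimal batch script might look like the following. The partition name and script filename are placeholders, not values from this environment; substitute the partitions your team actually sees in sinfo:

```shell
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=<your-partition>   # replace with a partition listed by sinfo
#SBATCH --nodes=1
#SBATCH --time=00:05:00
#SBATCH --output=hello_%j.out

# Print the hostname of the allocated node to confirm the job ran
srun hostname
```

Save it as hello.sh, submit it with `sbatch hello.sh`, and watch it in the queue with `squeue --me`.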


2. Inspect Available Resources

List available partitions and node states:

Example:
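The standard command for this is sinfo. The output below is illustrative only; your partition names, node counts, and states will differ:

```shell
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*        up   infinite      4   idle node[01-04]
```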

Even though these nodes are dynamically managed by Kubernetes, the Slurm CLI remains identical to traditional HPC clusters.


3. Launch a Multi-Node Job

To verify connectivity and RDMA functionality, run a distributed RCCL test across four nodes (32 GPUs total):
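A sketch of such a launch, assuming the rccl-tests binaries (for example all_reduce_perf) are available on the nodes or in your container image; the binary path and the --mpi plugin are assumptions about your environment, not guarantees:

```shell
# 4 nodes x 8 GPUs = 32 GPUs total; one task per GPU
srun --nodes=4 --ntasks-per-node=8 --gpus-per-node=8 \
     --mpi=pmix \
     ./all_reduce_perf -b 8 -e 8G -f 2 -g 1
```

Here -b and -e set the minimum and maximum message sizes, -f the size multiplier between steps, and -g the number of GPUs driven per task.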

Slurm automatically handles:

  • GPU and node allocation

  • Network interface binding

  • MPI coordination

Containers and More Info

Users often want to run their HPC workloads in containers. You can learn more about this in the Pyxis Quickstart section.
