# Slurm

### Overview

TensorWave Slurm combines the power of **Slurm**, the industry-standard workload manager for HPC and AI, with the flexibility of a **Kubernetes-native orchestration layer**.\
This integration delivers a modern, multi-tenant environment that scales seamlessly across AMD Instinct GPU clusters—enabling teams to run distributed training, fine-tuning, and simulation workloads without managing the underlying infrastructure.

With TensorWave, you get the familiar Slurm interface running on top of a cloud-native control plane that provides automated scheduling, easy scaling, and container-based execution.

***

### Why Slurm on Kubernetes?

Traditional Slurm deployments were designed for static on-prem clusters.\
TensorWave modernizes that model by running Slurm **inside Kubernetes**, unlocking:

* **Scalable compute pools** — resize your Slurm cluster within your Kubernetes environment.
* **Container-native workflows** — integrate directly with your existing Docker or Enroot environments.
* **Multi-tenant isolation** — each user or team runs in a secure namespace with defined resource limits.

The result is a unified, cloud-native scheduling experience that bridges HPC scalability with Kubernetes reliability.

***

### Quickstart Example

#### **1. Connect to Your Login Node**

Each Slurm environment provides a **login node**: your interactive entry point for running Slurm commands. Behind the scenes, the login node runs as a managed **Kubernetes pod**, exposing the same Slurm interface (`srun`, `sinfo`, `sbatch`) while adding the benefits of cloud-native orchestration.

```bash
ssh <username>@<slurm-login-endpoint>
```

Once connected, you’ll have access to all standard Slurm utilities and your team’s partitions.\
From here, you can submit jobs, monitor queues, and launch multi-node workloads just as you would on a traditional HPC cluster.
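Optionally, you can add a host alias to your local SSH configuration so the endpoint doesn't need retyping. A minimal sketch, using the placeholder endpoint and username from above:

```
# ~/.ssh/config (placeholder values: substitute your environment's endpoint and username)
Host tensorwave-slurm
    HostName <slurm-login-endpoint>
    User <username>
    ServerAliveInterval 60   # keep long interactive sessions alive
```

With this in place, `ssh tensorwave-slurm` connects you directly to the login node.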

***

#### 2. Inspect Available Resources

List available partitions and node states:

```bash
sinfo
```

Example output:

```
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpuworker*     up   infinite   1024   idle compute-[1-1024]
```

Even though these nodes are dynamically managed by Kubernetes, the Slurm CLI behaves exactly as it does on a traditional HPC cluster.
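To drill into a specific node, including its GPU (generic resource) inventory, you can query `scontrol`. A quick sketch, using the first node from the example output above (node names will vary by cluster):

```bash
scontrol show node compute-1
```

The output includes fields such as `Gres=gpu:8` along with the node's CPU, memory, and state details.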

***

#### 3. Launch a Multi-Node Job

To verify connectivity and RDMA functionality, run a distributed RCCL test across four nodes (32 GPUs total):

```bash
srun -N4 \
  --gpus-per-node=8 \
  /opt/rccl-tests/all_reduce_perf -g 8 -b 1G -e 16G -f 2
```

Slurm automatically handles:

* GPU and node allocation
* Network interface binding
* MPI coordination
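
For longer or repeatable runs, the same test can be submitted as a batch script with `sbatch`. A minimal sketch, assuming the `gpuworker` partition shown earlier and the same rccl-tests install path:

```bash
#!/bin/bash
#SBATCH --job-name=rccl-allreduce
#SBATCH --partition=gpuworker
#SBATCH --nodes=4
#SBATCH --gpus-per-node=8
#SBATCH --time=00:15:00
#SBATCH --output=rccl-%j.out

# One task per node; each task drives its 8 local GPUs
srun --ntasks-per-node=1 \
  /opt/rccl-tests/all_reduce_perf -g 8 -b 1G -e 16G -f 2
```

Submit it with `sbatch rccl_test.sh`, then track progress with `squeue -u $USER`.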

### Containers and More Info

Users often want to run their HPC workloads in containers. You can learn more in the Pyxis Quickstart section.
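As a preview, when the Pyxis plugin is enabled, `srun` gains container flags such as `--container-image`, letting you pull and run an OCI image directly. A hedged sketch (the `rocm/pytorch:latest` image is purely illustrative):

```bash
srun -N1 --gpus-per-node=8 \
  --container-image=rocm/pytorch:latest \
  python3 -c "import torch; print(torch.cuda.device_count())"
```

ROCm builds of PyTorch expose GPUs through the `torch.cuda` namespace, so on a full node this should print `8`.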

