
# Slurm Overview

TensorWave Managed Slurm is a production Slurm cluster deployed on the same Kubernetes infrastructure as your other GPU workloads. TensorWave operates and supports the cluster; your teams continue to use the same commands they always have: `sbatch`, `squeue`, `salloc`, and `srun`. We manage the Slurm control plane, compute and login images, networking and storage integration, and upgrades, removing the need to run a separate Slurm fleet alongside your Kubernetes cluster.

***

### Slurm on Kubernetes

Many organizations use **Kubernetes** for services, inference, CI, and other cloud-native workloads, and **Slurm** for large batch jobs, MPI, and distributed training. These are complementary scheduling models, not alternatives to each other.

The traditional approach requires operating them as two separate systems: independent node imaging pipelines, separate monitoring stacks, and custom integration between a Kubernetes fleet and a standalone Slurm deployment running on the same physical hardware.

**Slurm on Kubernetes** eliminates that separation. The Slurm cluster is deployed on the same Kubernetes cluster as your other workloads, and batch users continue to submit jobs through the standard Slurm interface. Kubernetes does not replace Slurm for batch scheduling; it is the deployment and operations layer for the Slurm controller, login pods, and workers, using the same storage, secrets, and observability patterns applied to everything else on the cluster.

|                       | Standalone Slurm                                  | Slurm on Kubernetes                                            |
| --------------------- | ------------------------------------------------- | -------------------------------------------------------------- |
| **Compute**           | Separate machines or OS images outside Kubernetes | Slurm daemons running in Kubernetes pods on GPU nodes          |
| **Service workloads** | Separate cluster                                  | Same cluster, via Kubernetes                                   |
| **Node lifecycle**    | Bare-metal images and manual intervention         | Workers as versioned container images, managed by the operator |
| **Failure handling**  | Custom scripts and manual drain procedures        | Operator reconciliation integrated with Slurm drain and resume |
| **Observability**     | Separate tooling from the rest of the cluster     | Shared metrics and logging pipelines                           |

Batch and research users retain the Slurm interface they know. Platform teams retain Kubernetes for the workloads that belong there.

***

### How it works

The Slurm cluster consists of five components, each running as a workload on Kubernetes:

**Controller.** Runs `slurmctld`, which handles job submissions, scheduling, and node state. Configuration is distributed to clients automatically so login and compute pods always share a consistent `slurm.conf`.

**Login pods.** The entry point for SSH sessions and interactive use. Users submit jobs, inspect the queue, and (where policy permits) connect to workers allocated to their jobs.

**Worker pods.** Run `slurmd` on GPU and CPU hardware. Each worker is a container with the devices and network fabric the job requires, including RDMA NICs, local scratch, and access to shared storage.

**Operator.** Reconciles a declarative cluster specification (Kubernetes custom resources) against the live state of the deployment. Changes to capacity, images, or partitions are applied by updating the specification; the operator handles the rest.

**Storage.** Shared filesystems (including `/home` for users and `/mnt/*` for system resources) are mounted on both login and worker pods so that paths are consistent at submission time and at runtime. Worker scratch storage is ephemeral and sized for container runtimes and job-local I/O.

From a user perspective, `sinfo` and `scontrol` report standard Slurm node names. Operationally, workers are pods, running with host networking and the privileged device access that GPUs and RDMA require.
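
As a rough illustration of inspecting nodes the way Slurm reports them, the sketch below summarizes node states from `sinfo` output. The helper name `count_node_states` is hypothetical, and the script assumes it runs somewhere the Slurm client tools are installed, such as a login pod.

```shell
#!/usr/bin/env bash
# Sketch: summarize node states as Slurm reports them.

# Reads "<node> <state>" lines on stdin and prints "<state> <count>" per state.
count_node_states() {
    # sort -u drops repeats: with -N, a node is listed once per partition it belongs to.
    sort -u | awk '{ count[$2]++ } END { for (s in count) print s, count[s] }' | sort
}

# Run only where the Slurm client tools are installed (for example, a login pod).
if command -v sinfo >/dev/null 2>&1; then
    # -h: no header, -N: one line per node, %N: node name, %T: node state.
    sinfo -h -N -o '%N %T' | count_node_states
fi
```

For detail on a single node, `scontrol show node <name>` accepts any node name that `sinfo` prints.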

***

### What TensorWave manages

You receive a fully operational Slurm on Kubernetes deployment without building or maintaining the integration yourself. TensorWave provides:

* **Compute and login images.** Slurm 25.x, PMIx MPI, the GPU software stack, and (where enabled) Pyxis/Enroot and Apptainer, built and validated against your hardware profile.
* **Access.** LDAP-backed SSH with optional per-user login pods, and documented procedures for worker access within your security policy.
* **Accounting.** Job and resource usage records are stored via `slurmdbd` backed by a database. GPU usage is tracked per job, and associations, limits, and QOS policies are enabled.
* **Storage.** `/home` and shared module paths mounted consistently across login and compute pods.
* **Health checks.** Automated prolog and epilog checks, scheduled checks for deeper validation, and additional tools for node diagnostics.
* **Observability.** Slurm metrics (jobs, nodes, partitions, scheduler) are pushed to the TensorWave platform, where they are available in a Slurm dashboard. This covers job queue depth, node states, GPU utilization per job, and scheduler activity without any additional setup on your end.
* **Lifecycle management.** Coordinated upgrades, partition configuration, and change communication so maintenance does not catch users off guard.
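
Because accounting is backed by `slurmdbd`, per-job records can be read with `sacct`. The following is a minimal sketch, not a supported tool: the helper `gpus_from_tres` is hypothetical, and it assumes GPUs appear in the allocated TRES string under the generic `gres/gpu` name, which may differ on your cluster.

```shell
#!/usr/bin/env bash
# Sketch: report GPUs allocated per job from Slurm accounting records.

# Prints the untyped gres/gpu count from a TRES string, or 0 if there is none.
# Assumption: GPUs are recorded as "gres/gpu=<n>" in AllocTRES.
gpus_from_tres() {
    local tres="$1" n
    n=$(printf '%s\n' "$tres" | tr ',' '\n' \
        | sed -n 's|^gres/gpu=\([0-9][0-9]*\)$|\1|p' | head -n 1)
    echo "${n:-0}"
}

# Run only where the Slurm client tools are installed (for example, a login pod).
if command -v sacct >/dev/null 2>&1; then
    # -X: allocations only (no steps), -P: pipe-delimited, -n: no header.
    sacct -X -P -n --format=JobID,AllocTRES,State |
    while IFS='|' read -r jobid tres state; do
        echo "$jobid gpus=$(gpus_from_tres "$tres") state=$state"
    done
fi
```

With no time range given, `sacct` reports only recent jobs for the current user; add `--starttime` to look further back.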

***

### What stays the same

If your team already uses Slurm, the user-facing interface is unchanged:

* Job submission: `sbatch`, `srun`, `salloc`
* Cluster inspection: `sinfo`, `squeue`, `scontrol`
* MPI workloads, container-based jobs, and environment modules where installed
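
To make the commands above concrete, here is a minimal sketch of writing and submitting a batch job. The job name, node count, and time limit are placeholder values, not TensorWave defaults; no partition is set, so the cluster default applies.

```shell
#!/usr/bin/env bash
# Sketch only: resource values below are placeholders, not TensorWave defaults.

job_script="hello.sbatch"

# Write a minimal batch script. The quoted heredoc delimiter keeps the
# contents literal, so nothing is expanded until the job runs.
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:05:00
#SBATCH --output=%x-%j.out

# Each task prints the node it landed on.
srun hostname
EOF

# Submit only where the Slurm client tools are installed (for example, a login pod).
if command -v sbatch >/dev/null 2>&1; then
    sbatch "$job_script"
    squeue --me
else
    echo "sbatch not found; wrote $job_script without submitting"
fi
```

In the `--output` pattern, `%x` expands to the job name and `%j` to the job ID, so each run writes its own log file.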

For standard Slurm behavior and command reference, the [official Slurm documentation](https://slurm.schedmd.com/documentation.html) remains authoritative. The documentation here covers what is specific to your TensorWave deployment: access setup, storage layout, container support, health checks, and support scope.

***

### Before you begin

A few characteristics of this architecture are worth understanding before you start:

* **Worker pods are containers.** They are not long-lived OS instances. Local scratch is ephemeral and will not persist across pod restarts.
* **Privileged device access is required.** GPU access, RDMA, and most container runtimes on worker pods require elevated privileges. This is standard for GPU workloads on Kubernetes and should be accounted for in your security review.
* **Network and topology configuration is site-specific.** NIC selection, NCCL/RCCL tuning, and multi-node topology settings depend on your hardware. Your deployment guide contains the values for your cluster.
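
Since local scratch does not survive a pod restart, one common pattern is to copy results to shared storage before the job script exits. The sketch below shows the idea under stated assumptions: `stage_out`, `SCRATCH_DIR`, and `RESULTS_DIR` are hypothetical names, and the real scratch and shared paths come from your deployment guide.

```shell
#!/usr/bin/env bash
# Sketch: copy results out of ephemeral scratch before the job script exits.
# SCRATCH_DIR and RESULTS_DIR are hypothetical; use the paths for your cluster.

stage_out() {
    local src="$1" dst="$2"
    # Nothing to do if the job never created the scratch directory.
    [ -d "$src" ] || return 0
    mkdir -p "$dst"
    # "src/." copies the directory contents, including dotfiles.
    cp -a "$src/." "$dst/"
}

SCRATCH_DIR="${SCRATCH_DIR:-$(mktemp -d)}"
RESULTS_DIR="${RESULTS_DIR:-$HOME/results/${SLURM_JOB_ID:-manual}}"

# Run the copy when the script exits, whether it succeeded or failed.
trap 'stage_out "$SCRATCH_DIR" "$RESULTS_DIR"' EXIT

# Stand-in for real work that writes to scratch.
echo "checkpoint data" > "$SCRATCH_DIR/checkpoint.txt"
```

An `EXIT` trap does not run if the shell is killed outright, so long jobs may also want to copy checkpoints periodically rather than only at the end.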

***

### Where to go next

| Topic                 | Description                                |
| --------------------- | ------------------------------------------ |
| Accessing the cluster | LDAP, SSH, login layout, worker access     |
| Storage               | Shared home, scratch, shared project space |
| Running Jobs          | First `sinfo` / `sbatch`, validation jobs  |
| Health and monitoring | Checks, drains, metrics, `sdebug`          |
| Prolog and epilog     | Custom hooks and customer scripts          |
| Common Issues         | FAQ and patterns often used with Slurm     |

***

*Hostnames, filesystem names, and cluster-specific configuration are documented in your deployment guide.*

