
Jobs

Jobs are submitted and managed through standard Slurm commands. There are two primary modes:

  • Interactive (srun, salloc): allocate resources and run commands directly. Useful for development and debugging.

  • Batch (sbatch): submit a script that runs when resources become available. This is the standard approach for production workloads.

All jobs must request GPU resources explicitly. The default partition is configured with reasonable CPU-per-GPU and mem-per-CPU defaults, but these can be overridden per job. For command reference, see the official Slurm documentation.


srun

srun is a multi-node job launcher. When run from a login pod, it allocates resources, launches the command directly, and waits for it to complete. When run inside a salloc session, it launches the command on the resources already allocated. Both are useful for quick one-off commands and interactive debugging.

When called from within an sbatch script, srun executes a job step against the resources already allocated to the batch job. Multiple srun calls can be sequenced in a single script to perform complex multi-step workloads.

Run a command on one node with 8 GPUs:

srun -N 1 --gpus-per-node=8 amd-smi

Open an interactive shell:

srun -N 1 --gpus-per-node=8 --pty bash

Run a command across multiple nodes:

srun -N 4 --gpus-per-node=8 hostname

srun inside a batch script or salloc session inherits resources from the allocation and runs across all of them by default. Resource flags can be used to target a subset of the allocated resources, but requesting more than was allocated will cause an error.
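The inheritance behaviour described above can be sketched with a small batch script. This is a minimal illustration, not a cluster-provided example: the filename steps.sh and the node and GPU counts are assumptions chosen for the sketch.

```shell
# Write a hypothetical two-node job script (steps.sh is an illustrative name).
cat > steps.sh <<'EOF'
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8

# Step 1: no resource flags, so the step spans the whole allocation.
srun hostname

# Step 2: resource flags narrow the step to a subset of the allocation
# (one task on one node, using 4 of that node's GPUs).
srun -N 1 -n 1 --gpus-per-node=4 amd-smi

# Asking for more than the job holds (for example, srun -N 4 here)
# fails with an error; it does not grow the allocation.
EOF
# Submit on the cluster with: sbatch steps.sh
```

The steps run one after another, since each srun call waits for its command to complete before the script continues.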


Batch Jobs

Batch jobs are submitted with sbatch and run when the scheduler grants the allocation. Resource requests, environment setup, and the actual workload are all defined in the job script. Any #SBATCH directive in the script can also be overridden at submission time by passing the corresponding flag directly to sbatch.

Basic structure
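As a minimal sketch of the three parts named above (resource requests, environment setup, workload), a job script might look like the following. The filename job.sh, the job name, and the time limit are assumptions for illustration; adjust them to the workload.

```shell
# Write a hypothetical single-node job script (job.sh is an illustrative name).
cat > job.sh <<'EOF'
#!/bin/bash
# Resource requests: read by sbatch at submission time.
#SBATCH --job-name=example
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --time=00:10:00
# %j expands to the job ID; this matches Slurm's default output name.
#SBATCH --output=slurm-%j.out

# Environment setup would go here (exports, modules, virtual environments).

# Workload: a single job step on the allocated node.
srun amd-smi
EOF
```

The #SBATCH lines are ordinary comments to bash, so the script stays a valid shell script; sbatch only reads the directives that appear before the first executable command.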

Submit with:

Check status:

View output once the job runs:
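The submit, status, and output steps above map onto sbatch, squeue, and the job's output file. The helper below is a hedged sketch that chains them; submit_job is a hypothetical function name, not a Slurm or cluster command, and it assumes the script does not redirect output away from the default slurm-<jobid>.out.

```shell
# Hypothetical helper (not part of Slurm): submit, show status, print log name.
submit_job() {
  local jobid
  # Submit: --parsable prints just the job ID (optionally "id;cluster").
  # Flags placed before the script name override its #SBATCH directives.
  jobid="$(sbatch --parsable "$@")" || return 1
  jobid="${jobid%%;*}"
  # Check status of this job only.
  squeue -j "$jobid"
  # Default sbatch output file, relative to the submission directory.
  echo "slurm-${jobid}.out"
}
```

For example, `submit_job --time=00:30:00 job.sh` would submit a script named job.sh with a shorter time limit than the script requests; once the job has started, the printed file can be read with cat or followed with tail -f.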

Common resource flags

Flag                 Description
--nodes / -N         Number of nodes
--ntasks-per-node    Number of tasks (processes) per node
--gpus-per-node      GPUs per node
--cpus-per-task      CPU cores per task
--time               Wall-clock time limit (HH:MM:SS)
--nodelist           Run on specific nodes

Multi-node distributed training example

The pattern below works for PyTorch torch.distributed.run (torchrun) across multiple nodes. The first node allocated by Slurm acts as the rendezvous host.
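A minimal sketch of this pattern, assuming one torchrun launcher per node with 8 worker processes each, might look like the following. The script name train_job.sh, the training entry point train.py, the port 29500, and the time limit are all assumptions for illustration; the scripts shipped on the cluster may differ.

```shell
# Write a hypothetical two-node torchrun job script.
cat > train_job.sh <<'EOF'
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --time=01:00:00

# The first hostname in the allocation acts as the rendezvous host.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# One torchrun per node; each spawns 8 workers (one per GPU).
# train.py and port 29500 are placeholders.
srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_id="$SLURM_JOB_ID" \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${head_node}:29500" \
  train.py
EOF
# Submit on the cluster with: sbatch train_job.sh
```

Using the job ID as the rendezvous ID keeps concurrent jobs from joining each other's rendezvous, and reading the node count from SLURM_NNODES means the script does not need editing when --nodes is overridden at submission time.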

Example scripts are available on the cluster under /opt/tw/examples/libexec/.


MPI (PMIx)

The cluster uses PMIx as the default MPI launch interface (MpiDefault=pmix). Open MPI and other PMIx-compatible MPI libraries work with srun without needing mpirun.

The following environment is set cluster-wide and applied automatically to all login and compute sessions:

Running an MPI job

Use srun directly. Slurm handles process launch and PMIx initialization:
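As a minimal sketch, assuming a hypothetical MPI binary ./my_mpi_app and a script name of mpi_job.sh (both illustrative), a two-node batch job with 8 ranks per node might look like:

```shell
# Write a hypothetical two-node MPI job script.
cat > mpi_job.sh <<'EOF'
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

# srun starts one MPI rank per task (16 in total). PMIx is the cluster
# default, so no --mpi flag and no mpirun wrapper are needed.
srun ./my_mpi_app
EOF
# Submit on the cluster with: sbatch mpi_job.sh
```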

We recommend using srun rather than mpirun or mpiexec, as it integrates directly with Slurm's process placement and PMIx initialization.

RCCL collective communication

For GPU collective benchmarks and validation, RCCL tests are pre-installed on compute nodes. A minimal all-reduce test across 2 nodes:
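A hedged sketch of such a test follows. It assumes the rccl-tests binary all_reduce_perf is on PATH on the compute nodes and was built with MPI support; the actual install location on this cluster is not stated here, and the script name rccl_allreduce.sh and message-size range are illustrative.

```shell
# Write a hypothetical two-node RCCL all-reduce job script.
cat > rccl_allreduce.sh <<'EOF'
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

# Assumption: all_reduce_perf (rccl-tests) is on PATH and MPI-enabled.
# -b / -e: smallest and largest message size, -f: size multiplier per step,
# -g: GPUs driven by each process (one per rank here).
srun all_reduce_perf -b 8 -e 1G -f 2 -g 1
EOF
# Submit on the cluster with: sbatch rccl_allreduce.sh
```

With one GPU per rank, the 16 ranks across the two nodes form a single communicator, so the reported bandwidth includes the inter-node path rather than only traffic within a node.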


For containerized jobs (Pyxis/Enroot, Apptainer) and environment modules, see Containers and Modules.
