
# Jobs

### Overview

Jobs on your cluster are scheduled and run using the Slurm workload manager. Resources can be allocated through two main mechanisms:

* **Interactive** (`salloc`): allocate resources and run commands directly, useful for development and debugging.
* **Batch** (`sbatch`): submit a script that runs when resources become available, the standard approach for production workloads.

Within an allocation you can then use the `srun` command to execute job steps and further distribute work across your allocation.

For command reference, see the [official Slurm documentation](https://slurm.schedmd.com/documentation.html).

***

### Inspecting the Cluster

Before submitting jobs, it helps to know how to inspect the state of the cluster and the queue.

#### [`sinfo`](https://slurm.schedmd.com/sinfo.html)

`sinfo` is the basic tool for inspecting the state of the cluster's partitions and nodes:

```bash
sinfo
```
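For scripting, `sinfo` output can be reduced to a per-state node count. The sketch below assumes the node-oriented format `sinfo -h -N -o "%N %T"`, which prints one `name state` pair per line; `count_nodes_by_state` is a hypothetical helper name, not part of Slurm.

```bash
# Hypothetical helper: count nodes per state from "name state" lines on stdin.
# `sort -u` drops duplicates, since a node in several partitions is listed once per partition.
count_nodes_by_state() {
  sort -u | awk '{ count[$2]++ } END { for (s in count) print s, count[s] }' | sort
}

# On the cluster, feed it live data (guarded so the snippet is a no-op elsewhere).
if command -v sinfo >/dev/null 2>&1; then
  sinfo -h -N -o "%N %T" | count_nodes_by_state
fi
```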

#### [`squeue`](https://slurm.schedmd.com/squeue.html)

`squeue` shows the state of the job queue. Use the `-u` flag to list only your own pending and running jobs.

```bash
squeue -u $USER
```
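To pick out jobs in a particular state, the output can be narrowed with a format string. This is a minimal sketch assuming the `%i` (job ID) and `%T` (job state) format fields; `jobs_in_state` is a hypothetical helper name.

```bash
# Hypothetical helper: print the IDs of jobs in a given state from "id state" lines on stdin.
jobs_in_state() {
  awk -v want="$1" '$2 == want { print $1 }'
}

# On the cluster, list your pending jobs (guarded so the snippet is a no-op elsewhere).
if command -v squeue >/dev/null 2>&1; then
  squeue -h -u "$USER" -o "%i %T" | jobs_in_state PENDING
fi
```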

***

### Resource Specification

Though there are some differences in how resources are specified with the `salloc`, `sbatch`, and `srun` commands, the common directives are generally the same.

| Flag                | Description                          |
| ------------------- | ------------------------------------ |
| `--nodes` / `-N`    | Number of nodes                      |
| `--ntasks-per-node` | Number of tasks (processes) per node |
| `--gpus-per-node`   | GPUs per node                        |
| `--cpus-per-task`   | CPU cores per task                   |
| `--time`            | Wall-clock time limit (`HH:MM:SS`)   |
| `--nodelist`        | Run on specific nodes                |

> This list is not comprehensive; see the Slurm documentation for the full man pages.

All jobs must request GPU resources explicitly. Reasonable defaults are established to subdivide CPU cores and memory based on the GPU allocation, but these can be overridden per job.
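When sizing a request, it can help to multiply out the per-node flags from the table above. The sketch below is plain shell arithmetic; `cpus_per_node` and `total_gpus` are hypothetical helper names, and the example values are illustrative rather than recommended settings.

```bash
# Hypothetical helpers: derive totals from the per-node flags.
cpus_per_node() { echo $(( $1 * $2 )); }   # ntasks-per-node * cpus-per-task
total_gpus()    { echo $(( $1 * $2 )); }   # nodes * gpus-per-node

# Example request: --nodes=2 --ntasks-per-node=8 --cpus-per-task=16 --gpus-per-node=8
echo "CPU cores per node: $(cpus_per_node 8 16)"
echo "GPUs in the job: $(total_gpus 2 8)"
```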

### Job Steps with [`srun`](https://slurm.schedmd.com/srun.html)

Both allocation methods described below rely on job steps, so it is useful to understand how steps work and how they can make full use of an allocation.

By default, an `srun` command inherits the job's entire resource allocation. This suits monolithic jobs, but `srun` can also subdivide an allocation among multiple tasks. The following examples show ways to use `srun` from within an allocation:

```bash
# Single node, single task.
srun -n 1 -N 1 --gpus-per-node=8 python task.py

# Two nodes, two running instances of a single task.
srun -n 2 -N 2 --gpus-per-node=8 python task.py

# Two nodes, two different tasks running in parallel.
srun -n 1 -N 1 --gpus-per-node=8 --exclusive python task1.py &
srun -n 1 -N 1 --gpus-per-node=8 --exclusive python task2.py &
wait
```
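The parallel pattern in the last example generalizes to any number of scripts. The sketch below starts one single-node step per script and reports failure if any step fails. `run_steps` and the `LAUNCHER` override are hypothetical names introduced only so the loop can be dry-run without Slurm; the `srun` flags are the ones used above.

```bash
# Launch one single-node step per script in parallel, then wait for each of them.
# LAUNCHER is a hypothetical override; it is left unquoted so "echo srun" splits into two words.
run_steps() {
  local launcher="${LAUNCHER:-srun}" script pid pids="" failed=0
  for script in "$@"; do
    $launcher -n 1 -N 1 --gpus-per-node=8 --exclusive python "$script" &
    pids="$pids $!"
  done
  for pid in $pids; do
    wait "$pid" || failed=1   # remember any step that exited non-zero
  done
  return "$failed"
}

# Dry run: print the srun commands instead of executing them.
LAUNCHER="echo srun" run_steps task1.py task2.py
```

Inside a real allocation, calling `run_steps task1.py task2.py` without the override would invoke `srun` directly.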

### Interactive Allocation with [`salloc`](https://slurm.schedmd.com/salloc.html)

To launch an interactive allocation within Slurm, use the `salloc` command.

```bash
# Allocate 1 node with 8 GPUs
salloc -N 1 --gpus-per-node=8
```

The `salloc` session holds the allocation open. When you exit the `salloc` shell, the allocation is released and the node becomes available to other jobs.
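Slurm exports variables such as `SLURM_JOB_ID` and `SLURM_JOB_NODELIST` into the shell that `salloc` starts, which gives a quick way to check whether a shell is still inside an allocation. A minimal sketch; `in_allocation` is a hypothetical helper name.

```bash
# Hypothetical helper: succeed only when the shell is inside a Slurm allocation.
in_allocation() {
  [ -n "${SLURM_JOB_ID:-}" ]
}

if in_allocation; then
  echo "job ${SLURM_JOB_ID} on nodes ${SLURM_JOB_NODELIST:-unknown}"
else
  echo "not inside an allocation"
fi
```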

For an interactive shell directly on the worker pod, you can add the following `srun`:

```bash
salloc -N 1 --gpus-per-node=8 srun --interactive --pty bash -l
```

To drop into an interactive shell inside a container, use `apptainer shell` with `srun` (don't forget the `--pty` flag):

```bash
srun -N 1 --gpus-per-node=8 --pty \
  apptainer shell docker://rocm/pytorch:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0
```

This allocates a worker pod, pulls the container image (or reuses a cached copy), and drops you into an interactive shell inside the container with GPUs available. If you have already pulled the image, use the local `.sif` file instead of a `docker://` URI. For more on building and running Apptainer images, including multi-node jobs and networking, see Containers and Modules.

### Batch Allocation with [`sbatch`](https://slurm.schedmd.com/sbatch.html)

Batch jobs are submitted with `sbatch` and run when the scheduler grants the allocation. Resource requests, environment setup, and the actual workload are all defined in the job script. Any `#SBATCH` directive in the script can also be overridden at submission time by passing the corresponding flag directly to `sbatch`.

#### Basic structure

```bash
#!/bin/bash
#SBATCH --job-name=my-job
#SBATCH --output=jid-%j.name-%x.log
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=96
#SBATCH --gpus-per-node=8
#SBATCH --time=01:00:00

srun my-program <my-program-args>
```

Submit with:

```bash
sbatch my-job.sh
```

View output once the job runs:

```bash
tail -f jid-<jobid>.name-my-job.log
```
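Submission and log lookup can be chained in a script. The sketch below assumes `sbatch --parsable`, which prints only the job ID (followed by `;cluster` on multi-cluster setups), and the `--output` pattern from the script above, where `%j` is the job ID and `%x` the job name. `parse_job_id` and `log_name` are hypothetical helper names.

```bash
# Hypothetical helper: strip the optional ";cluster" suffix from `sbatch --parsable` output.
parse_job_id() {
  printf '%s\n' "${1%%;*}"
}

# Hypothetical helper: expand the pattern jid-%j.name-%x.log by hand.
log_name() {
  printf 'jid-%s.name-%s.log\n' "$1" "$2"
}

# On the cluster, submit and report the log file (guarded so the snippet is a no-op elsewhere).
if command -v sbatch >/dev/null 2>&1; then
  job_id=$(parse_job_id "$(sbatch --parsable my-job.sh)")
  echo "submitted ${job_id}; log: $(log_name "$job_id" my-job)"
fi
```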

#### Multi-node distributed training example

The pattern below launches PyTorch's `torch.distributed.run` (`torchrun`) across multiple nodes. The first node allocated by Slurm acts as the rendezvous host.

```bash
#!/bin/bash
#SBATCH --job-name=ddp-training
#SBATCH --output=jid-%j.name-%x.log
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --gpus-per-node=8
#SBATCH --time=04:00:00

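GPUS_PER_NODE=8
# The batch script itself runs on the first allocated node, so its hostname is the rendezvous host.
MASTER_ADDR=$(hostname)
MASTER_PORT=6000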

srun bash <<EOF
LOCAL_ADDR=\$(hostname)
REMOTE_ADDR=${MASTER_ADDR}
IS_HOST=0
if [ "\$LOCAL_ADDR" == "${MASTER_ADDR}" ]; then
  IS_HOST=1
  REMOTE_ADDR=localhost
fi

export OMP_NUM_THREADS=8

python -u -m torch.distributed.run \
  --nproc_per_node $GPUS_PER_NODE \
  --nnodes $SLURM_NNODES \
  --rdzv_endpoint \${REMOTE_ADDR}:${MASTER_PORT} \
  --rdzv_backend c10d \
  --rdzv_id=1 \
  --rdzv_conf=is_host=\$IS_HOST \
  --local_addr "\$(hostname)" \
  train.py
EOF
```
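The script above depends on how an unquoted heredoc treats variables: an unescaped `$VAR` is expanded once by the batch shell on the first node, while an escaped `\$VAR` is passed through literally and expanded by each `bash` process that `srun` starts. The sketch below shows the same behavior without Slurm, with plain `bash` standing in for `srun bash`; `OUTER` and `INNER` are illustrative names.

```bash
# Set in the outer shell only (not exported), like MASTER_ADDR in the batch script.
OUTER=from-outer-shell

# Plain `bash` stands in for `srun bash`; the heredoc expansion rules are the same.
result=$(bash <<EOF
INNER=from-inner-shell
echo "unescaped: ${OUTER}"
echo "escaped: \${INNER}"
EOF
)
echo "$result"
```

The first line of output carries the value substituted by the outer shell, and the second carries the value resolved inside the inner shell. This is why `\$(hostname)` in the training script yields each node's own hostname while `${MASTER_ADDR}` is identical everywhere.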

Example scripts are available on the cluster under `/opt/tw/examples/libexec/`.

### SSH

Although we recommend the mechanisms above for submitting work, you can also SSH into any node in your allocation for debugging.

```bash
# List your jobs with their allocated nodes (pending jobs show the wait reason instead).
squeue -u $USER -o "%i %R"

# SSH to the node
ssh <node>
```

***

### MPI (PMIx)

The cluster uses **PMIx** as the default MPI launch interface (`MpiDefault=pmix`). Open MPI and other PMIx-compatible MPI libraries work with `srun` without needing `mpirun`.

The following environment variables are set cluster-wide and applied automatically to all login and compute sessions:

```bash
OMPI_MCA_btl_tcp_if_include=eno0,eno1   # Route TCP over front-end to avoid back-end contention
OMPI_MCA_btl=^openib                    # Disable legacy OpenIB transport
PMIX_MCA_gds=hash                       # Required PMIx GDS backend
```
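If a job overrides its environment, it may be worth confirming that these settings are still present before launching MPI. A minimal sketch; `missing_vars` is a hypothetical helper that prints the name of every listed variable that is unset or empty.

```bash
# Hypothetical helper: print the names of the given variables that are unset or empty.
missing_vars() {
  local name value
  for name in "$@"; do
    eval "value=\${$name:-}"          # portable indirect lookup of the variable called $name
    [ -n "$value" ] || echo "$name"
  done
}

# Prints nothing when all three cluster-wide settings are in place.
missing_vars OMPI_MCA_btl_tcp_if_include OMPI_MCA_btl PMIX_MCA_gds
```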

#### Running an MPI job

Use `srun` directly. Slurm handles process launch and PMIx initialization:

```bash
#!/bin/bash
#SBATCH --job-name=mpi-job
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH --time=01:00:00

srun ./my-mpi-program
```

We recommend using `srun` rather than `mpirun` or `mpiexec`, as it integrates directly with Slurm's process placement and PMIx initialization.

#### RCCL collective communication

For GPU collective benchmarks and validation, RCCL tests are pre-installed on compute nodes. A minimal all-reduce test across 2 nodes:

```bash
#!/bin/bash
#SBATCH --job-name=rccl-test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH --time=00:05:00

export NCCL_IB_QPS_PER_CONNECTION=2
export NCCL_BUFFSIZE=8388608
export UCX_NET_DEVICES=eno0

srun /opt/rccl-tests/all_reduce_perf -b 512M -e 8G -f 2 -g 1
```
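In the command above, `-b` and `-e` set the smallest and largest message size, `-f` is the multiplication factor between steps, and `-g 1` assigns one GPU to each task, so 2 nodes with 8 tasks each exercise 16 GPUs. The sketch below lists the sizes such a sweep visits, assuming the `M` and `G` suffixes are binary units; `sweep_sizes_mib` is a hypothetical helper name.

```bash
# Hypothetical helper: list the sizes (in MiB) swept from MIN to MAX, multiplying by FACTOR.
sweep_sizes_mib() {
  local size="$1" max="$2" factor="$3"
  [ "$factor" -gt 1 ] || return 1     # a factor of 1 or less would never reach MAX
  while [ "$size" -le "$max" ]; do
    echo "$size"
    size=$(( size * factor ))
  done
}

# -b 512M -e 8G -f 2: from 512 MiB up to 8192 MiB, doubling each step.
sweep_sizes_mib 512 8192 2
```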

***

> For containerized jobs (Pyxis/Enroot, Apptainer) and environment modules, see Containers and Modules.

