Debugging Workloads and Interactive Jobs

Often, dropping into a shell is the easiest way to hammer out development and debugging work. To get shell access to a node, use a combination of the salloc and srun commands.

user@slurm-login-skip-8566547b9c-zdznk:~$ salloc -N 1 --gpus-per-node=8
salloc: Granted job allocation 269
salloc: Waiting for resource configuration
salloc: Nodes tus1-p13-g41 are ready for job
user@slurm-login-skip-8566547b9c-zdznk:~$ srun --pty bash
user@tus1-p13-g41:~$ # <== bash shell on a worker node

salloc reserves a resource allocation. It creates a sub-shell tied to a SLURM allocation and allows users to call srun multiple times against the same set of resources. This provides much faster iteration than calling srun or sbatch from a login node and waiting for a new resource allocation each time, especially when debugging multi-node jobs.
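For example, a typical iteration loop inside the salloc sub-shell might look like the following (train.py is a placeholder for your own workload, not a script provided by the cluster):

salloc -N 2 --gpus-per-node=8        # reserve two GPU nodes once
srun -N 2 nvidia-smi -L              # runs immediately on the reserved nodes
srun -N 2 python train.py            # iterate without waiting for a new allocation
exit                                 # leave the sub-shell and release the allocation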

srun --pty bash launches a bash shell on the allocated worker node. The --pty flag allocates a pseudo-terminal, forwarding stdin, stdout, and stderr to your terminal. It is also useful for srun commands with complex output, such as docker build or Python tools that render tqdm progress bars.
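As a concrete illustration, the commands below show --pty used for interactive and progress-bar-heavy workloads inside an allocation (the script and image names are placeholders):

srun --pty nvidia-smi                   # interactive output renders as on a local terminal
srun --pty python train.py              # tqdm progress bars update in place instead of spamming lines
srun --pty docker build -t my-image .   # build progress output displays cleanly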
