
Custom Prolog and Epilog scripts

Slurm's prolog and epilog features let cluster admins specify scripts that run on each node before a job's first step (prolog) and at job termination (epilog). Because these scripts run with every job, they give admins a non-disruptive way to perform tasks such as system monitoring or cleanup at fairly frequent intervals.

TensorWave's managed Slurm solution provides /mnt/customer/prolog.d and /mnt/customer/epilog.d for cluster admins to add their custom prolog/epilog scripts, respectively. Per-node summaries of custom prolog/epilog script runs are logged in /mnt/customer/logs.

Prolog and epilog scripts are a powerful tool for managing a cluster, but there are also a few easy ways to 'shoot yourself in the foot' with them. If a prolog or epilog script returns a non-zero exit code, the node is placed in the DRAIN state, so a buggy script can drain every node it runs on and effectively bring down the entire cluster. Prolog and epilog scripts run as the root user. This provides broad access for system monitoring, but it also means a careless cleanup task can disrupt running jobs.
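One way to reduce the risk of an accidental drain is to keep best-effort work (for example, collecting metrics) from affecting the script's exit code. The following is a minimal sketch, not TensorWave or Slurm tooling: run_noncritical is a hypothetical helper name, and false stands in for a monitoring command that might fail.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a best-effort command, report a failure on stderr,
# and always return 0 so the failure cannot change this script's exit status.
run_noncritical() {
    if ! "$@"; then
        echo "WARN: '$*' failed; continuing" >&2
    fi
    return 0
}

# 'false' is a placeholder for a monitoring command that might fail.
run_noncritical false

# The script's exit status is that of its last command, which succeeds here.
echo "prolog checks finished for job ${SLURM_JOB_ID:-unknown}"
```

Checks that really should keep a node out of service, such as a failed hardware health check, can still exit non-zero on purpose. The wrapper is only meant for work whose failure should not drain the node.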

Test example of a custom prolog and epilog

In this example, we have a prolog script and an epilog script. Both scripts print a message to stdout. The epilog script exits with status 1 to simulate a failure.

tensorwave@tensorwave.com@slurm-login-skip-849dbcf5c-q7ffr:~$ sudo cat /mnt/customer/prolog.d/test-prolog-1.sh
#!/usr/bin/env bash
echo "Hello from job id $SLURM_JOB_ID on node $SLURM_NODENAME"

tensorwave@tensorwave.com@slurm-login-skip-849dbcf5c-q7ffr:~$ sudo cat /mnt/customer/epilog.d/test-epilog-1.sh
#!/usr/bin/env bash
echo "Goodbye from job id $SLURM_JOB_ID on node $SLURM_NODENAME"
exit 1
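Before copying a script into prolog.d or epilog.d, it can help to check it on the login node first. The sketch below assumes the script is safe to run outside a job, because the second step really executes it. check_script is a hypothetical helper name, and the placeholder job ID 0 is an assumption used only to fill in $SLURM_JOB_ID.

```shell
#!/usr/bin/env bash
# Hypothetical pre-deployment check; not part of Slurm or TensorWave tooling.
check_script() {
    local script="$1" rc=0

    # Step 1: parse the script without executing it (catches syntax errors).
    if ! bash -n "$script" 2>/dev/null; then
        echo "SYNTAX ERROR: $script"
        return 2
    fi

    # Step 2: actually run it with a placeholder job ID and capture the exit status.
    # Only do this for scripts that have no harmful side effects outside a job.
    SLURM_JOB_ID="${SLURM_JOB_ID:-0}" bash "$script" >/dev/null 2>&1 || rc=$?

    if [ "$rc" -ne 0 ]; then
        echo "FAIL (exit $rc): $script"
        return 1
    fi
    echo "OK: $script"
}
```

Run against the two scripts above, a check like this would be expected to pass test-prolog-1.sh and flag test-epilog-1.sh, since the latter always exits with status 1.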

To trigger the prolog and epilog, we submit a job with srun. Because test-epilog-1.sh exits with a non-zero status, the node the job ran on (tus1-p2-g6) is drained.

tensorwave@tensorwave.com@slurm-login-skip-849dbcf5c-q7ffr:~$ srun -N 1 --gpus-per-node=8 hostname
tus1-p2-g6

tensorwave@tensorwave.com@slurm-login-skip-849dbcf5c-q7ffr:~$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpuworker*    up   infinite      1  drain tus1-p2-g6
gpuworker*    up   infinite      1   idle tus1-p2-g5
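A drained node stays out of service until it is resumed by hand, which only makes sense once the failing script has been fixed or removed. The sketch below uses the standard Slurm commands sinfo -R and scontrol update with State=RESUME. It assumes your account is allowed to run scontrol through sudo on this cluster, which this page does not confirm. resume_node and DRY_RUN are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not TensorWave tooling: return a drained node to service.
# With DRY_RUN=1 it only prints the command instead of running it.
resume_node() {
    local node="$1"
    local cmd=(scontrol update "NodeName=${node}" State=RESUME)
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "${cmd[*]}"
    else
        sudo "${cmd[@]}"
    fi
}

# Usage (run manually on the login node):
#   sinfo -R                          # list drained/down nodes with the recorded reason
#   DRY_RUN=1 resume_node tus1-p2-g6  # print the command without running it
#   resume_node tus1-p2-g6            # actually resume the node
```

Printing the command first is a cheap way to confirm the node name before changing cluster state.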

The logs in /mnt/customer/logs show that the prolog ran successfully. Because the epilog failed, its full output is saved in the logs.

References

Slurm Prolog and Epilog Guide: https://slurm.schedmd.com/prolog_epilog.html
