# Custom Prolog and Epilog scripts

Slurm's prolog and epilog features allow administrators to specify scripts that run on each allocated node before a job's first step (prolog) and at job termination (epilog). This enables admins to perform tasks like system monitoring or cleanup non-disruptively and at fairly frequent intervals.

TensorWave's managed Slurm solution provides `/mnt/customer/prolog.d` and `/mnt/customer/epilog.d` for cluster admins to add their custom prolog and epilog scripts, respectively. Per-node summaries of custom prolog/epilog script runs are logged in `/mnt/customer/logs`.

Prolog/epilog scripts are a powerful tool for managing a cluster, but they also offer a few easy ways to 'shoot yourself in the foot'. If a prolog/epilog script returns a non-zero exit code, Slurm places the node in the DRAIN state, so a single buggy script deployed cluster-wide can drain every node. Prolog/epilog scripts also run as root: this provides broad access for system monitoring, but a careless cleanup task can disrupt running jobs.
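One way to limit the blast radius is to swallow failures of non-critical tasks so they cannot drain the node. The sketch below illustrates that pattern; the `run_noncritical` helper and the `false` stand-in command are hypothetical, not part of the managed solution:

```bash
#!/usr/bin/env bash
# Defensive prolog/epilog sketch: log non-critical failures instead of
# propagating them, so a flaky task does not drain the node. Return
# non-zero only for conditions that warrant taking the node offline.
set -u  # error on unset variables; deliberately no 'set -e' here

log() { echo "prolog: $*"; }

run_noncritical() {
    # Run "$@"; on failure, log the problem and swallow the error.
    if ! "$@"; then
        log "non-critical task failed: $*"
    fi
    return 0
}

run_noncritical false   # 'false' stands in for e.g. a metrics push
status=$?
echo "final status: $status"
```

Because `run_noncritical` always returns 0, the script's final status stays 0 even when the wrapped task fails, and the node remains in service.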

#### Test example of custom prolog and epilog scripts

In this example, we have a prolog and an epilog script. Both scripts print a message to stdout. The epilog script exits with status 1 to simulate a failure.

```bash
tensorwave@tensorwave.com@slurm-login-skip-849dbcf5c-q7ffr:~$ sudo cat /mnt/customer/prolog.d/test-prolog-1.sh
#!/usr/bin/env bash
echo "Hello from job id $SLURM_JOB_ID on node $SLURMD_NODENAME"

tensorwave@tensorwave.com@slurm-login-skip-849dbcf5c-q7ffr:~$ sudo cat /mnt/customer/epilog.d/test-epilog-1.sh
#!/usr/bin/env bash
echo "Goodbye from job id $SLURM_JOB_ID on node $SLURMD_NODENAME"
exit 1
```

To trigger the prolog and epilog, we submit a job with `srun`. Because `test-epilog-1.sh` 'fails', the node our job ran on is drained.

```bash
tensorwave@tensorwave.com@slurm-login-skip-849dbcf5c-q7ffr:~$ srun -N 1 --gpus-per-node=8 hostname
tus1-p2-g6

tensorwave@tensorwave.com@slurm-login-skip-849dbcf5c-q7ffr:~$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpuworker*    up   infinite      1  drain tus1-p2-g6
gpuworker*    up   infinite      1   idle tus1-p2-g5
```
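Once the faulty script is fixed or removed, a drained node must be returned to service by hand. Assuming standard Slurm tooling (these are stock `sinfo`/`scontrol` commands, not TensorWave-specific, and require admin privileges on the cluster):

```bash
# Inspect why the node was drained (see the REASON column).
sinfo -R

# Return the node to service.
sudo scontrol update nodename=tus1-p2-g6 state=resume
```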

Investigating the logs: the prolog ran successfully, so only its summary line is recorded. Since the epilog failed, its full output is saved in the log as well.

```bash
tensorwave@tensorwave.com@slurm-login-skip-849dbcf5c-q7ffr:~$ cat /mnt/customer/logs/prolog/tus1-p2-g6.log
timestamp=2026-04-01T18:43:39Z script=/mnt/customer/prolog.d/test-prolog-1.sh exit_code=0 job_id=5995 job_user=tensorwave@tensorwave.com

tensorwave@tensorwave.com@slurm-login-skip-849dbcf5c-q7ffr:~$ cat /mnt/customer/logs/epilog/tus1-p2-g6.log
timestamp=2026-04-01T18:43:42Z script=/mnt/customer/epilog.d/test-epilog-1.sh exit_code=1 job_id=5995 job_user=tensorwave@tensorwave.com
--- output ---
Goodbye from job id 5995 on node tus1-p2-g6
```
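The uniform per-node log format makes it easy to find which nodes hit a failing script. A small sketch, run here against a throwaway copy of the log line above rather than the real `/mnt/customer/logs` tree (the log format is assumed to match the example):

```bash
# Recreate one epilog log line in a scratch directory for demonstration.
demo=$(mktemp -d)
cat > "$demo/tus1-p2-g6.log" <<'EOF'
timestamp=2026-04-01T18:43:42Z script=/mnt/customer/epilog.d/test-epilog-1.sh exit_code=1 job_id=5995 job_user=tensorwave@tensorwave.com
EOF

# List log files (i.e. nodes) whose epilog recorded a non-zero exit code.
# On the cluster, point this at /mnt/customer/logs/epilog/*.log instead.
failing=$(grep -l 'exit_code=[1-9]' "$demo"/*.log)
echo "$failing"
```

Since exit codes are 0-255, matching a first digit of 1-9 after `exit_code=` catches every non-zero code while skipping successful runs.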

#### References

Slurm Prolog and Epilog Guide: <https://slurm.schedmd.com/prolog_epilog.html>
