
Prolog / Epilog

Slurm supports prolog and epilog scripts that run automatically on each worker pod at the start and end of every job. Prologs run before the first job step begins; epilogs run after the job completes or is cancelled.

Common uses include:

  • Verifying node health before a job starts

  • Cleaning up temporary files or resetting state after a job ends

  • Logging job metadata for monitoring or auditing

  • Enforcing site-specific policies around GPU, network, or filesystem state

Prolog and epilog scripts run as root on the worker pod, which gives them broad access but also means a failing or buggy script can affect the pod and any jobs running on it. Specifically, if a prolog or epilog script exits with a non-zero code, Slurm will place the node in DRAIN state, taking it out of service. Test scripts carefully before deploying them.
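One way to act on that advice is to dry-run a script by hand before it goes into the hook directories. The sketch below is illustrative only: `dry_run_hook` is a hypothetical helper, the job values are made up, and it assumes the managed dispatcher passes Slurm's usual `SLURM_JOB_ID`, `SLURM_JOB_USER`, and `SLURM_JOB_UID` variables through to custom scripts, which this page does not confirm. It runs as your own user rather than root, so anything permission-related will behave differently on a worker pod.

```shell
# Hypothetical helper: run a hook script with made-up job variables and
# report its exit status, without involving Slurm at all.
dry_run_hook() {
  local script="$1" status=0
  # The assignments apply only to this one child process.
  SLURM_JOB_ID=0 SLURM_JOB_USER="$(id -un)" SLURM_JOB_UID="$(id -u)" \
    "$script" || status=$?
  echo "exit status: $status"
  # Return the same status so callers can branch on it.
  return "$status"
}
```

For example, `dry_run_hook ./10-check-mounts.sh` (a hypothetical filename) prints the status the script would hand back. Any value other than 0 is one that would drain the node in production.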

For full background on how Slurm handles prolog and epilog execution, see the Slurm Prolog and Epilog Guide.


Built-in scripts

TensorWave runs a set of managed prolog and epilog scripts on every job automatically. These handle node health checks (see Health Checks), GPU metrics collection for the dashboard, and dispatching your custom scripts. Your scripts always run after the built-in health checks.


Adding custom scripts

Custom prolog and epilog scripts go in the following directories on the shared storage volume:

| Directory | When scripts run |
| --- | --- |
| /mnt/customer/prolog.d/ | Before the first job step of each job, on every allocated node |
| /mnt/customer/epilog.d/ | After each job completes or is cancelled, on every allocated node |

Scripts are executed in lexicographic order by filename. Use numeric prefixes to control ordering:
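As an illustration, the snippet below creates three hypothetical scripts in a throwaway directory and lists them in byte-wise lexicographic order. The filenames are invented, and `LC_ALL=C` is an assumption about how the dispatcher sorts; this page only states that the order is lexicographic.

```shell
# Throwaway directory standing in for /mnt/customer/prolog.d/.
demo_dir="$(mktemp -d)"
touch "$demo_dir/10-check-mounts.sh" \
      "$demo_dir/20-prepare-scratch.sh" \
      "$demo_dir/90-log-job.sh"

# ls sorts by name; LC_ALL=C makes that a plain byte-wise comparison.
run_order="$(LC_ALL=C ls "$demo_dir")"
echo "$run_order"
```

Zero-padding the prefixes matters: a file named `9-cleanup.sh` would sort after `10-check-mounts.sh`, because the comparison is character by character and `9` is greater than `1`.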

Requirements

  • Scripts must be executable (chmod +x). Non-executable files are skipped with a warning in the log.

  • Scripts must include a shebang on the first line (#!/usr/bin/env bash).

  • Scripts run as root. A non-zero exit code will drain the node.
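The first two requirements above can be checked mechanically before a script is installed. The function below is a minimal sketch, not part of the platform: `check_hook_script` is a hypothetical name, and it only tests the executable bit and the presence of a `#!` line, not whether the interpreter exists.

```shell
# Hypothetical pre-flight check for the executable-bit and shebang rules.
check_hook_script() {
  local script="$1" first_line=""
  if [ ! -x "$script" ]; then
    echo "$script: not executable (run chmod +x)" >&2
    return 1
  fi
  # Read only the first line; a shebang starts with the two bytes "#!".
  IFS= read -r first_line < "$script" || true
  case "$first_line" in
    '#!'*) return 0 ;;
    *) echo "$script: missing shebang on line 1" >&2; return 1 ;;
  esac
}
```

Running `check_hook_script ./20-prepare-scratch.sh` (a hypothetical filename) returns 0 when both checks pass and prints the reason otherwise.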

Example prolog script
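A minimal sketch of what a custom prolog might look like, creating a per-job scratch directory. Everything specific here is an assumption for illustration: the filename `50-job-scratch.sh`, the `SCRATCH_ROOT` variable and its `/tmp` default, and the expectation that the dispatcher exposes Slurm's `SLURM_JOB_ID` and `SLURM_JOB_UID` variables to custom scripts.

```shell
#!/usr/bin/env bash
# Hypothetical custom prolog, e.g. /mnt/customer/prolog.d/50-job-scratch.sh
set -u

scratch_root="${SCRATCH_ROOT:-/tmp}"    # assumed location, not a platform default
job_id="${SLURM_JOB_ID:-unknown}"       # assumed to be provided by the dispatcher
job_scratch="${scratch_root}/job-${job_id}"

# Plain mkdir (no -p) refuses to reuse an existing path, so a stale
# directory or symlink is never handed to the job owner.
if mkdir -m 700 "$job_scratch" 2>/dev/null; then
  # Hand the directory to the job owner when the UID is known.
  if [ -n "${SLURM_JOB_UID:-}" ]; then
    chown "$SLURM_JOB_UID" "$job_scratch" \
      || echo "warning: could not chown $job_scratch" >&2
  fi
else
  echo "warning: could not create $job_scratch (already present or not writable)" >&2
fi
```

The script only warns on failure, so it ends with status 0 and does not drain the node over a missing scratch directory. A check that should take the node out of service would instead end with a non-zero status.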

Install it by copying the script into /mnt/customer/prolog.d/ and marking it executable.
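The copy and the permission change can be done in one step with the standard `install` utility. The wrapper below is a sketch only: `install_hook` and the script filename are hypothetical, and writing to the hook directories requires administrator rights.

```shell
# Hypothetical helper: copy a script into a hook directory with mode 755,
# which satisfies the chmod +x requirement as part of the copy.
install_hook() {
  local src="$1" dest_dir="$2"
  install -m 755 "$src" "$dest_dir/$(basename "$src")"
}
```

On the cluster this would be called as `install_hook ./50-job-scratch.sh /mnt/customer/prolog.d`. The same helper works for epilog scripts when given /mnt/customer/epilog.d as the destination.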

Example epilog script
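A minimal sketch of a custom epilog that cleans up temporary files, here a per-job scratch directory of the kind a prolog might have created. The filename, the `SCRATCH_ROOT` variable with its `/tmp` default, and the availability of `SLURM_JOB_ID` inside custom scripts are all assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical custom epilog, e.g. /mnt/customer/epilog.d/50-job-scratch.sh
set -u

scratch_root="${SCRATCH_ROOT:-/tmp}"    # assumed location, not a platform default
job_id="${SLURM_JOB_ID:-unknown}"       # assumed to be provided by the dispatcher
job_scratch="${scratch_root}/job-${job_id}"

# Remove only a real directory; a symlink at this path is left alone so a
# root-owned cleanup cannot be redirected somewhere else.
if [ -d "$job_scratch" ] && [ ! -L "$job_scratch" ]; then
  rm -rf -- "$job_scratch" \
    || echo "warning: could not remove $job_scratch" >&2
fi
```

As written, a failed cleanup produces a warning rather than a non-zero status, so leftover files do not drain the node.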

Install it by copying the script into /mnt/customer/epilog.d/ and marking it executable.


Viewing logs

Per-node prolog and epilog logs are written to:

| Path | Contents |
| --- | --- |
| /mnt/customer/logs/prolog/<node>.log | Output from all prolog scripts on that node |
| /mnt/customer/logs/epilog/<node>.log | Output from all epilog scripts on that node |

Each entry includes a timestamp, script path, exit code, job ID, and user. A script's output is captured in the log only when that script fails.

Viewing a node's prolog log:
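A small wrapper around `tail` is one way to do this. The helper name `show_hook_log` and the `LOG_ROOT` override are hypothetical; the default path is the one listed in the table above.

```shell
# Hypothetical helper: show the last lines of one node's hook log.
# kind is "prolog" or "epilog"; the third argument is the line count.
show_hook_log() {
  local kind="$1" node="$2" lines="${3:-50}"
  local log="${LOG_ROOT:-/mnt/customer/logs}/${kind}/${node}.log"
  if [ ! -r "$log" ]; then
    echo "no readable log at $log" >&2
    return 1
  fi
  tail -n "$lines" "$log"
}
```

For example, `show_hook_log prolog node001` prints the last 50 lines for a node named `node001` (a made-up name), and `show_hook_log epilog node001 200` widens the window for the epilog log.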

Example output for a successful prolog:

Example output when an epilog script fails (output is included):

If a script fails and the node drains, check the log for the affected node first, then inspect node state with sinfo:
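The `sinfo` step might look like the sketch below. `inspect_node` is a hypothetical helper; the flags are standard `sinfo` options (`-R` lists the reason a node is down or drained, `-N -l` gives a long per-node listing, and `-n` restricts output to the named node).

```shell
# Hypothetical helper: print the drain reason and the long listing for a node.
inspect_node() {
  local node="$1"
  if ! command -v sinfo >/dev/null 2>&1; then
    echo "sinfo not found; run this where the Slurm client tools are installed" >&2
    return 127
  fi
  sinfo -R -n "$node"      # reason, user, and timestamp for a down or drained node
  sinfo -N -l -n "$node"   # one line per node, including its current state
}
```

Calling `inspect_node node001` (a made-up node name) alongside that node's log usually shows both which script failed and the reason Slurm recorded for the drain.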

Once the issue is resolved, contact your cluster administrator to resume the node.


Scripts in /mnt/customer/prolog.d and /mnt/customer/epilog.d are writable by administrators only (chmod 1700). Logs in /mnt/customer/logs are readable by all users.
