
# Prolog / Epilog

### Overview

Slurm supports prolog and epilog scripts that run automatically on each worker pod at the start and end of every job. Prologs run upon allocation; epilogs run after the job completes or is cancelled.

Common uses include:

* Verifying node health before a job starts
* Cleaning up temporary files or resetting state after a job ends
* Logging job metadata for monitoring or auditing
* Enforcing site-specific policies around GPU, network, or filesystem state

Prolog and epilog scripts run as **root** on the worker pod, which gives them broad access but also means a failing or buggy script can affect the pod and any jobs running on it. Specifically, **if a prolog or epilog script exits with a non-zero code, Slurm will place the node in DRAIN state**, taking it out of service. Test scripts carefully before deploying them.

For full background on how Slurm handles prolog and epilog execution, see the [Slurm Prolog and Epilog Guide](https://slurm.schedmd.com/prolog_epilog.html).
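Because a non-zero exit code drains the node, it can help to run a candidate script by hand and inspect its exit code before installing it. The following is a minimal sketch, not TensorWave tooling: `dry_run` is a hypothetical helper, and the placeholder values `0` and `test-node` are assumptions standing in for what Slurm would set. A run on a workstation as a regular user will not reproduce the root environment of a worker pod, so treat a clean result as a first check only.

```bash
#!/usr/bin/env bash
# dry_run is a hypothetical helper: it runs a candidate script with placeholder
# values for the two variables used in this page's examples, discards the
# script's output, and prints the exit code it returned.
dry_run() {
  local script=$1 rc=0
  SLURM_JOB_ID=0 SLURMD_NODENAME=test-node "$script" >/dev/null 2>&1 || rc=$?
  echo "$rc"
}

# Usage (path is illustrative):
# dry_run ./my-prolog.sh
```

Any printed value other than `0` means the script would have drained the node had it run as a real prolog or epilog.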

***

### Built-in scripts

TensorWave runs a set of managed prolog and epilog scripts on every job automatically. These handle node health checks (see Health Checks), GPU metrics collection for the dashboard, and dispatching your custom scripts. Your scripts always run after the built-in health checks.

***

### Adding custom scripts

Custom prolog and epilog scripts go in the following directories on the shared storage volume:

| Directory                 | Script Execution                                  |
| ------------------------- | ------------------------------------------------- |
| `/mnt/customer/prolog.d/` | Upon allocation, on every allocated node          |
| `/mnt/customer/epilog.d/` | After each job completes, on every allocated node |

Scripts are executed in **lexicographic order** by filename. Use numeric prefixes to control ordering, and use leading zeroes if necessary to ensure accurate sorting:

```
/mnt/customer/prolog.d/
  01-check.sh
  10-check-something-else.sh
  20-setup-environment.sh
  99-final-step.sh
```
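To see why leading zeroes matter, the sketch below sorts two pairs of hypothetical filenames byte by byte. It assumes the dispatcher's ordering matches `LC_ALL=C sort`, which this page does not confirm.

```bash
#!/usr/bin/env bash
# Hypothetical filenames. Without a leading zero, "10-" sorts before "2-"
# because the character "1" compares lower than "2".
sorted_unpadded=$(printf '%s\n' 2-setup.sh 10-check.sh | LC_ALL=C sort)
sorted_padded=$(printf '%s\n' 02-setup.sh 10-check.sh | LC_ALL=C sort)

# Unquoted on purpose so each sorted list prints on a single line.
echo "unpadded:" $sorted_unpadded
echo "padded:" $sorted_padded
```

The unpadded pair comes out as `10-check.sh 2-setup.sh`, while the padded pair comes out in the intended order, `02-setup.sh 10-check.sh`.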

#### Requirements

* Scripts must be **executable** (`chmod +x`). Non-executable files are skipped with a warning in the log.
* Scripts must include a **shebang** on the first line (`#!/usr/bin/env bash`).
* Scripts run as **root**. A non-zero exit code will drain the node.
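The first two requirements can be checked before a script is copied into place. The following is a minimal sketch: `check_script` is a hypothetical helper, not part of Slurm or TensorWave tooling, and it only tests the executable bit and the presence of a shebang, not whether the script itself is correct.

```bash
#!/usr/bin/env bash
# check_script is a hypothetical helper. It returns 0 when the file is
# executable and its first line starts with "#!", and 1 otherwise.
check_script() {
  local path=$1 first_line=""
  if [[ ! -x "$path" ]]; then
    echo "not executable: $path" >&2
    return 1
  fi
  # Read only the first line; an empty file leaves first_line empty.
  IFS= read -r first_line < "$path" || true
  if [[ "$first_line" != '#!'* ]]; then
    echo "missing shebang: $path" >&2
    return 1
  fi
  return 0
}

# Usage (path is illustrative):
# check_script ./my-prolog.sh && echo "ok"
```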

#### Example prolog script

```bash
#!/usr/bin/env bash
echo "Hello from job $SLURM_JOB_ID on node $SLURMD_NODENAME"
```

Install it:

```bash
sudo cp my-prolog.sh /mnt/customer/prolog.d/10-my-prolog.sh
sudo chmod +x /mnt/customer/prolog.d/10-my-prolog.sh
```

#### Example epilog script

```bash
#!/usr/bin/env bash
echo "Goodbye from job $SLURM_JOB_ID on node $SLURMD_NODENAME"
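# Clean up any job-specific scratch; skip if the job ID is unset so the path is never just /tmp/job-
if [[ -n "${SLURM_JOB_ID:-}" ]]; then
  rm -rf "/tmp/job-${SLURM_JOB_ID}"
fi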
```

Install it:

```bash
sudo cp my-epilog.sh /mnt/customer/epilog.d/10-my-epilog.sh
sudo chmod +x /mnt/customer/epilog.d/10-my-epilog.sh
```
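Since a non-zero exit code drains the node, it may be worth keeping best-effort steps from deciding the script's exit status. The sketch below shows one way to do that, assuming the step is safe to skip: `run_optional` and the scratch directory name are hypothetical. Checks that should take a node out of service should still be allowed to fail normally.

```bash
#!/usr/bin/env bash
# run_optional is a hypothetical helper: it runs a command, reports a failure
# on stderr, and always returns 0 so that failure does not become the
# script's exit code.
run_optional() {
  "$@" || echo "optional step failed (exit $?): $*" >&2
  return 0
}

# Example: best-effort removal of a hypothetical, empty scratch directory.
run_optional rmdir "/tmp/job-${SLURM_JOB_ID:-unknown}-scratch"
```

The trade-off is visibility: the log captures script output only when the script fails, so a message from a swallowed failure would not appear there.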

***

### Viewing logs

Per-node prolog and epilog logs are written to:

| Path                                   | Contents                                    |
| -------------------------------------- | ------------------------------------------- |
| `/mnt/customer/logs/prolog/<node>.log` | Output from all prolog scripts on that node |
| `/mnt/customer/logs/epilog/<node>.log` | Output from all epilog scripts on that node |

Each entry includes a timestamp, script path, exit code, job ID, and user. Script output is captured in the log only when the script fails.

**Viewing a node's prolog log:**

```bash
cat /mnt/customer/logs/prolog/tus1-p2-g6.log
```

Example output for a successful prolog:

```
timestamp=2026-04-01T18:43:39Z script=/mnt/customer/prolog.d/10-my-prolog.sh exit_code=0 job_id=5995 job_user=user@example.com
```

Example output when an epilog script fails (output is included):

```
timestamp=2026-04-01T18:43:42Z script=/mnt/customer/epilog.d/10-my-epilog.sh exit_code=1 job_id=5995 job_user=user@example.com
--- output ---
Goodbye from job 5995 on node
```
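On a busy node the log can grow long, so filtering for failed entries may be quicker than reading it in full. The following is a minimal sketch that assumes every entry follows the `key=value` layout shown in the examples above. `failed_entries` is a hypothetical helper; it prints only the entry line, not the captured output that follows it.

```bash
#!/usr/bin/env bash
# failed_entries is a hypothetical helper: it prints each log entry whose
# exit_code field is anything other than 0. It reads the files given as
# arguments, or standard input when no file is given.
failed_entries() {
  awk '/^timestamp=/ {
         for (i = 1; i <= NF; i++)
           if ($i ~ /^exit_code=/ && $i != "exit_code=0") print
       }' "$@"
}

# Usage with a per-node log file:
# failed_entries /mnt/customer/logs/epilog/tus1-p2-g6.log
```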

If a script fails and the node drains, check the log for the affected node first, then inspect node state with `sinfo`:

```bash
sinfo -n tus1-p2-g6
```

Once the issue is resolved, contact your cluster administrator to resume the node.

***

> Scripts in `/mnt/customer/prolog.d` and `/mnt/customer/epilog.d` are writable by administrators only (`chmod 1700`). Logs in `/mnt/customer/logs` are readable by all users.

