
# Healthchecks

### Overview

TensorWave runs four layers of health checks to ensure reliability. Passive checks run continuously on a 5-minute heartbeat and during each job's prolog and epilog. Active checks are scheduled every few days. Passive checks act as health gates: a failed check places the node in the DRAIN state, and the node accepts no new jobs until the issue is resolved. Once the underlying problem is fixed, a passing check returns the node to service automatically. Nodes drained for reasons unrelated to health checks (for example, manual operator action) are not automatically resumed.
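To see which nodes are currently drained and why, standard Slurm tooling is enough. The sketch below assumes `sinfo` is on the path and uses its `%N`, `%T`, and `%E` format fields (node name, state, reason). The helper name `drained_nodes` is hypothetical, and the reason text that a failed check records is site-specific.

```bash
# Hypothetical helper (not part of sdebug or Slurm): keep only
# draining/drained nodes from "node state reason..." lines on stdin.
drained_nodes() {
    # Field 2 is the node state; sort -u drops duplicates that appear
    # when a node belongs to more than one partition.
    awk '$2 ~ /^drain/' | sort -u
}

# Only query Slurm when sinfo is actually available.
if command -v sinfo >/dev/null 2>&1; then
    # -h: no header, -N: one line per node, %N %T %E: name, state, reason
    sinfo -h -N -o "%N %T %E" | drained_nodes
fi
```

Keeping the filter separate from the `sinfo` call makes it easy to reuse on saved output.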

Healthcheck results are logged continuously. Use `sdebug info` to read and summarize results for all checks across all nodes.

***

### Passive checks

Passive checks are quick, non-invasive checks that validate that hardware is properly configured and in a known-good state.

#### Prolog and Epilog

Passive checks run on every job start and end via Slurm prolog and epilog. They are fast, non-disruptive, and cover hardware and configuration state:

* GPU presence, ECC error counts, and reset state
* RDMA link status, GID tables, and InfiniBand device presence
* Network interface state and recent link flap events
* Filesystem mounts and system resource limits
* Required daemons (SSSD, SSHD, LLDP)

#### Continuous checks

Slurm's `HealthCheckProgram` runs a health check on every node at a fixed wall-clock interval, independent of job activity. Continuous checks run the full passive check suite plus an **RDC health check**, which verifies that the ROCm Data Center daemon and underlying GPU hardware are in a known-good state. A failure drains the node the same way a prolog failure does. Once the issue is resolved, a subsequent passing continuous check will automatically return the node to service.
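The interval and program that Slurm uses for these continuous checks can be inspected on a running cluster. This is a minimal sketch, assuming `scontrol` is available. `HealthCheckInterval`, `HealthCheckNodeState`, and `HealthCheckProgram` are standard `slurm.conf` parameters; the helper name `healthcheck_settings` is hypothetical, and the values printed depend on the site configuration.

```bash
# Hypothetical helper: keep only the HealthCheck* lines from
# `scontrol show config` style output read on stdin.
healthcheck_settings() {
    grep -i '^HealthCheck'
}

# Only query Slurm when scontrol is actually available.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show config | healthcheck_settings
fi
```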

***

### Active checks

Active checks are heavier per-node tests submitted as Slurm jobs by a cron process on the controller. They run on a schedule against idle nodes and do not require a user job to be present:

* Single-node RCCL collective communication performance
* Multi-node training convergence against a reference baseline
* GPU stress, memory, and PCIe bandwidth and error tests

***

### Multinode checks

Multinode checks are initiated by an operator or user and do not produce a drain signal on their own. They are used for cluster validation and troubleshooting.

Available multinode checks include:

* **RCCL** — collective communication tests across node pairs to identify nodes with degraded interconnect performance
* **IB perf** — InfiniBand bandwidth tests between node pairs

```bash
# Run a shakedown across all nodes in a nodelist
sdebug run rccl --nodelist tus1-p14-g[1-64]
```

***

### sdebug

`sdebug` is the CLI for running manual checks and reviewing persisted results from all check layers.

**Manually run checks against a set of nodes:**

```bash
sdebug run -N 2 --nodelist tus1-p14-g[1-2] passive
```

**Run a specific set of checks inside an `salloc` allocation:**

```bash
salloc -N 2 --gpus-per-node=8
sdebug run --testlist ecc,rdma-links,gpu-count
```

**Run the passive suite as a pre-flight check in an `sbatch` script:**

```bash
#!/bin/bash
#SBATCH --job-name=my-job
#SBATCH --output=jid-%j.name-%x.log
#SBATCH --nodes=4
#SBATCH --gpus-per-node=8

# Validate node health
sdebug run passive

SDEBUG_ECODE=$?
if [[ "$SDEBUG_ECODE" != "0" ]]; then
    echo "Node failed healthcheck!"
    # `return` is only valid inside a function or sourced file; use `exit` in a batch script
    exit "$SDEBUG_ECODE"
fi

srun python train.py
```

#### sdebug info

`sdebug info` reads persisted results from disk and prints a summary across nodes, similar to `sinfo`. No allocation is required.

```bash
sdebug info
```

Filter by event type or hardware category:

```bash
sdebug info prolog        # prolog check results
sdebug info epilog        # epilog check results
sdebug info gpu           # GPU-related checks
sdebug info net           # network-related checks
sdebug info scheduled     # scheduled active check results
sdebug info timestamps    # last check time per node
```

Results are read from `/mnt/twhc/` and cover all event types: prolog, epilog, scheduled, and healthcheck runs.

To drill into a specific node:

```bash
sdebug info -n tus1-p14-g36
```

This shows the full health check history for that node across all event types, which is useful when investigating a drain or verifying a node after repair.
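When investigating a drain, it can help to view Slurm's recorded state and reason next to the check history. This is a sketch, assuming both `scontrol` and `sdebug` are on the path. The function name `investigate_node` is hypothetical, and it uses only the `sdebug info -n` form shown above.

```bash
# Hypothetical wrapper: print Slurm's state/reason for a node,
# followed by its health check history from sdebug.
investigate_node() {
    if [ -z "$1" ]; then
        echo "usage: investigate_node <nodename>" >&2
        return 2
    fi
    # Slurm's view of the node: the State= and (if set) Reason= lines
    scontrol show node "$1" | grep -E 'State=|Reason='
    # Persisted health check results for the same node
    sdebug info -n "$1"
}
```

For example, `investigate_node tus1-p14-g36` prints the drain reason Slurm recorded, then the node's check history.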

