Healthchecks

TensorWave runs four layers of health checks to ensure reliability. Passive checks run continuously on a 5-minute heartbeat and during each job's prolog and epilog. Active checks are scheduled every few days. Passive checks act as health gates: a failed check places the node in the DRAIN state, and the node accepts no new jobs until the issue is resolved. Once the underlying problem is fixed, a passing check returns the node to service automatically. Nodes drained for reasons unrelated to health checks (for example, manual operator action) are not automatically resumed.
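The gating rule can be sketched as a small state function. This is a minimal illustration and not TensorWave's implementation: the `Node` type, the `HEALTHCHECK_REASON_PREFIX` marker, and `apply_check_result` are hypothetical names, and the sketch assumes the drain reason is what distinguishes a health-check drain from an operator drain.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical marker that tags drains caused by a health check.
HEALTHCHECK_REASON_PREFIX = "healthcheck:"


@dataclass
class Node:
    name: str
    state: str = "IDLE"  # only "IDLE" and "DRAIN" are modeled in this sketch
    drain_reason: Optional[str] = None


def apply_check_result(node: Node, passed: bool, detail: str = "") -> Node:
    """Apply one passive check result to a node and return the node."""
    if not passed:
        # A failed check drains the node. An existing drain (for example,
        # one set by an operator) keeps its original reason.
        if node.state != "DRAIN":
            node.state = "DRAIN"
            node.drain_reason = HEALTHCHECK_REASON_PREFIX + detail
        return node

    # A passing check resumes the node only if a health check drained it.
    reason = node.drain_reason or ""
    if node.state == "DRAIN" and reason.startswith(HEALTHCHECK_REASON_PREFIX):
        node.state = "IDLE"
        node.drain_reason = None
    return node
```

Under these assumptions, a node drained with an operator-supplied reason stays in DRAIN even after a passing check, while a node drained by a failed check returns to service on the next pass.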

Healthcheck results are logged continuously. Use sdebug info to read and summarize results for all checks across all nodes.


Passive checks

Passive checks are quick, non-invasive checks that validate that hardware is properly configured and in a known-good state.

Prolog and Epilog

Passive checks run on every job start and end via Slurm prolog and epilog. They are fast, non-disruptive, and cover hardware and configuration state:

  • GPU presence, ECC error counts, and reset state

  • RDMA link status, GID tables, and InfiniBand device presence

  • Network interface state and recent link flap events

  • Filesystem mounts and system resource limits

  • Required daemons (SSSD, SSHD, LLDP)
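As a rough illustration of how a fast, non-disruptive check of this kind might be structured, the sketch below verifies filesystem mounts from `/proc/mounts`-style text and folds the results of several checks into a single exit code. Slurm drains a node when its prolog or epilog exits non-zero, which is what makes the exit code act as a gate. The function names and the shape of the check registry are assumptions made for this example; the actual checks are not shown in this document.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def missing_mounts(proc_mounts_text: str, required: Iterable[str]) -> List[str]:
    """Return the required mount points absent from /proc/mounts-style text."""
    mounted = set()
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            mounted.add(fields[1])  # the second field is the mount point
    return [m for m in required if m not in mounted]


def run_checks(checks: Dict[str, Callable[[], List[str]]]) -> Tuple[int, List[str]]:
    """Run every check and return (exit_code, failure_messages).

    Each check returns a list of problems; an empty list means it passed.
    The exit code is 0 only if every check passed.
    """
    failures = []
    for name, check in checks.items():
        for problem in check():
            failures.append(f"{name}: {problem}")
    return (1 if failures else 0), failures
```

A prolog wrapper built this way would read `/proc/mounts`, call `run_checks` with one entry per check category, log the failure messages, and exit with the returned code.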

Continuous checks

Slurm's HealthCheckProgram runs a health check on every node at a fixed wall-clock interval, independent of job activity. Continuous checks run the full passive check suite plus an RDC health check, which verifies that the ROCm Data Center daemon and underlying GPU hardware are in a known-good state. A failure drains the node the same way a prolog failure does. Once the issue is resolved, a subsequent passing continuous check will automatically return the node to service.


Active checks

Active checks are heavier per-node tests submitted as Slurm jobs by a cron process on the controller. They run on a schedule against idle nodes and do not require a user job to be present:

  • Single-node RCCL collective communication performance

  • Multi-node training convergence against a reference baseline

  • GPU stress, memory, and PCIe bandwidth and error tests
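The scheduling step can be sketched as follows: pick idle nodes from `sinfo` output and build one `sbatch` command per node. This is only an illustration of the idea. The script name `active_checks.sh` is a hypothetical placeholder, and the real cron process may select nodes and submit jobs differently. The `sinfo` and `sbatch` options used here are standard Slurm options.

```python
from typing import List


def idle_nodes(sinfo_output: str) -> List[str]:
    """Parse `sinfo -h -N -o "%N %T"` output and return idle node names.

    Only the exact state "idle" is accepted, so flagged states such as
    "idle*" (not responding) are skipped. Nodes listed once per partition
    are de-duplicated.
    """
    nodes: List[str] = []
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == "idle" and parts[0] not in nodes:
            nodes.append(parts[0])
    return nodes


def sbatch_command(node: str, script: str = "active_checks.sh") -> List[str]:
    """Build an sbatch argv that pins one exclusive single-node job to `node`."""
    return ["sbatch", "--nodes=1", f"--nodelist={node}", "--exclusive", script]
```

Each returned argv can be passed to `subprocess.run` on the controller. Keeping the parsing separate from the submission makes the node selection easy to test without a live cluster.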


Multinode checks

Multinode checks are initiated by an operator or user and do not produce a drain signal on their own. They are used for cluster validation and troubleshooting.

Available multinode checks include:

  • RCCL — collective communication tests across node pairs to identify nodes with degraded interconnect performance

  • IB perf — InfiniBand bandwidth tests between node pairs
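One way to turn pairwise results into a list of suspect nodes is to count how often each node appears in an underperforming pair. The sketch below assumes results are available as a mapping from node pairs to measured bandwidth; the function name, the threshold parameter, and the `min_slow_pairs` cutoff are hypothetical and are not part of sdebug.

```python
from collections import Counter
from typing import Dict, List, Tuple


def suspect_nodes(
    pair_bw: Dict[Tuple[str, str], float],
    min_bw: float,
    min_slow_pairs: int = 2,
) -> List[str]:
    """Return nodes that appear in at least `min_slow_pairs` slow pair tests.

    A single slow pair cannot be attributed to either endpoint, so a node
    is only reported once it is slow against several different partners.
    Results are ordered by slow-pair count (descending), then by name.
    """
    slow: Counter = Counter()
    for (a, b), bw in pair_bw.items():
        if bw < min_bw:
            slow[a] += 1
            slow[b] += 1
    flagged = [n for n, count in slow.items() if count >= min_slow_pairs]
    return sorted(flagged, key=lambda n: (-slow[n], n))
```

A node that is slow against every partner is far more likely to be the degraded endpoint than any of its partners, which is why pair tests across many combinations localize a fault better than a single test.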


sdebug

sdebug is the CLI for running manual checks and reviewing persisted results from all check layers.

Manually run checks against a set of nodes:

Run a specific set of checks in an salloc instance:

Run the active suite in an sbatch:

sdebug info

sdebug info reads persisted results from disk and prints a summary across nodes, similar to sinfo. No allocation is required.

Filter by event type or hardware category:

Results are read from /mnt/twhc/ and cover all event types: prolog, epilog, scheduled, and healthcheck runs.

To drill into a specific node:

This shows the full health check history for that node across all event types, which is useful when investigating a drain or verifying a node after repair.

