# Healthchecks and SDebug

* NHC and TWHC health checks run before and after every job via Slurm prolog and epilog scripts.
* Use `sdebug info` to get a summary of the most recent health checks.
* Use `sdebug info --node <hostname>` to get in-depth health check details for a specific node.

***

#### Healthchecks

TensorWave Slurm regularly runs health checks to ensure nodes are operating at peak performance. We've deployed [LBNL's Node Health Check (NHC)](https://github.com/mej/nhc), as well as our own in-house TensorWave Health Check (TWHC) suite. Checks run before and after every job through the Slurm prolog and epilog, and any node that fails a check is automatically drained to avoid disrupting operations. All health check results are saved to disk and can be inspected with the accompanying `sdebug` tool.
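To make the pass/fail flow concrete, here is a minimal Python sketch of a prolog-style check runner. It is not the actual NHC or TWHC implementation: it assumes each check is a standalone command whose return code is 0 on success and non-zero on failure, and the check names and commands (`check-always-pass`, `check-always-fail`) are hypothetical placeholders.

```python
import subprocess
import sys


def run_checks(checks: dict[str, list[str]]) -> dict[str, int]:
    """Run each check command and record its return code."""
    results = {}
    for name, argv in checks.items():
        # capture_output keeps the check's own output off our stdout
        results[name] = subprocess.run(argv, capture_output=True).returncode
    return results


def node_is_healthy(results: dict[str, int]) -> bool:
    """A node is healthy only if every check returned 0."""
    return all(rc == 0 for rc in results.values())


# Hypothetical stand-ins for real checks: one always passes, one always fails.
demo_checks = {
    "check-always-pass": [sys.executable, "-c", "raise SystemExit(0)"],
    "check-always-fail": [sys.executable, "-c", "raise SystemExit(1)"],
}

if __name__ == "__main__":
    results = run_checks(demo_checks)
    for name, rc in results.items():
        print(f"{name}: {'PASS' if rc == 0 else 'FAIL'} (retcode {rc})")
    # A real prolog would signal failure to the scheduler via its exit code.
    sys.exit(0 if node_is_healthy(results) else 1)
```

In this sketch a single failing check is enough to mark the node unhealthy, which mirrors the "any failure drains the node" behavior described above.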

#### Investigating Node Health with SDebug

Get a high-level overview of cluster health with `sdebug info`. This summarizes the results of the most recent health checks, with each row grouping the nodes that share the same combination of PASS and FAIL results.

```
tensorwave@tensorwave.com@slurm-login-skip-8566547b9c-qsnbh:~$ sdebug info
check-ecc  check-gpu-reset  check-link-flap  check-lldpd  check-nhc  check-rdma-links  check-ulimits  NODES
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
PASS       PASS             PASS             PASS         PASS       PASS              PASS           tus1-p13-g[1-6,8,9,11-23,25-26,28-30,32-43,45-50,52-54,56-59,62,64],tus1-p14-g[1-31,33-49,51-52,54-61,63-64],tus1-p16-g[1-6,9-17,19-20,24-30,33-36,39,42-50,52,54-62,64]
PASS       PASS             FAIL             PASS         PASS       PASS              PASS           tus1-p13-g[10,24,55],tus1-p16-g[21-22,37-38,53]
PASS       PASS             PASS             PASS         PASS       FAIL              PASS           tus1-p13-g[7,61],tus1-p16-g[7,23,41,51]
PASS       FAIL             PASS             PASS         PASS       PASS              PASS           tus1-p14-g50
PASS       PASS             PASS             PASS         PASS       PASS              FAIL           tus1-p14-g62,tus1-p16-g8
```
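If you want to post-process this summary, for example to list which nodes failed which check, the table is easy to parse. The sketch below is an illustration only: it assumes the captured stdout of `sdebug info` has exactly the whitespace-separated layout shown above (a header row ending in `NODES`, a dashed rule, then one row per result pattern), and `failing_nodes` is a hypothetical helper name, not part of `sdebug`.

```python
def failing_nodes(summary: str) -> dict[str, list[str]]:
    """Map each failing check name to the node expressions reported for it."""
    lines = [ln for ln in summary.splitlines() if ln.strip()]
    header = lines[0].split()      # check names followed by the NODES column
    checks = header[:-1]
    failures: dict[str, list[str]] = {}
    for row in lines[1:]:
        if set(row.strip()) == {"-"}:
            continue               # skip the dashed separator rule
        fields = row.split()
        if len(fields) != len(header):
            continue               # ignore rows that do not match the header
        statuses, nodes = fields[:-1], fields[-1]
        for check, status in zip(checks, statuses):
            if status == "FAIL":
                failures.setdefault(check, []).append(nodes)
    return failures


# Abbreviated copy of the sample output above (assumed format).
SAMPLE = """\
check-ecc  check-gpu-reset  check-link-flap  check-lldpd  check-nhc  check-rdma-links  check-ulimits  NODES
---------------------------------------------------------------------------------------------------------
PASS       PASS             FAIL             PASS         PASS       PASS              PASS           tus1-p13-g[10,24,55],tus1-p16-g[21-22,37-38,53]
PASS       FAIL             PASS             PASS         PASS       PASS              PASS           tus1-p14-g50
PASS       PASS             PASS             PASS         PASS       PASS              FAIL           tus1-p14-g62,tus1-p16-g8
"""

if __name__ == "__main__":
    for check, node_lists in failing_nodes(SAMPLE).items():
        print(check, "->", ", ".join(node_lists))
```

The node expressions are kept in their compressed hostlist form (for example `tus1-p16-g[21-22,37-38,53]`) so they can be passed on unchanged to tools that accept that notation.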

You can use the `--node` flag to dive deeper into a specific node's health check report. The report lists the status, return code, and timestamp of every check, followed by detailed logs for the checks that failed. Here's an example report from `tus1-p14-g62` when its locked-memory limit (`max_locked_memory`) was set incorrectly:

```
tensorwave@tensorwave.com@slurm-login-skip-8566547b9c-qsnbh:~$ sdebug info --node tus1-p14-g62

Health Check Report for Node: tus1-p14-g62
=======================================================
TEST              STATUS  RETCODE  TIMESTAMP
-------------------------------------------------------
check-nhc         PASS    0        2026-02-16T23:47:06Z
check-ecc         PASS    0        2026-02-16T23:47:07Z
check-rdma-links  PASS    0        2026-02-16T23:47:07Z
check-link-flap   PASS    0        2026-02-16T23:47:07Z
check-ulimits     FAIL    1        2026-02-16T23:47:07Z
check-gpu-reset   PASS    0        2026-02-16T23:47:07Z
check-lldpd       PASS    0        2026-02-16T23:47:07Z
=======================================================

Failed Test Logs:
================================================================================

Test: check-ulimits
Return Code: 1
Timestamp: 2026-02-16T23:47:07Z
Log:
--------------------------------------------------------------------------------
FAIL: ulimit check(s) failed:
  - max_locked_memory: expected 'unlimited', got '8192'
  max_locked_memory=8192


================================================================================
```
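To reproduce this kind of check by hand on a node, you can compare the process's locked-memory limit against the expected value. The following Python sketch is an approximation, not the actual `check-ulimits` implementation: it assumes the reported value is the soft `RLIMIT_MEMLOCK` limit expressed in KiB (the unit `ulimit -l` uses), and the function names are hypothetical.

```python
import resource


def describe_limit(soft_bytes: int) -> str:
    """Render a limit the way `ulimit -l` would: 'unlimited' or a KiB count."""
    if soft_bytes == resource.RLIM_INFINITY:
        return "unlimited"
    return str(soft_bytes // 1024)  # getrlimit reports bytes; convert to KiB


def check_max_locked_memory(expected: str = "unlimited") -> tuple[bool, str]:
    """Compare the current soft RLIMIT_MEMLOCK against the expected value."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    actual = describe_limit(soft)
    if actual == expected:
        return True, f"PASS: max_locked_memory={actual}"
    return False, f"FAIL: max_locked_memory: expected '{expected}', got '{actual}'"


if __name__ == "__main__":
    ok, message = check_max_locked_memory()
    print(message)
    raise SystemExit(0 if ok else 1)
```

Note that limits are inherited per process, so the value seen in an interactive login shell may differ from the one a job step receives.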


