Healthchecks and SDebug

Overview of TensorWave's health checks and the SDebug hardware debugging tool. SDebug is designed to make cluster admins' lives easier when it comes to managing (un)healthy nodes.

  • NHC and TWHC health checks are run before and after every job by Slurm prolog and epilog scripts.

  • Use sdebug info to get a summary of the most recent health checks

  • Use sdebug info --node <hostname> to get in-depth health check details for a specific node


Healthchecks

TensorWave Slurm regularly runs health checks to ensure nodes are operating at peak performance. We've deployed LBNL's Node Health Check (NHC), as well as our own in-house TensorWave Health Check (TWHC) suite. Checks are performed before and after any job is run using Slurm prolog and epilog scripts, and any nodes that don't pass muster are automatically drained to avoid disrupting operations. All health check results are saved to disk and can be investigated through the accompanying sdebug tool.
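
Conceptually, the prolog/epilog integration looks like the sketch below: run the checks, then drain the node if anything fails. The script path, layout, and drain reason are illustrative assumptions rather than TensorWave's actual scripts; scontrol update is the standard Slurm command for draining a node.

#!/bin/bash
# Hypothetical epilog hook (sketch only; paths and policy are assumptions).
# Run LBNL NHC after the job finishes; the in-house TWHC suite would be
# invoked the same way. If a check reports a fault, drain the node so
# Slurm stops scheduling new work onto it until an admin investigates.
if ! /usr/sbin/nhc; then
    scontrol update nodename="$(hostname -s)" state=DRAIN reason="epilog: NHC failure"
fi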

Investigating Node Health with SDebug

Get a high-level overview of cluster health with sdebug info. This summarizes the results of the most recent health checks.

user@slurm-login-skip-8566547b9c-qsnbh:~$ sdebug info
check-ecc  check-gpu-reset  check-link-flap  check-lldpd  check-nhc  check-rdma-links  check-ulimits  NODES
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
PASS       PASS             PASS             PASS         PASS       PASS              PASS           tus1-p13-g[1-6,8,9,11-23,25-26,28-30,32-43,45-50,52-54,56-59,62,64],tus1-p14-g[1-31,33-49,51-52,54-61,63-64],tus1-p16-g[1-6,9-17,19-20,24-30,33-36,39,42-50,52,54-62,64]
PASS       PASS             FAIL             PASS         PASS       PASS              PASS           tus1-p13-g[10,24,55],tus1-p16-g[21-22,37-38,53]
PASS       PASS             PASS             PASS         PASS       FAIL              PASS           tus1-p13-g[7,61],tus1-p16-g[7,23,41,51]
PASS       FAIL             PASS             PASS         PASS       PASS              PASS           tus1-p14-g50
PASS       PASS             PASS             PASS         PASS       PASS              FAIL           tus1-p14-g62,tus1-p16-g8
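
Because failing nodes are drained automatically, the summary above should line up with Slurm's own view of node state. Standard Slurm commands (not part of sdebug) can be used to cross-check, for example:

# List drained/down nodes along with the recorded reason
sinfo -R
# Inspect the full state of a single node from the summary, e.g. the check-gpu-reset failure above
scontrol show node tus1-p14-g50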

You can use the --node flag to dive deeper into a specific node's health check report, which includes detailed logs for any checks that failed. Here's an example report from tus1-p14-g62, captured when its ulimits were set incorrectly:

user@slurm-login-skip-8566547b9c-qsnbh:~$ sdebug info --node tus1-p14-g62

Health Check Report for Node: tus1-p14-g62
=======================================================
TEST              STATUS  RETCODE  TIMESTAMP
-------------------------------------------------------
check-nhc         PASS    0        2026-02-16T23:47:06Z
check-ecc         PASS    0        2026-02-16T23:47:07Z
check-rdma-links  PASS    0        2026-02-16T23:47:07Z
check-link-flap   PASS    0        2026-02-16T23:47:07Z
check-ulimits     FAIL    1        2026-02-16T23:47:07Z
check-gpu-reset   PASS    0        2026-02-16T23:47:07Z
check-lldpd       PASS    0        2026-02-16T23:47:07Z
=======================================================

Failed Test Logs:
================================================================================

Test: check-ulimits
Return Code: 1
Timestamp: 2026-02-16T23:47:07Z
Log:
--------------------------------------------------------------------------------
FAIL: ulimit check(s) failed:
  - max_locked_memory: expected 'unlimited', got '8192'
  max_locked_memory=8192


================================================================================
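
In this report, check-ulimits failed because max_locked_memory (memlock) came back as 8192 rather than the expected unlimited, a limit that GPU and RDMA workloads generally need raised. The sketch below shows one way an admin might confirm and correct this on the node; the limits.conf approach is an assumption (limits may instead be managed through systemd or container settings), not TensorWave's documented remediation.

# On the affected node: confirm the current max locked memory limit (reported in KB)
ulimit -l

# One common fix: allow unlimited locked memory via /etc/security/limits.conf
# (assumes limits are applied through pam_limits)
*  soft  memlock  unlimited
*  hard  memlock  unlimited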
